Thread: A tangled mess of threads...

  1. #1
    MutantJohn (Registered User)
    Join Date: Feb 2013 · Posts: 2,665

    A tangled mess of threads...

    So, it's been a while since I've come here for algorithm help, I realize.

    Alright, let's pretend we have some number (roughly >1000) of threads, and they're trying to write sorted pairs of ints.

    For the sake of example, assume 6 threads, each trying to write a pair (where to will be explained below) like this:

    (0, 1) - thread 0
    (0, 5) - thread 1
    (1, 2) - thread 2
    (2, 3) - thread 3
    (3, 4) - thread 4
    (4, 5) - thread 5

    This can easily be written to a global bank without conflicts, but that's not quite what I want. I want to reduce the set so that each remaining pair has completely unique entries.

    There is more than one valid reduction, but any output is fine so long as it follows the same rule.

    Here's one sample output:
    (0, 1)
    (2, 3)
    (4, 5)

    And here's another:
    (0, 5)
    (1, 2)
    (3, 4)

    Which output you get depends entirely on which pairs are declared unique first.

    This problem would be simple to brute-force on a CPU but on a GPU, it seems much more difficult. I've thought about modifying the write procedure to check for uniqueness but that sounds incredibly cumbersome.
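
    For reference, the kind of CPU brute force I mean is just a greedy sweep (a minimal sketch, with the example pairs hard-coded):
    Code:
    #include <stdio.h>

    // Greedy sweep: keep a pair only if neither of its integers has been
    // claimed by an earlier accepted pair.
    int main(void)
    {
        int pairs[6][2] = { {0,1}, {0,5}, {1,2}, {2,3}, {3,4}, {4,5} };
        int used[6] = {0}; // one flag per integer 0..5

        for (int i = 0; i < 6; ++i)
        {
            int a = pairs[i][0], b = pairs[i][1];
            if (used[a] || used[b])
                continue; // conflicts with an earlier pick
            used[a] = used[b] = 1;
            printf("%d, %d\n", a, b); // prints (0,1), (2,3), (4,5) here
        }
        return 0;
    }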

    How should I try to attack this problem?

  2. #2
    Mario F. ((?<!re)tired)
    Join Date: May 2006 · Location: Ireland · Posts: 8,446
    The main thread should manage and allocate the unique numbers, threading as necessary.
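
    A minimal sketch of that idea (all names here are just illustrative): the main thread filters the pairs for uniqueness up front and hands each accepted pair to a worker.
    Code:
    #include <thread>
    #include <utility>
    #include <vector>

    // Placeholder for whatever work an accepted pair actually needs.
    static void process_pair(std::pair<int, int> p) { (void)p; }

    // The main thread does the uniqueness filtering, then spawns one
    // worker per accepted pair.
    void dispatch(const std::vector<std::pair<int, int>> &pairs, int n)
    {
        std::vector<char> used(n, 0); // one flag per integer 0..n-1
        std::vector<std::thread> workers;

        for (const auto &p : pairs)
        {
            if (used[p.first] || used[p.second])
                continue; // shares an integer with an accepted pair
            used[p.first] = used[p.second] = 1;
            workers.emplace_back(process_pair, p);
        }
        for (auto &w : workers)
            w.join();
    }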

  3. #3
    Ticked and off
    Join Date: Oct 2011 · Location: La-la land · Posts: 1,728
    Let me know if I completely misunderstood your problem at hand.

    If the input set is always (0,1), (1,2), (2,3), ..., (N-2, N-1), (N-1, 0) for N threads, do the even-odd pairs ((0,1), (2,3), (4,5), and so on), say with an atomic counter: multiply its value by two to get the first number, add one for the second number, and check for wraparound. If you want to cover all the pairs, just not pairs sharing integers at the same time, do the even-odd ones first, then the odd-even ones ((1,2), (3,4), (5,6), and so on) on a second pass.

    In general, this is a simple graph coloring problem, where the graph itself is just a cycle of N vertices and no two adjacent vertices may share a color. In your case, you might only be interested in the most common color: for the simple N-node cycle, two colors suffice if N is even, and three if N is odd (with only one node getting the third color).

    In practice, floor(N/2) threads can work with unique adjacent integer pairs in [0, N-1] (assuming 0 and N-1 are adjacent too). If N is even, then you can cover the rest in a second pass. If N is odd, you'll need a third pass.
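
    For the cyclic input and even N, a rough CUDA sketch of the two-pass, atomic-counter idea (kernel name and launch details are just illustrative, not tested):
    Code:
    // pass 0 claims the even-odd pairs (0,1), (2,3), ...;
    // pass 1 claims the odd-even ones (1,2), (3,4), ..., (N-1, 0).
    // Assumes N is even; the odd case needs the extra pass described above.
    __global__ void claim_pairs(int N, int pass, int *counter, int2 *out)
    {
        int k = atomicAdd(counter, 1);    // this thread's slot in the pass
        int first = 2 * k + pass;
        if (first >= N)
            return;                       // no pair left for this thread
        int second = (first + 1) % N;     // wraparound: N-1 and 0 are adjacent
        out[k] = make_int2(first, second);
    }
    The counter would be reset to zero between passes, and out needs at least N/2 slots per pass.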

  4. #4
    MutantJohn (Registered User)
    Join Date: Feb 2013 · Posts: 2,665
    The input will not always be so sequential. I should have specified that the pairs can be made of random integers; each pair is sorted, or rather, I can sort them should I so choose. Also assume that the pairs themselves have no particular ordering.

    Sorting is on the table. If sorting the set allows for an efficient algorithm then that'd be awesome too.

    I'll look into this graph coloring algorithm though. I've never seen it before. Seems interesting, to say the least.

  5. #5
    Ticked and off
    Join Date: Oct 2011 · Location: La-la land · Posts: 1,728
    If you have a source of integer pairs i,j, where i,j ∈ ℕ, 0 ≤ i,j < N, and you are only interested in pairs where neither i nor j has participated in a pair yet, a simple way to decide whether to include or exclude a pair is to use an N-bit bitmap. Start with all zeros, and test each pair: if either of bits i or j is set, then discard the pair; otherwise set the bits and work on the pair.

    Assuming there are many candidate pairs (say, 50% or more of pairs are rejected), I'd first use a lockless bit test (using atomic built-ins) to test if either has already been used in a pair yet. If not, I'd grab a mutex, re-check the bits, and set them if both were zero, finally releasing the mutex. (I do not immediately see a way to do this safely completely locklessly; this requires a transaction, since two individual operations are done, either of which may cause a rejection.)
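
    A minimal host-side sketch of that double-check pattern (using one byte per integer instead of a packed bitmap; the type and member names are made up for illustration):
    Code:
    #include <atomic>
    #include <memory>
    #include <mutex>

    struct PairClaimer
    {
        std::unique_ptr<std::atomic<unsigned char>[]> used; // one flag per integer
        std::mutex m;

        explicit PairClaimer(int n) : used(new std::atomic<unsigned char>[n])
        {
            for (int i = 0; i < n; ++i)
                used[i].store(0);
        }

        // Returns true if pair (i, j) was claimed, false if either integer
        // already belongs to an accepted pair.
        bool try_claim(int i, int j)
        {
            // Cheap lock-free rejection first.
            if (used[i].load() || used[j].load())
                return false;
            // Setting two separate flags is a transaction, so take the mutex
            // and re-check before committing.
            std::lock_guard<std::mutex> lock(m);
            if (used[i].load() || used[j].load())
                return false;
            used[i].store(1);
            used[j].store(1);
            return true;
        }
    };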

    If N is sufficiently large, the bitmap approach is infeasible. In that case, I'd sort the triplets i,j,0 and j,i,1 (or the pairs i,j and j,i if the pair order does not matter), removing all duplicates of the first integer. Duplicate the leftover array with each tuple reversed (i,j → j,i, or i,j,k → j,i,1−k), and sort and remove duplicates as before. A merging pass (advancing along both arrays, skipping duplicate values in both) should be enough to get rid of the remaining duplicates. This yields an O(pairs log pairs) algorithm. For very large inputs (hundreds of millions of pairs), a radix sort is faster than comparison sorts and handles this in O(pairs) time.

    Note that although Mario F. suggested having the main thread allocate the work per thread, I prefer the approach where each thread grabs the next available task/job/work itself, using lockless structures if possible. While this thread-pool-like approach is sometimes more complicated to implement, in my experience it results in more efficient CPU use than a master-thread based approach; it's worth the effort in high performance computing. Feel free to disagree; this is only my own opinion based on my own limited experience.

  6. #6
    Codeplug (Registered User)
    Join Date: Mar 2003 · Posts: 4,981
    >> I do not immediately see a way to do this safely completely locklessly;
    The fast path of a mutex transaction is typically a couple of "interlocked" operations already (with normal stores for the bits in between). An even better fast path is possible if both bits live in the same memory location, since a single atomic operation can then test and set them together. The problems are caused by contention, which prevents the fast path from being taken.

    >> let's pretend we have some number ( roughly >1000) of threads
    I would instead pretend that I have 64 cores (nice). Then I would ask the OS how many cores I have access to and create that many worker threads (minus however many IO or support threads). The work would then be split up as evenly as possible among the worker threads.
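
    In other words, something along these lines (a minimal sketch; the strided split is just one option):
    Code:
    #include <thread>
    #include <vector>

    // Size the pool from the hardware instead of spawning thousands of threads.
    void run_workers(int total_jobs)
    {
        unsigned workers = std::thread::hardware_concurrency();
        if (workers == 0)
            workers = 1; // the query is allowed to report 0
        // (subtract any IO/support threads here if you have them)

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < workers; ++t)
        {
            pool.emplace_back([t, workers, total_jobs]
            {
                // Strided split: worker t handles jobs t, t+workers, ...
                for (int job = (int)t; job < total_jobs; job += (int)workers)
                {
                    // ... process job ...
                }
            });
        }
        for (auto &th : pool)
            th.join();
    }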

    The more threads there are, the higher the potential for contention on synchronization objects. Try to minimize contention when designing.

    gg

  7. #7
    MutantJohn (Registered User)
    Join Date: Feb 2013 · Posts: 2,665
    Alright, I kind of brute-forced it. I'm not sure if it came across in my OP (I did try to keep it hidden because I just wanted a theoretical discussion of the algorithm), but this code is for a GPU, so it's alright to use a butt-load of threads.

    I used a fairly niche nVidia hardware feature (launching a kernel from inside another kernel), but I hope the code translates into what I eventually settled on doing:
    Code:
    #include <stdio.h>
    #include <assert.h>
    #include <limits.h> // for INT_MAX, used as the "no anchor found" sentinel
    
    
    enum color { white, blue, red };
    
    
    struct pair
    {
        int first, second;
    
    
        __host__ __device__
        pair(int a, int b) : first(a), second(b) { };
    };
    
    
    struct args
    {
        pair *pr;
        int *color; // color table
        bool *tbl; // hash table
        int *anchor;
    };
    
    
    __global__
    void child(pair *pr, int *color, bool *tbl, int *anchor)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
    
    
        if (i < 6)
        {    
            if (color[i] == blue || color[i] == red)
                return;
    
    
            if (tbl[pr[i].first] || tbl[pr[i].second]) // check hash table
            {
                //printf("%d, %d\n", pr[i].first, pr[i].second);
                color[i] = blue;
                return;
            }
    
    
            atomicMin(anchor, i);
        }
    }
    
    // Parent GPU routine
    
    
    __global__
    void parent(args data)
    {
        pair *pr = data.pr;
        int *color = data.color;
        bool *tbl = data.tbl;
        int *anchor = data.anchor;
    
    
        do
        {
           // set hash table values
    
    
            tbl[pr[*anchor].first] = 1; 
            tbl[pr[*anchor].second] = 1;
            color[*anchor] = red; // red = unique pair
            *anchor = INT_MAX; // for atomicMin()
    
            // launch the child routine
            // this is only possible on certain architectures
    
    
            child<<<1, 6>>>(pr, color, tbl, anchor);
    
            cudaDeviceSynchronize();
    
    
            //printf("New anchor %d\n", *anchor);
        } while (*anchor != INT_MAX);
    
    
        for (int i = 0; i < 6; ++i)
            if (color[i] == red)
                printf("%d, %d\n", pr[i].first, pr[i].second);
    }
    
    
    int main(void)
    {
        pair *data = 0;
        int *color = 0;
        bool *tbl = 0;
        int *anchor = 0;
    
    
        cudaMallocManaged(&data, 6 * sizeof(*data));
        cudaMallocManaged(&color, 6 * sizeof(*color));
        cudaMallocManaged(&tbl, 6 * sizeof(*tbl));
        cudaMallocManaged(&anchor, sizeof(*anchor));
    
        // Build pair data by hand (super lazy, direct method)
    
    
        (data + 0)->first  = 0;
        (data + 0)->second = 1;
    
    
        (data + 1)->first  = 0;
        (data + 1)->second = 5;
    
    
        (data + 2)->first  = 1;
        (data + 2)->second = 2;
    
    
        (data + 3)->first  = 2;
        (data + 3)->second = 3;
    
    
        (data + 4)->first  = 3;
        (data + 4)->second = 4;
    
    
        (data + 5)->first  = 4;
        (data + 5)->second = 5;
    
        // everything starts out white (default value)
    
    
        for (int i = 0; i < 6; ++i)
            color[i] = white;
    
    // the hash table is initially all false (empty) and the anchor
    // is initialized to 0
    
    
        cudaMemset(tbl, 0, 6 * sizeof(*tbl));
        *anchor = 0;
    
        // build argument data structure
    
    
        args dta;
        dta.pr = data;
        dta.tbl = tbl;
        dta.color = color;
        dta.anchor = anchor;
    
    // this calls the GPU code. We are launching it with 1 thread.
    // in CUDA, threads are launched in blocks of threads.
    // here we use 1 block with 1 thread
    
    
        parent<<<1, 1>>>(dta);
        cudaDeviceSynchronize(); // wait for the GPU to finish
    
        // clean-up
    
    
        cudaFree(anchor);
        cudaFree(tbl);
        cudaFree(color);
        cudaFree(data);
    
    
        return 0;
    }
    Basically, we color all accepted (unique) pairs red and all conflicting ones blue. A pair becomes red when the anchor lands on it: the anchor is the minimum index among the pairs that are still white (neither red nor blue).
