Thread: A tangled mess of threads...

  1. #1
    MutantJohn (Registered User)
    Join Date: Feb 2013 · Posts: 2,665

    A tangled mess of threads...

    So, it's been a while since I've come here for algorithm help, I realize.

    Alright, let's pretend we have some number (roughly >1000) of threads, and they're trying to write sorted pairs of ints.

    For the sake of example, assume 6 threads, each trying to write a pair (where to will be explained below) like this:

    (0, 1) - thread 0
    (0, 5) - thread 1
    (1, 2) - thread 2
    (2, 3) - thread 3
    (3, 4) - thread 4
    (4, 5) - thread 5

    This can easily be written to a global bank without conflicts, but that's not quite what I want. I want to reduce the set so that each remaining pair has completely unique entries.

    There is more than one valid reduction, but any output is fine so long as it follows the same rule.

    Here's one sample output:
    (0, 1)
    (2, 3)
    (4, 5)

    And here's another:
    (0, 5)
    (1, 2)
    (3, 4)

    Which output you get depends entirely on which pairs are declared unique first.

    This problem would be simple to brute-force on a CPU but on a GPU, it seems much more difficult. I've thought about modifying the write procedure to check for uniqueness but that sounds incredibly cumbersome.
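
    For reference, the kind of CPU brute force I mean is just a greedy sweep (a minimal sketch, with the example pairs hard-coded):
    Code:
    #include <stdio.h>

    // Greedy sweep: keep a pair only if neither of its integers has been
    // claimed by an earlier accepted pair.
    int main(void)
    {
        int pairs[6][2] = { {0,1}, {0,5}, {1,2}, {2,3}, {3,4}, {4,5} };
        int used[6] = {0}; // one flag per integer 0..5

        for (int i = 0; i < 6; ++i)
        {
            int a = pairs[i][0], b = pairs[i][1];
            if (used[a] || used[b])
                continue; // conflicts with an earlier pick
            used[a] = used[b] = 1;
            printf("%d, %d\n", a, b); // prints (0,1), (2,3), (4,5) here
        }
        return 0;
    }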

    How should I try to attack this problem?

  2. #2
    Mario F. ((?<!re)tired)
    Join Date: May 2006 · Location: Ireland · Posts: 8,446
    The main thread should manage and allocate the unique numbers, threading as necessary.
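
    A minimal sketch of that idea (all names here are just illustrative): the main thread filters the pairs for uniqueness up front and hands each accepted pair to a worker.
    Code:
    #include <thread>
    #include <utility>
    #include <vector>

    // Placeholder for whatever work an accepted pair actually needs.
    static void process_pair(std::pair<int, int> p) { (void)p; }

    // The main thread does the uniqueness filtering, then spawns one
    // worker per accepted pair.
    void dispatch(const std::vector<std::pair<int, int>> &pairs, int n)
    {
        std::vector<char> used(n, 0); // one flag per integer 0..n-1
        std::vector<std::thread> workers;

        for (const auto &p : pairs)
        {
            if (used[p.first] || used[p.second])
                continue; // shares an integer with an accepted pair
            used[p.first] = used[p.second] = 1;
            workers.emplace_back(process_pair, p);
        }
        for (auto &w : workers)
            w.join();
    }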

  3. #3
    Ticked and off
    Join Date: Oct 2011 · Location: La-la land · Posts: 1,728
    Let me know if I completely misunderstood your problem at hand.

    If the input set is always (0,1), (1,2), (2,3), ..., (N-2, N-1), (N-1, 0) for N threads, do the even-odd pairs ((0,1), (2,3), (4,5), and so on), say with an atomic counter: multiply its value by two to get the first number, add one for the second number, and check for wraparound. If you want to cover all the pairs, just not pairs sharing integers at the same time, do the even-odd ones first, then the odd-even ones ((1,2), (3,4), (5,6), and so on) on a second pass.

    In general, this is a simple graph coloring problem, where the graph itself is just a cycle of N vertices and no two adjacent vertices may share a color. In your case, you might only be interested in the most common color: for the simple N-node cycle, two colors suffice if N is even, and three if N is odd (with only one node getting the third color).

    In practice, floor(N/2) threads can work with unique adjacent integer pairs in [0, N-1] (assuming 0 and N-1 are adjacent too). If N is even, then you can cover the rest in a second pass. If N is odd, you'll need a third pass.
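
    For the cyclic input and even N, a rough CUDA sketch of the two-pass, atomic-counter idea (kernel name and launch details are just illustrative, not tested):
    Code:
    // pass 0 claims the even-odd pairs (0,1), (2,3), ...;
    // pass 1 claims the odd-even ones (1,2), (3,4), ..., (N-1, 0).
    // Assumes N is even; the odd case needs the extra pass described above.
    __global__ void claim_pairs(int N, int pass, int *counter, int2 *out)
    {
        int k = atomicAdd(counter, 1);    // this thread's slot in the pass
        int first = 2 * k + pass;
        if (first >= N)
            return;                       // no pair left for this thread
        int second = (first + 1) % N;     // wraparound: N-1 and 0 are adjacent
        out[k] = make_int2(first, second);
    }
    The counter would be reset to zero between passes, and out needs at least N/2 slots per pass.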

  4. #4
    MutantJohn (Registered User)
    Join Date: Feb 2013 · Posts: 2,665
    The input will not always be so sequential. I should have specified that the pairs can be made of random integers; each pair is sorted, or rather, I can sort them should I so choose. Also assume that the pairs themselves have no particular ordering.

    Sorting is on the table. If sorting the set allows for an efficient algorithm then that'd be awesome too.

    I'll look into this graph coloring algorithm though. I've never seen it before. Seems interesting, to say the least.

  5. #5
    Ticked and off
    Join Date: Oct 2011 · Location: La-la land · Posts: 1,728
    If you have a source of integer pairs i,j, where i,j ∈ ℕ, 0 ≤ i,j < N, and you are only interested in pairs where neither i nor j has participated in a pair yet, a simple way to decide whether to include or exclude a pair is to use an N-bit bitmap. Start with all zeros, and test each pair: if either of bits i or j is set, then discard the pair; otherwise set the bits and work on the pair.

    Assuming there are many candidate pairs (say, 50% or more of pairs are rejected), I'd first use a lockless bit test (using atomic built-ins) to test if either has already been used in a pair yet. If not, I'd grab a mutex, re-check the bits, and set them if both were zero, finally releasing the mutex. (I do not immediately see a way to do this safely completely locklessly; this requires a transaction, since two individual operations are done, either of which may cause a rejection.)
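
    A minimal host-side sketch of that double-check pattern (using one byte per integer instead of a packed bitmap; the type and member names are made up for illustration):
    Code:
    #include <atomic>
    #include <memory>
    #include <mutex>

    struct PairClaimer
    {
        std::unique_ptr<std::atomic<unsigned char>[]> used; // one flag per integer
        std::mutex m;

        explicit PairClaimer(int n) : used(new std::atomic<unsigned char>[n])
        {
            for (int i = 0; i < n; ++i)
                used[i].store(0);
        }

        // Returns true if pair (i, j) was claimed, false if either integer
        // already belongs to an accepted pair.
        bool try_claim(int i, int j)
        {
            // Cheap lock-free rejection first.
            if (used[i].load() || used[j].load())
                return false;
            // Setting two separate flags is a transaction, so take the mutex
            // and re-check before committing.
            std::lock_guard<std::mutex> lock(m);
            if (used[i].load() || used[j].load())
                return false;
            used[i].store(1);
            used[j].store(1);
            return true;
        }
    };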

    If N is sufficiently large, the bitmap approach is infeasible. In that case, I'd sort the triplets i,j,0 and j,i,1 (or the pairs i,j and j,i if the pair order does not matter), removing all duplicates of the first integer. Duplicate the leftover array with each tuple reversed (i,j → j,i, or i,j,k → j,i,1−k), and sort and remove duplicates as before. A merging pass (advancing along both arrays, skipping duplicate values in both) should be enough to get rid of the remaining duplicates. This yields an O(pairs log pairs) algorithm. For very large inputs (hundreds of millions of pairs), a radix sort is faster than comparison sorts and handles this in O(pairs) time.

    Note that although Mario F. suggested having the main thread allocate the work per thread, I prefer the approach where each thread grabs the next available task/job/work itself, using lockless structures if possible. While this thread-pool-like approach is sometimes more complicated to implement, in my experience it results in more efficient CPU use than a master-thread based approach; it's worth the effort in high performance computing. Feel free to disagree; this is only my own opinion based on my own limited experience.

  6. #6
    Codeplug (Registered User)
    Join Date: Mar 2003 · Posts: 4,981
    >> I do not immediately see a way to do this safely completely locklessly;
    The fast path of a mutex transaction is typically a couple of "interlocked" operations already (with normal stores for the bits in between). An even better fast path is possible if both bits live in the same memory location, since a single atomic operation can then test and set them together. The problems are caused by contention, which prevents the fast path from being taken.

    >> let's pretend we have some number ( roughly >1000) of threads
    I would instead pretend that I have 64 cores (nice). Then I would ask the OS how many cores I have access to and create that many worker threads (minus however many IO or support threads). The work would then be split up as evenly as possible among the worker threads.
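
    In other words, something along these lines (a minimal sketch; the strided split is just one option):
    Code:
    #include <thread>
    #include <vector>

    // Size the pool from the hardware instead of spawning thousands of threads.
    void run_workers(int total_jobs)
    {
        unsigned workers = std::thread::hardware_concurrency();
        if (workers == 0)
            workers = 1; // the query is allowed to report 0
        // (subtract any IO/support threads here if you have them)

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < workers; ++t)
        {
            pool.emplace_back([t, workers, total_jobs]
            {
                // Strided split: worker t handles jobs t, t+workers, ...
                for (int job = (int)t; job < total_jobs; job += (int)workers)
                {
                    // ... process job ...
                }
            });
        }
        for (auto &th : pool)
            th.join();
    }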

    The more threads there are, the higher the potential for contention on synchronization objects. Try to minimize contention when designing.

    gg

  7. #7
    MutantJohn (Registered User)
    Join Date: Feb 2013 · Posts: 2,665
    Alright, I kind of brute-forced it. I'm not sure if it came across in my OP (I did try to keep it hidden because I just wanted a theoretical discussion of the algorithm), but this code is for a GPU, so it's alright to use a butt-load of threads.

    I used a fairly niche nVidia hardware feature (launching a kernel from inside another kernel), but I hope the code translates into what I eventually settled on doing:
    Code:
    #include <stdio.h>
    #include <assert.h>
    #include <limits.h> // for INT_MAX, used as the "no anchor found" sentinel
    
    
    enum color { white, blue, red };
    
    
    struct pair
    {
        int first, second;
    
    
        __host__ __device__
        pair(int a, int b) : first(a), second(b) { };
    };
    
    
    struct args
    {
        pair *pr;
        int *color; // color table
        bool *tbl; // hash table
        int *anchor;
    };
    
    
    __global__
    void child(pair *pr, int *color, bool *tbl, int *anchor)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
    
    
        if (i < 6)
        {    
            if (color[i] == blue || color[i] == red)
                return;
    
    
            if (tbl[pr[i].first] || tbl[pr[i].second]) // check hash table
            {
                //printf("%d, %d\n", pr[i].first, pr[i].second);
                color[i] = blue;
                return;
            }
    
    
            atomicMin(anchor, i);
        }
    }
    
    // Parent GPU routine
    
    
    __global__
    void parent(args data)
    {
        pair *pr = data.pr;
        int *color = data.color;
        bool *tbl = data.tbl;
        int *anchor = data.anchor;
    
    
        do
        {
           // set hash table values
    
    
            tbl[pr[*anchor].first] = 1; 
            tbl[pr[*anchor].second] = 1;
            color[*anchor] = red; // red = unique pair
            *anchor = INT_MAX; // for atomicMin()
    
            // launch the child routine
            // this is only possible on certain architectures
    
    
            child<<<1, 6>>>(pr, color, tbl, anchor);
    
            cudaDeviceSynchronize();
    
    
            //printf("New anchor %d\n", *anchor);
        } while (*anchor != INT_MAX);
    
    
        for (int i = 0; i < 6; ++i)
            if (color[i] == red)
                printf("%d, %d\n", pr[i].first, pr[i].second);
    }
    
    
    int main(void)
    {
        pair *data = 0;
        int *color = 0;
        bool *tbl = 0;
        int *anchor = 0;
    
    
        cudaMallocManaged(&data, 6 * sizeof(*data));
        cudaMallocManaged(&color, 6 * sizeof(*color));
        cudaMallocManaged(&tbl, 6 * sizeof(*tbl));
        cudaMallocManaged(&anchor, sizeof(*anchor));
    
        // Build pair data by hand (super lazy, direct method)
    
    
        (data + 0)->first  = 0;
        (data + 0)->second = 1;
    
    
        (data + 1)->first  = 0;
        (data + 1)->second = 5;
    
    
        (data + 2)->first  = 1;
        (data + 2)->second = 2;
    
    
        (data + 3)->first  = 2;
        (data + 3)->second = 3;
    
    
        (data + 4)->first  = 3;
        (data + 4)->second = 4;
    
    
        (data + 5)->first  = 4;
        (data + 5)->second = 5;
    
        // everything starts out white (default value)
    
    
        for (int i = 0; i < 6; ++i)
            color[i] = white;
    
    // the hash table is initially all false (empty) and the anchor
    // is initialized to 0
    
    
        cudaMemset(tbl, 0, 6 * sizeof(*tbl));
        *anchor = 0;
    
        // build argument data structure
    
    
        args dta;
        dta.pr = data;
        dta.tbl = tbl;
        dta.color = color;
        dta.anchor = anchor;
    
    // this calls the GPU code. We are launching it with 1 thread.
    // in CUDA, threads are launched in blocks of threads.
    // here we use 1 block with 1 thread
    
    
        parent<<<1, 1>>>(dta);
        cudaDeviceSynchronize(); // wait for the GPU to finish
    
        // clean-up
    
    
        cudaFree(anchor);
        cudaFree(tbl);
        cudaFree(color);
        cudaFree(data);
    
    
        return 0;
    }
    Basically, we color all accepted (unique) pairs red and all conflicting ones blue. A pair becomes red when the anchor lands on it: the anchor is the minimum index among the pairs that are still white (neither red nor blue).
