Thread: Binary Radix Sort

  1. #1
    Registered User
    Join Date
    Mar 2020
    Posts
    11

    Binary Radix Sort

    Hi,

    I need some help with implementing a Radix Sort Algorithm. I found a related thread posted by @rcgldr (don't know how to @user) that although really useful it is now closed.
    My question is if the code presented there is the best available alternative or if there is a C or C++ library I should use instead.

    I also found this implementation on Github but I cannot make it run correctly.

    Any information would be really appreciated.

    Thanks in advance!

  2. #2
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    The code in the prior thread could use some minor tweaks to make it a bit faster, but it's at a point of diminishing returns. As pointed out in the prior thread, if sorting 64 bit unsigned integers, you could reduce from 8 passes to 6 passes using (11,11,11,11,10,10) bit field sizes instead of (8,8,8,8,8,8,8,8) bit field sizes, but in my testing, I didn't see much improvement in time.

    Another option is for the first radix pass to be done on the most significant bits, creating 256 (or 1024 or 2048) bins, on a array much larger than the total cache. The ideas is that each of the bins will fit at least within the L3 cache (8 MB on my 3770k) processor, but it only reduced time by 5%. Example code below.

    There is an issue with benchmarks with the 3770K (3.5 ghz) processor. It's really sensitive to alignment. When doing assembly code, I've found that adding a few aligns to 16 byte boundary, sometimes an even boundary, sometimes and odd boundary, makes a significant difference in run time. In a question I posted at another forum, the alignment of the calling code and the alignment of a tight loop, created a range of time from 1.465 seconds to 2.000 seconds, a 36.5% increase in time due to alignment done via nops between functions:

    performance - Indexed branch overhead on X86 64 bit mode - Stack Overflow

    In the C programming prior thread about radix sort, there are some posts about random number generators, such as xor + shift sequences versus multiply + add sequences, but multiply on processors even back then only take 3 cycles, and random number generators based on multiply + add are fairly common.

    Linear congruential generator - Wikipedia

    Example code for sorting 36 million psuedo random 32 bit integers, a hybrid radix sort that sorts by the most significant 8 bits to create 256 bins, followed by 256 (8,8,8) bit field radix sorts to sort each bin. Even though each of the 256 bins is 562500 bytes in size, easily fitting within L3 cache, it's only about a 5% improvement over a (8,8,8,8) bit field radix sort (from 0.37 seconds down to 0.35 seconds). The conditional QP enables usage of Windows QueryPerformanceCounter for timing.

    Code:
    //      rsort32ml.cpp
    #include <fstream>
    #include <iostream>
    #include <cstdlib>
    #include <ctime>
    
    #define COUNT (6000*6000)            // number of values to sort
    
    #define QP 1        // if != 0, use queryperformance for timer
    
    #if QP
    #include <math.h>
    #include <windows.h>
    #pragma comment(lib, "winmm.lib")
    typedef LARGE_INTEGER LI64;
    #endif
    
    #if QP
    static LI64     liQPFrequency;  // cpu counter values
    static LI64     liStartTime;
    static LI64     liStopTime;
    static double   dQPFrequency;
    static double   dStartTime;
    static double   dStopTime;
    static double   dElapsedTime;
    #else
    clock_t ctTimeStart;            // clock values
    clock_t ctTimeStop;
    #endif
    
    typedef unsigned int uint32_t;
    
    // sort a bin by 3 least significant bytes
    void RadixSort3(uint32_t * a, uint32_t *b, size_t count)
    {
    size_t mIndex[3][256] = {0};            // count / matrix
    size_t i,j,m,n;
    uint32_t u;
        if(count == 0)
            return;
        for(i = 0; i < count; i++){         // generate histograms
            u = a[i];
            for(j = 0; j < 3; j++){
                mIndex[j][(size_t)(u & 0xff)]++;
                u >>= 8;
            }       
        }
        for(j = 0; j < 3; j++){             // convert to indices
            m = 0;
            for(i = 0; i < 256; i++){
                n = mIndex[j][i];
                mIndex[j][i] = m;
                m += n;
            }       
        }
        for(j = 0; j < 3; j++){             // radix sort
            for(i = 0; i < count; i++){     //  sort by current lsb
                u = a[i];
                m = (size_t)(u>>(j<<3))&0xff;
                b[mIndex[j][m]++] = u;
            }
            std::swap(a, b);                //  swap ptrs
        }
    }
    
    // split array into 256 bins according to most significant byte
    void RadixSort(uint32_t * a, uint32_t*b, size_t count)
    {
    size_t aIndex[260] = {0};               // count / array
    size_t i;
        for(i = 0; i < count; i++)          // generate histogram
            aIndex[1+((size_t)(a[i] >> 24))]++;
        for(i = 2; i < 257; i++)            // convert to indices
            aIndex[i] += aIndex[i-1];
        for(i = 0; i < count; i++)          //  sort by msb
            b[aIndex[a[i]>>24]++] = a[i];
        for(i = 256; i; i--)                // restore aIndex
            aIndex[i] = aIndex[i-1];
        aIndex[0] = 0;
        for(i = 0; i < 256; i++)            // radix sort the 256 bins
            RadixSort3(&b[aIndex[i]], &a[aIndex[i]], aIndex[i+1]-aIndex[i]);
    }
    
    int main( )
    {
        uint32_t * a = new uint32_t [COUNT];    // allocate data array
        uint32_t * b = new uint32_t [COUNT];    // allocate temp array
        size_t i;
        uint32_t u;
    //      generate data
        for(i = 0; i < COUNT; i++){
            u  = (((uint32_t)((rand()>>4) & 0xff))<< 0);
            u += (((uint32_t)((rand()>>4) & 0xff))<< 8);
            u += (((uint32_t)((rand()>>4) & 0xff))<<16);
            u += (((uint32_t)((rand()>>4) & 0xff))<<24);
            a[i] = u;
        }
    
    #if QP
        QueryPerformanceFrequency(&liQPFrequency);
        dQPFrequency = (double)liQPFrequency.QuadPart;
        timeBeginPeriod(1);                     // set ticker to 1000 hz
        Sleep(128);                             // wait for it to settle
        QueryPerformanceCounter(&liStartTime);
    #else
        ctTimeStart = clock();
    #endif
    
    //      sort data
        RadixSort(a, b, COUNT);
    
    #if QP
        QueryPerformanceCounter(&liStopTime);
        dStartTime = (double)liStartTime.QuadPart;
        dStopTime  = (double)liStopTime.QuadPart;
        dElapsedTime = (dStopTime - dStartTime) / dQPFrequency;
        timeEndPeriod(1);                       // restore ticker to default
        std::cout << "# of seconds " << dElapsedTime << std::endl;
    #else
        ctTimeStop = clock();
        std::cout << "# of ticks " << ctTimeStop - ctTimeStart << std::endl;
    #endif
    
        for(i = 1; i < COUNT; i++){
            if(a[i] < a[i-1]){
                std::cout << "failed" << std::endl;
                break;}}
        if(i == COUNT)
            std::cout << "passed" << std::endl;
    
        delete[] b;
        delete[] a;
        return 0;
    }
    Last edited by rcgldr; 04-01-2020 at 01:19 PM.

  3. #3
    Registered User
    Join Date
    Mar 2020
    Posts
    11
    First of all thanks for sharing your code and knowledge, it's been a huge help for me.

    I ran some tests using 107 unsigned 32 bit integers and your serial code outperforms bash sort parallel implementation by almost an order of magnitude. It is quiet impressive.
    I modified your original code so that it can sort records of variable size, X bytes, using only Y bytes as a key (Y<X). This is mainly to handle tables with multiple columns, where you sort with respect to only one.
    The code is based on your radix sort algorithm from the prior thread.

    Code:
    unsigned char * RadixSort(unsigned char * pData, std::size_t count, std::size_t record_size, std::size_t key_size, std::size_t key_offset = 0)
    {
        typedef uint8_t k_t [key_size];
        typedef uint8_t record_t [record_size];
    
    
        size_t mIndex[key_size][256] = {0};            /* index matrix */
        unsigned char * pTemp = new unsigned char [count*sizeof(record_t)];
        unsigned char * pDst, * pSrc;
        size_t i,j,m,n;
        k_t u;
        record_t v;
    
        for(i = 0; i < count; i++)                  /* generate histograms */
        {         
            std::memcpy(&u, (pData + record_size*i + key_offset), sizeof(u));
            
            for(j = 0; j < key_size; j++)
            {
                mIndex[j][(size_t)(u[j])]++;
                // mIndex[j][(size_t)(u & 0xff)]++;
                // u >>= 8;
            }       
        }
        for(j = 0; j < key_size; j++)             /* convert to indices */
        {
            n = 0;
            for(i = 0; i < 256; i++)
            {
                m = mIndex[j][i];
                mIndex[j][i] = n;
                n += m;
            }       
        }
    
        pDst = pTemp;                       /* radix sort */
        pSrc = pData;
        for(j = 0; j < key_size; j++)
        {
            for(i = 0; i < count; i++)
            {
                std::memcpy(&u, (pSrc + record_size*i + key_offset), sizeof(u));
                std::memcpy(&v, (pSrc + record_size*i), sizeof(v));
                m = (size_t)(u[j]);
                // m = (size_t)(u >> (j<<3)) & 0xff;
                std::memcpy(pDst + record_size*mIndex[j][m]++, &v, sizeof(v));
            }
            std::swap(pSrc, pDst);
            
        }
        return(pSrc);
    }
    Any criticism/optimization to my code is more than welcome as I am far from being a good C/C++ programmer.
    For starters, in each radix pass I move around the whole record which maybe is not so efficient.

    the alignment of the calling code and the alignment of a tight loop, created a range of time from 1.465 seconds to 2.000 seconds, a 36.5% increase in time due to alignment done via nops between functions
    This is also surprising for me, I didn't know the alignment of the code/loop could create such a difference. I am also a bit new to this kind of optimizations, do you know a good reference about this matter?

    One last question, a parallel implementation of the code seems more or less straight forward. Do you think it is worth the effort?

    Sorry for the long reply. In any case with the information you provide me I am more than happy.
    Last edited by Fede; 04-01-2020 at 09:00 PM.

  4. #4
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Fede View Post
    A parallel implementation of the code seems more or less straight forward.
    I modded a radix sort for 36 million psuedo random 64 bit integers, using (8,8,8,8,8,8,8,8) bit fields (8 passes), to use 4 threads corresponding to the 4 cores on my processor (Intel 3770K). The initial pass sorts by the most significant 8 bits, creating 256 logical bins. Then the 256 bins are sorted 4 at a time in parallel (multi-threading). The 4 thread version is 2.5 times as fast as the single thread version. I used Windows specific semaphores and threads for this. Probably the main advantage of the 4 threaded radix sort is that is uses all 4 core's L1 and L2 cache (the L3 cache and main memory are shared).

  5. #5
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Example C++ code for Windows. Converting the code to C is simple, but more effort would be needed to convert to Posix semaphores and threads. This example takes about 0.4 seconds to sort 36 million pseudo random 64 bit integers, versus about 1.0 seconds for a non-threaded version.

    Code:
    #include <iostream>
    #include <windows.h>
    
    #define COUNT (6000*6000)            // number of values to sort
    
    #define QP 1        // if != 0, use queryperformance for timer
    
    #if QP
    #include <math.h>
    #pragma comment(lib, "winmm.lib")
    typedef LARGE_INTEGER LI64;
    #else
    #include <ctime>
    #endif
    
    #if QP
    static LI64     liQPFrequency;  // cpu counter values
    static LI64     liStartTime;
    static LI64     liStopTime;
    static double   dQPFrequency;
    static double   dStartTime;
    static double   dStopTime;
    static double   dElapsedTime;
    #else
    clock_t ctTimeStart;            // clock values
    clock_t ctTimeStop;
    #endif
    
    typedef unsigned       int uint32_t;
    typedef unsigned long long uint64_t;
    
    static HANDLE hs0;                      // semaphore handles
    static HANDLE hs1;
    static HANDLE hs2;
    static HANDLE hs3;
    
    static HANDLE ht1;                      // thread handles
    static HANDLE ht2;
    static HANDLE ht3;
    
    static uint64_t *pa, *pb;               // ptrs to arrays
    static uint32_t aIndex[260];            // start + end indexes, 256 bins
    static uint32_t aIi;                    // current index for 4 bins
    
    static DWORD WINAPI Thread0(LPVOID lpvoid);
    
    void RadixSort(uint64_t * a, uint64_t *b, size_t count)
    {
    uint32_t i,m,n;
    uint64_t u;
        for(i = 0; i < count; i++)          // generate histogram
            aIndex[(size_t)(a[i] >> 56)]++;
        m = 0;                              // convert to indexes
        for(i = 0; i < 257; i++){           //  including end of bin[255]
            n = aIndex[i];
            aIndex[i] = m;
            m += n;
        }
        for(i = 0; i < count; i++){         // sort by msb
            u = a[i];
            m = (uint32_t)(u >> 56);
            b[aIndex[m]++] = u;
        }
        for(i = 256; i; i--)                // restore aIndex
            aIndex[i] = aIndex[i-1];
        aIndex[0] = 0;
        for(aIi = 0; aIi < 256; aIi += 4){
            ReleaseSemaphore(hs1, 1, NULL);     // start threads
            ReleaseSemaphore(hs2, 1, NULL);
            ReleaseSemaphore(hs3, 1, NULL);
            Thread0(NULL);
            WaitForSingleObject(hs0, INFINITE); // wait for threads done
            WaitForSingleObject(hs0, INFINITE);
            WaitForSingleObject(hs0, INFINITE);
        }        
    }
    
    void RadixSort7(uint64_t * a, uint64_t *b, size_t count)
    {
    uint32_t mIndex[7][256] = {0};          // count / index matrix
    uint32_t i,j,m,n;
    uint64_t u;
        for(i = 0; i < count; i++){         // generate histograms
            u = a[i];
            for(j = 0; j < 7; j++){
                mIndex[j][(size_t)(u & 0xff)]++;
                u >>= 8;
            }
        }
        for(j = 0; j < 7; j++){             // convert to indices
            m = 0;
            for(i = 0; i < 256; i++){
                n = mIndex[j][i];
                mIndex[j][i] = m;
                m += n;
            }       
        }
        for(j = 0; j < 7; j++){             // radix sort
            for(i = 0; i < count; i++){     //  sort by current lsb
                u = a[i];
                m = (size_t)(u>>(j<<3))&0xff;
                b[mIndex[j][m]++] = u;
            }
            std::swap(a, b);                //  swap ptrs
        }
    }
    
    static DWORD WINAPI Thread0(LPVOID lpvoid)
    {
        RadixSort7(pb + aIndex[aIi+0], pa + aIndex[aIi+0], aIndex[aIi+1]-aIndex[aIi+0]);
        return 0;
    }
    
    static DWORD WINAPI Thread1(LPVOID lpvoid)
    {
        while(1){
            WaitForSingleObject(hs1, INFINITE); // wait for semaphore
            RadixSort7(pb + aIndex[aIi+1], pa + aIndex[aIi+1], aIndex[aIi+2]-aIndex[aIi+1]);
            ReleaseSemaphore(hs0, 1, NULL);     // indicate done
            if(aIi == 252)                      // exit if all bins done
                return 0;
        }
    }
    
    static DWORD WINAPI Thread2(LPVOID lpvoid)
    {
        while(1){
            WaitForSingleObject(hs2, INFINITE); // wait for semaphore
            RadixSort7(pb + aIndex[aIi+2], pa + aIndex[aIi+2], aIndex[aIi+3]-aIndex[aIi+2]);
            ReleaseSemaphore(hs0, 1, NULL);     // indicate done
            if(aIi == 252)                      // exit if all bins done
                return 0;
        }
    }
    
    static DWORD WINAPI Thread3(LPVOID lpvoid)
    {
        while(1){
            WaitForSingleObject(hs3, INFINITE); // wait for semaphore
            RadixSort7(pb + aIndex[aIi+3], pa + aIndex[aIi+3], aIndex[aIi+4]-aIndex[aIi+3]);
            ReleaseSemaphore(hs0, 1, NULL);     // indicate done
            if(aIi == 252)                      // exit if all bins done
                return 0;
        }
    }
    
    uint64_t rnd64()                        // random 64 bit integer
    {
    static uint64_t r = 1ull;
        r = r * 6364136223846793005ull + 1442695040888963407ull;
        return r;
    }
    
    int main( )
    {
        uint64_t * a = new uint64_t [COUNT];    // allocate data array
        uint64_t * b = new uint64_t [COUNT];    // allocate temp array
        size_t i;
        pa = a;                                 // set global ptrs
        pb = b;
    
        hs0 = CreateSemaphore(NULL,0,7,NULL);   // create semaphores
        hs1 = CreateSemaphore(NULL,0,1,NULL);
        hs2 = CreateSemaphore(NULL,0,1,NULL);
        hs3 = CreateSemaphore(NULL,0,1,NULL);
    
        ht1 = CreateThread(NULL, 0, Thread1, 0, 0, 0);  // create threads
        ht2 = CreateThread(NULL, 0, Thread2, 0, 0, 0);
        ht3 = CreateThread(NULL, 0, Thread3, 0, 0, 0);
    
        for(i = 0; i < COUNT; i++){             // generate data
            a[i] = rnd64();
        }
    
    #if QP
        QueryPerformanceFrequency(&liQPFrequency);
        dQPFrequency = (double)liQPFrequency.QuadPart;
        timeBeginPeriod(1);                     // set ticker to 1000 hz
        Sleep(128);                             // wait for it to settle
        QueryPerformanceCounter(&liStartTime);
    #else
        ctTimeStart = clock();
    #endif
    
        RadixSort(a, b, COUNT);
    
    #if QP
        QueryPerformanceCounter(&liStopTime);
        dStartTime = (double)liStartTime.QuadPart;
        dStopTime  = (double)liStopTime.QuadPart;
        dElapsedTime = (dStopTime - dStartTime) / dQPFrequency;
        timeEndPeriod(1);                       // restore ticker to default
        std::cout << "# of seconds " << dElapsedTime << std::endl;
    #else
        ctTimeStop = clock();
        std::cout << "# of ticks " << ctTimeStop - ctTimeStart << std::endl;
    #endif
    
        WaitForSingleObject(ht1, INFINITE); // wait for threads
        WaitForSingleObject(ht2, INFINITE);
        WaitForSingleObject(ht3, INFINITE);
        CloseHandle(ht1);                   // close threads
        CloseHandle(ht2);
        CloseHandle(ht3);
        CloseHandle(hs0);                   // close semaphores
        CloseHandle(hs1);
        CloseHandle(hs2);
        CloseHandle(hs3);
        
        for(i = 1; i < COUNT; i++){
            if(a[i] < a[i-1]){
                break;}}
        if(i == COUNT)
            std::cout << "passed" << std::endl;
        else
            std::cout << "failed" << std::endl;
    
        delete[] b;
        delete[] a;
        return 0;
    }
    Last edited by rcgldr; 04-02-2020 at 01:35 AM.

  6. #6
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Fede View Post
    alignment of the code/loop could create such a difference. I am also a bit new to this kind of optimizations, do you know a good reference about this matter?
    I forgot to reply to this before. It's possible that the amount of sensitivity to alignment may be specific to the Intel 3770K or any of the third generation of processors known as "Ivy Bridge", which were first released back in 2012. Later processors may not have that issue, or at least there won't be such a difference in performance due to alignment. I'm not sure of a good reference for this, since these issues are processor or processor generation specific. There is another thread about this at that other forum for second generation "Sandy Bridge" ("SnB") processors, with links to prior related threads. The threads mention tools, some created by the people involved in those threads, used to report on the individual performance related aspects of a processor. These threads will give you an idea of what is involved in finding out about such issues.

    performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs - Stack Overflow

    That thread links to a prior thread, also about alignment:

    Performance optimisations of x86-64 assembly - Alignment and branch prediction - Stack Overflow
    Last edited by rcgldr; 04-02-2020 at 10:49 AM.

  7. #7
    Registered User
    Join Date
    Mar 2020
    Posts
    11
    As soon as I have time I'll make a version of the threaded code using gnu pthread library and post it here just in case it is useful for someone.
    Once again thanks for all the really useful information!

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Radix Sort ( + counting sort) does not stable sort.
    By siepet in forum C Programming
    Replies: 6
    Last Post: 11-26-2013, 12:04 PM
  2. radix sort
    By rcgldr in forum C Programming
    Replies: 9
    Last Post: 08-20-2013, 05:44 PM
  3. radix sort and radix exchange sort.
    By whatman in forum C Programming
    Replies: 1
    Last Post: 07-31-2003, 12:24 PM
  4. Radix Sort
    By Trauts in forum C++ Programming
    Replies: 15
    Last Post: 03-06-2003, 04:46 PM
  5. Radix sort.
    By Drew in forum C++ Programming
    Replies: 1
    Last Post: 09-09-2001, 04:48 AM

Tags for this Thread