Binary Radix Sort

**Fede** · 03-31-2020

Hi,

I need some help with implementing a Radix Sort Algorithm. I found a related thread posted by @rcgldr (don't know how to @user) that although really useful it is now closed.
My question is if the code presented there is the best available alternative or if there is a C or C++ library I should use instead.

I also found this implementation on Github but I cannot make it run correctly.

Any information would be really appreciated.

Thanks in advance!

**rcgldr** · 04-01-2020

The code in the prior thread could use some minor tweaks to make it a bit faster, but it's at a point of diminishing returns. As pointed out in the prior thread, if sorting 64 bit unsigned integers, you could reduce from 8 passes to 6 passes using (11,11,11,11,10,10) bit field sizes instead of (8,8,8,8,8,8,8,8) bit field sizes, but in my testing, I didn't see much improvement in time.

Another option is for the first radix pass to be done on the most significant bits, creating 256 (or 1024 or 2048) bins, on a array much larger than the total cache. The ideas is that each of the bins will fit at least within the L3 cache (8 MB on my 3770k) processor, but it only reduced time by 5%. Example code below.

There is an issue with benchmarks with the 3770K (3.5 ghz) processor. It's really sensitive to alignment. When doing assembly code, I've found that adding a few aligns to 16 byte boundary, sometimes an even boundary, sometimes and odd boundary, makes a significant difference in run time. In a question I posted at another forum, the alignment of the calling code and the alignment of a tight loop, created a range of time from 1.465 seconds to 2.000 seconds, a 36.5% increase in time due to alignment done via nops between functions:

performance - Indexed branch overhead on X86 64 bit mode - Stack Overflow

In the C programming prior thread about radix sort, there are some posts about random number generators, such as xor + shift sequences versus multiply + add sequences, but multiply on processors even back then only take 3 cycles, and random number generators based on multiply + add are fairly common.

Linear congruential generator - Wikipedia

Example code for sorting 36 million psuedo random 32 bit integers, a hybrid radix sort that sorts by the most significant 8 bits to create 256 bins, followed by 256 (8,8,8) bit field radix sorts to sort each bin. Even though each of the 256 bins is 562500 bytes in size, easily fitting within L3 cache, it's only about a 5% improvement over a (8,8,8,8) bit field radix sort (from 0.37 seconds down to 0.35 seconds). The conditional QP enables usage of Windows QueryPerformanceCounter for timing.

Code:

//      rsort32ml.cpp
#include <fstream>
#include <iostream>
#include <cstdlib>
#include <ctime>

#define COUNT (6000*6000)            // number of values to sort

#define QP 1        // if != 0, use queryperformance for timer

#if QP
#include <math.h>
#include <windows.h>
#pragma comment(lib, "winmm.lib")
typedef LARGE_INTEGER LI64;
#endif

#if QP
static LI64     liQPFrequency;  // cpu counter values
static LI64     liStartTime;
static LI64     liStopTime;
static double   dQPFrequency;
static double   dStartTime;
static double   dStopTime;
static double   dElapsedTime;
#else
clock_t ctTimeStart;            // clock values
clock_t ctTimeStop;
#endif

typedef unsigned int uint32_t;

// sort a bin by 3 least significant bytes
void RadixSort3(uint32_t * a, uint32_t *b, size_t count)
{
size_t mIndex[3][256] = {0};            // count / matrix
size_t i,j,m,n;
uint32_t u;
    if(count == 0)
        return;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 3; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }       
    }
    for(j = 0; j < 3; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }       
    }
    for(j = 0; j < 3; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
}

// split array into 256 bins according to most significant byte
void RadixSort(uint32_t * a, uint32_t*b, size_t count)
{
size_t aIndex[260] = {0};               // count / array
size_t i;
    for(i = 0; i < count; i++)          // generate histogram
        aIndex[1+((size_t)(a[i] >> 24))]++;
    for(i = 2; i < 257; i++)            // convert to indices
        aIndex[i] += aIndex[i-1];
    for(i = 0; i < count; i++)          //  sort by msb
        b[aIndex[a[i]>>24]++] = a[i];
    for(i = 256; i; i--)                // restore aIndex
        aIndex[i] = aIndex[i-1];
    aIndex[0] = 0;
    for(i = 0; i < 256; i++)            // radix sort the 256 bins
        RadixSort3(&b[aIndex[i]], &a[aIndex[i]], aIndex[i+1]-aIndex[i]);
}

int main( )
{
    uint32_t * a = new uint32_t [COUNT];    // allocate data array
    uint32_t * b = new uint32_t [COUNT];    // allocate temp array
    size_t i;
    uint32_t u;
//      generate data
    for(i = 0; i < COUNT; i++){
        u  = (((uint32_t)((rand()>>4) & 0xff))<< 0);
        u += (((uint32_t)((rand()>>4) & 0xff))<< 8);
        u += (((uint32_t)((rand()>>4) & 0xff))<<16);
        u += (((uint32_t)((rand()>>4) & 0xff))<<24);
        a[i] = u;
    }

#if QP
    QueryPerformanceFrequency(&liQPFrequency);
    dQPFrequency = (double)liQPFrequency.QuadPart;
    timeBeginPeriod(1);                     // set ticker to 1000 hz
    Sleep(128);                             // wait for it to settle
    QueryPerformanceCounter(&liStartTime);
#else
    ctTimeStart = clock();
#endif

//      sort data
    RadixSort(a, b, COUNT);

#if QP
    QueryPerformanceCounter(&liStopTime);
    dStartTime = (double)liStartTime.QuadPart;
    dStopTime  = (double)liStopTime.QuadPart;
    dElapsedTime = (dStopTime - dStartTime) / dQPFrequency;
    timeEndPeriod(1);                       // restore ticker to default
    std::cout << "# of seconds " << dElapsedTime << std::endl;
#else
    ctTimeStop = clock();
    std::cout << "# of ticks " << ctTimeStop - ctTimeStart << std::endl;
#endif

    for(i = 1; i < COUNT; i++){
        if(a[i] < a[i-1]){
            std::cout << "failed" << std::endl;
            break;}}
    if(i == COUNT)
        std::cout << "passed" << std::endl;

    delete[] b;
    delete[] a;
    return 0;
}

**Fede** · 04-01-2020

First of all thanks for sharing your code and knowledge, it's been a huge help for me.

I ran some tests using 10⁷ unsigned 32 bit integers and your serial code outperforms bash sort parallel implementation by almost an order of magnitude. It is quiet impressive.
I modified your original code so that it can sort records of variable size, X bytes, using only Y bytes as a key (Y<X). This is mainly to handle tables with multiple columns, where you sort with respect to only one.
The code is based on your radix sort algorithm from the prior thread.

Code:

unsigned char * RadixSort(unsigned char * pData, std::size_t count, std::size_t record_size, std::size_t key_size, std::size_t key_offset = 0)
{
    typedef uint8_t k_t [key_size];
    typedef uint8_t record_t [record_size];


    size_t mIndex[key_size][256] = {0};            /* index matrix */
    unsigned char * pTemp = new unsigned char [count*sizeof(record_t)];
    unsigned char * pDst, * pSrc;
    size_t i,j,m,n;
    k_t u;
    record_t v;

    for(i = 0; i < count; i++)                  /* generate histograms */
    {         
        std::memcpy(&u, (pData + record_size*i + key_offset), sizeof(u));
        
        for(j = 0; j < key_size; j++)
        {
            mIndex[j][(size_t)(u[j])]++;
            // mIndex[j][(size_t)(u & 0xff)]++;
            // u >>= 8;
        }       
    }
    for(j = 0; j < key_size; j++)             /* convert to indices */
    {
        n = 0;
        for(i = 0; i < 256; i++)
        {
            m = mIndex[j][i];
            mIndex[j][i] = n;
            n += m;
        }       
    }

    pDst = pTemp;                       /* radix sort */
    pSrc = pData;
    for(j = 0; j < key_size; j++)
    {
        for(i = 0; i < count; i++)
        {
            std::memcpy(&u, (pSrc + record_size*i + key_offset), sizeof(u));
            std::memcpy(&v, (pSrc + record_size*i), sizeof(v));
            m = (size_t)(u[j]);
            // m = (size_t)(u >> (j<<3)) & 0xff;
            std::memcpy(pDst + record_size*mIndex[j][m]++, &v, sizeof(v));
        }
        std::swap(pSrc, pDst);
        
    }
    return(pSrc);
}

Any criticism/optimization to my code is more than welcome as I am far from being a good C/C++ programmer.
For starters, in each radix pass I move around the whole record which maybe is not so efficient.

the alignment of the calling code and the alignment of a tight loop, created a range of time from 1.465 seconds to 2.000 seconds, a 36.5% increase in time due to alignment done via nops between functions

This is also surprising for me, I didn't know the alignment of the code/loop could create such a difference. I am also a bit new to this kind of optimizations, do you know a good reference about this matter?

One last question, a parallel implementation of the code seems more or less straight forward. Do you think it is worth the effort?

Sorry for the long reply. In any case with the information you provide me I am more than happy.

**rcgldr** · 04-01-2020

Originally Posted by Fede

A parallel implementation of the code seems more or less straight forward.

I modded a radix sort for 36 million psuedo random 64 bit integers, using (8,8,8,8,8,8,8,8) bit fields (8 passes), to use 4 threads corresponding to the 4 cores on my processor (Intel 3770K). The initial pass sorts by the most significant 8 bits, creating 256 logical bins. Then the 256 bins are sorted 4 at a time in parallel (multi-threading). The 4 thread version is 2.5 times as fast as the single thread version. I used Windows specific semaphores and threads for this. Probably the main advantage of the 4 threaded radix sort is that is uses all 4 core's L1 and L2 cache (the L3 cache and main memory are shared).

**rcgldr** · 04-02-2020

Example C++ code for Windows. Converting the code to C is simple, but more effort would be needed to convert to Posix semaphores and threads. This example takes about 0.4 seconds to sort 36 million pseudo random 64 bit integers, versus about 1.0 seconds for a non-threaded version.

Code:

#include <iostream>
#include <windows.h>

#define COUNT (6000*6000)            // number of values to sort

#define QP 1        // if != 0, use queryperformance for timer

#if QP
#include <math.h>
#pragma comment(lib, "winmm.lib")
typedef LARGE_INTEGER LI64;
#else
#include <ctime>
#endif

#if QP
static LI64     liQPFrequency;  // cpu counter values
static LI64     liStartTime;
static LI64     liStopTime;
static double   dQPFrequency;
static double   dStartTime;
static double   dStopTime;
static double   dElapsedTime;
#else
clock_t ctTimeStart;            // clock values
clock_t ctTimeStop;
#endif

typedef unsigned       int uint32_t;
typedef unsigned long long uint64_t;

static HANDLE hs0;                      // semaphore handles
static HANDLE hs1;
static HANDLE hs2;
static HANDLE hs3;

static HANDLE ht1;                      // thread handles
static HANDLE ht2;
static HANDLE ht3;

static uint64_t *pa, *pb;               // ptrs to arrays
static uint32_t aIndex[260];            // start + end indexes, 256 bins
static uint32_t aIi;                    // current index for 4 bins

static DWORD WINAPI Thread0(LPVOID lpvoid);

void RadixSort(uint64_t * a, uint64_t *b, size_t count)
{
uint32_t i,m,n;
uint64_t u;
    for(i = 0; i < count; i++)          // generate histogram
        aIndex[(size_t)(a[i] >> 56)]++;
    m = 0;                              // convert to indexes
    for(i = 0; i < 257; i++){           //  including end of bin[255]
        n = aIndex[i];
        aIndex[i] = m;
        m += n;
    }
    for(i = 0; i < count; i++){         // sort by msb
        u = a[i];
        m = (uint32_t)(u >> 56);
        b[aIndex[m]++] = u;
    }
    for(i = 256; i; i--)                // restore aIndex
        aIndex[i] = aIndex[i-1];
    aIndex[0] = 0;
    for(aIi = 0; aIi < 256; aIi += 4){
        ReleaseSemaphore(hs1, 1, NULL);     // start threads
        ReleaseSemaphore(hs2, 1, NULL);
        ReleaseSemaphore(hs3, 1, NULL);
        Thread0(NULL);
        WaitForSingleObject(hs0, INFINITE); // wait for threads done
        WaitForSingleObject(hs0, INFINITE);
        WaitForSingleObject(hs0, INFINITE);
    }        
}

void RadixSort7(uint64_t * a, uint64_t *b, size_t count)
{
uint32_t mIndex[7][256] = {0};          // count / index matrix
uint32_t i,j,m,n;
uint64_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 7; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }
    }
    for(j = 0; j < 7; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }       
    }
    for(j = 0; j < 7; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
}

static DWORD WINAPI Thread0(LPVOID lpvoid)
{
    RadixSort7(pb + aIndex[aIi+0], pa + aIndex[aIi+0], aIndex[aIi+1]-aIndex[aIi+0]);
    return 0;
}

static DWORD WINAPI Thread1(LPVOID lpvoid)
{
    while(1){
        WaitForSingleObject(hs1, INFINITE); // wait for semaphore
        RadixSort7(pb + aIndex[aIi+1], pa + aIndex[aIi+1], aIndex[aIi+2]-aIndex[aIi+1]);
        ReleaseSemaphore(hs0, 1, NULL);     // indicate done
        if(aIi == 252)                      // exit if all bins done
            return 0;
    }
}

static DWORD WINAPI Thread2(LPVOID lpvoid)
{
    while(1){
        WaitForSingleObject(hs2, INFINITE); // wait for semaphore
        RadixSort7(pb + aIndex[aIi+2], pa + aIndex[aIi+2], aIndex[aIi+3]-aIndex[aIi+2]);
        ReleaseSemaphore(hs0, 1, NULL);     // indicate done
        if(aIi == 252)                      // exit if all bins done
            return 0;
    }
}

static DWORD WINAPI Thread3(LPVOID lpvoid)
{
    while(1){
        WaitForSingleObject(hs3, INFINITE); // wait for semaphore
        RadixSort7(pb + aIndex[aIi+3], pa + aIndex[aIi+3], aIndex[aIi+4]-aIndex[aIi+3]);
        ReleaseSemaphore(hs0, 1, NULL);     // indicate done
        if(aIi == 252)                      // exit if all bins done
            return 0;
    }
}

uint64_t rnd64()                        // random 64 bit integer
{
static uint64_t r = 1ull;
    r = r * 6364136223846793005ull + 1442695040888963407ull;
    return r;
}

int main( )
{
    uint64_t * a = new uint64_t [COUNT];    // allocate data array
    uint64_t * b = new uint64_t [COUNT];    // allocate temp array
    size_t i;
    pa = a;                                 // set global ptrs
    pb = b;

    hs0 = CreateSemaphore(NULL,0,7,NULL);   // create semaphores
    hs1 = CreateSemaphore(NULL,0,1,NULL);
    hs2 = CreateSemaphore(NULL,0,1,NULL);
    hs3 = CreateSemaphore(NULL,0,1,NULL);

    ht1 = CreateThread(NULL, 0, Thread1, 0, 0, 0);  // create threads
    ht2 = CreateThread(NULL, 0, Thread2, 0, 0, 0);
    ht3 = CreateThread(NULL, 0, Thread3, 0, 0, 0);

    for(i = 0; i < COUNT; i++){             // generate data
        a[i] = rnd64();
    }

#if QP
    QueryPerformanceFrequency(&liQPFrequency);
    dQPFrequency = (double)liQPFrequency.QuadPart;
    timeBeginPeriod(1);                     // set ticker to 1000 hz
    Sleep(128);                             // wait for it to settle
    QueryPerformanceCounter(&liStartTime);
#else
    ctTimeStart = clock();
#endif

    RadixSort(a, b, COUNT);

#if QP
    QueryPerformanceCounter(&liStopTime);
    dStartTime = (double)liStartTime.QuadPart;
    dStopTime  = (double)liStopTime.QuadPart;
    dElapsedTime = (dStopTime - dStartTime) / dQPFrequency;
    timeEndPeriod(1);                       // restore ticker to default
    std::cout << "# of seconds " << dElapsedTime << std::endl;
#else
    ctTimeStop = clock();
    std::cout << "# of ticks " << ctTimeStop - ctTimeStart << std::endl;
#endif

    WaitForSingleObject(ht1, INFINITE); // wait for threads
    WaitForSingleObject(ht2, INFINITE);
    WaitForSingleObject(ht3, INFINITE);
    CloseHandle(ht1);                   // close threads
    CloseHandle(ht2);
    CloseHandle(ht3);
    CloseHandle(hs0);                   // close semaphores
    CloseHandle(hs1);
    CloseHandle(hs2);
    CloseHandle(hs3);
    
    for(i = 1; i < COUNT; i++){
        if(a[i] < a[i-1]){
            break;}}
    if(i == COUNT)
        std::cout << "passed" << std::endl;
    else
        std::cout << "failed" << std::endl;

    delete[] b;
    delete[] a;
    return 0;
}

**rcgldr** · 04-02-2020

Originally Posted by Fede

alignment of the code/loop could create such a difference. I am also a bit new to this kind of optimizations, do you know a good reference about this matter?

I forgot to reply to this before. It's possible that the amount of sensitivity to alignment may be specific to the Intel 3770K or any of the third generation of processors known as "Ivy Bridge", which were first released back in 2012. Later processors may not have that issue, or at least there won't be such a difference in performance due to alignment. I'm not sure of a good reference for this, since these issues are processor or processor generation specific. There is another thread about this at that other forum for second generation "Sandy Bridge" ("SnB") processors, with links to prior related threads. The threads mention tools, some created by the people involved in those threads, used to report on the individual performance related aspects of a processor. These threads will give you an idea of what is involved in finding out about such issues.

performance - Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs - Stack Overflow

That thread links to a prior thread, also about alignment:

Performance optimisations of x86-64 assembly - Alignment and branch prediction - Stack Overflow

**Fede** · 04-03-2020

As soon as I have time I'll make a version of the threaded code using gnu pthread library and post it here just in case it is useful for someone.
Once again thanks for all the really useful information!

Thread: Binary Radix Sort

Thread Tools

Search Thread

Display

Binary Radix Sort

Similar Threads

Radix Sort ( + counting sort) does not stable sort.

radix sort

radix sort and radix exchange sort.

Radix Sort

Radix sort.

Tags for this Thread