Example radix sort function to sort an array of 64-bit unsigned integers. To allow for variable bin sizes, the array is scanned once to build a matrix of 8 histograms of 256 counts each, one histogram per byte of the integers, counting the occurrences of each possible 8-bit value. The histograms are then converted into starting indices by summing the counts. Finally, eight radix sort passes are performed using the matrix of indices, post-incrementing each index as it is used.

Code:
```c
typedef unsigned long long UI64;
typedef unsigned long long *PUI64;

PUI64 RadixSort(PUI64 pData, PUI64 pTemp, size_t count)
{
    size_t mIndex[8][256] = {{0}};      /* index matrix */
    PUI64 pDst, pSrc, pTmp;
    size_t i, j, m, n;
    UI64 u;

    for (i = 0; i < count; i++) {       /* generate histograms */
        u = pData[i];
        for (j = 0; j < 8; j++) {
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }
    }
    for (j = 0; j < 8; j++) {           /* convert to indices */
        n = 0;
        for (i = 0; i < 256; i++) {
            m = mIndex[j][i];
            mIndex[j][i] = n;
            n += m;
        }
    }
    pDst = pTemp;                       /* radix sort */
    pSrc = pData;
    for (j = 0; j < 8; j++) {
        for (i = 0; i < count; i++) {
            u = pSrc[i];
            m = (size_t)(u >> (j << 3)) & 0xff;
            pDst[mIndex[j][m]++] = u;
        }
        pTmp = pSrc;
        pSrc = pDst;
        pDst = pTmp;
    }
    return pSrc;                        /* even number of swaps, so this is pData */
}
```

2. A few notes, if you don't mind: Originally Posted by rcgldr Code:
```c
typedef unsigned long long UI64;
typedef unsigned long long *PUI64;
```
Please use the C99 uint64_t name for the type instead. (Just because Microsoft chooses not to provide it, does not mean you cannot define it yourself.)

And please do not define pointer types, as they make the code harder to read. Originally Posted by rcgldr Code:
```c
RadixSort(PUI64 pData, PUI64 pTemp, int count)
int mIndex[8][256] = {0};
```
The int type for the count and the indexes restricts the array length to 2^31-1 = 2,147,483,647 entries.

Have you checked the cache effects of having eight index arrays as opposed to just reusing one? It looks like an obvious improvement, avoiding the seven extra passes over the data, for very little cost (14k more stack used on 64-bit); the roughly 8% improvement to my implementation at various sizes seems to agree:
Attachment 12889

Did you verify your keys have 64 random bits, to make it as hard for the CPU as possible? Most C libraries provide a poor random number generator. For this kind of testing, I use Xorshift:
Code:
```c
static uint32_t xorshift_state[4] = {
    123456789U,
    362436069U,
    521288629U,
    88675123U
};

static uint32_t xorshift_u32(void)
{
    const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
    xorshift_state[0] = xorshift_state[1];
    xorshift_state[1] = xorshift_state[2];
    xorshift_state[2] = xorshift_state[3];
    return xorshift_state[3] ^= (temp >> 8U) ^ temp ^ (xorshift_state[3] >> 19U);
}

static uint64_t xorshift_u64(void)
{
    return (((uint64_t)xorshift_u32()) << 32) + (uint64_t)xorshift_u32();
}
```

3. Originally Posted by Nominal Animal I'll make this change on the next update. Originally Posted by Nominal Animal please do not define pointer types, as they make the code harder to read.
Old habit; Microsoft's old C headers did this a lot. Originally Posted by Nominal Animal The int type for the count and the indexes. Originally Posted by Nominal Animal Have you checked the cache effects of having eight index arrays as opposed to just reusing one? It looks like an obvious improvement, avoiding the seven extra passes over the data.
I don't understand your point. I make one pass of the data to generate all 8 sets of histograms in that matrix, then convert that matrix to 8 sets of 256 indices (these are the initial indices to the 256 logical bins used for each radix sort pass). During each radix sort pass, the high order index is fixed, so only one set (row) of 256 indices is being operated on per radix sort pass. Originally Posted by Nominal Animal Did you verify your keys have 64 random bits, to make it as hard for the CPU as possible?
Below is a code snippet of what I use for all the test sorts, in order to verify the output matches known good sorts. If I recall correctly it works fairly well, because the repeat cycle doesn't sync up to a multiple of 8 until it repeats 8 times. I don't know how many 64-bit integers this will produce before it starts to repeat.

I may convert to xorshift, but that will mean changing a fairly large group of test sort programs, mostly used to benchmark different algorithms.

Code:
```c
/* ... */
for (i = 0; i < NUMREC; i++) {
    r  = (((UI64)((rand() >> 4) & 0xff)) <<  0);
    r += (((UI64)((rand() >> 4) & 0xff)) <<  8);
    r += (((UI64)((rand() >> 4) & 0xff)) << 16);
    r += (((UI64)((rand() >> 4) & 0xff)) << 24);
    r += (((UI64)((rand() >> 4) & 0xff)) << 32);
    r += (((UI64)((rand() >> 4) & 0xff)) << 40);
    r += (((UI64)((rand() >> 4) & 0xff)) << 48);
    r += (((UI64)((rand() >> 4) & 0xff)) << 56);
    pData[i] = r;
}
```
I ran this sort with 10 million (10x1024x1024) integers, and the run time was 0.422 seconds as opposed to 0.785 seconds with the conventional merge sort program. Again, Intel 2600K, Win XP x64, in 64-bit mode, using Visual Studio 2005 as the compiler in both cases.

4. Originally Posted by rcgldr The code I posted in the other thread reuses the same array of 256 indices on all eight passes. Thus, it has to repopulate the index array again on each pass; eight times.

If there is no payload at all, on many architectures bin sizes of 2^10 or 2^11 yield a faster implementation than 2^8, due to fewer passes (7 or 6) over the data. This increases the index array size correspondingly.
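To make the wider-digit idea concrete, here is a sketch of mine (not code from the thread; the names and the 11-bit choice are my assumptions): generalizing the byte-at-a-time sort from the first post to RADIX_BITS-bit digits reduces the number of data passes to ceil(64/RADIX_BITS), i.e. six passes for 11-bit bins.

```c
#include <stddef.h>
#include <stdint.h>

#define RADIX_BITS   11u
#define RADIX_BINS   (1u << RADIX_BITS)                       /* 2048 bins */
#define RADIX_PASSES ((64u + RADIX_BITS - 1u) / RADIX_BITS)   /* 6 passes */

/* Sorts count 64-bit keys; pTemp must hold count elements.
   Returns the buffer containing the sorted result. */
uint64_t *RadixSortWide(uint64_t *pData, uint64_t *pTemp, size_t count)
{
    /* 6 x 2048 size_t counters: 96 KiB of stack on 64-bit */
    size_t mIndex[RADIX_PASSES][RADIX_BINS] = {{0}};
    uint64_t *pSrc = pData, *pDst = pTemp, *pTmp;
    size_t i, m, n;
    unsigned p;

    for (i = 0; i < count; i++)                 /* one histogram pass */
        for (p = 0; p < RADIX_PASSES; p++)
            mIndex[p][(size_t)(pData[i] >> (p * RADIX_BITS)) & (RADIX_BINS - 1u)]++;

    for (p = 0; p < RADIX_PASSES; p++) {        /* counts -> start indices */
        for (n = 0, i = 0; i < RADIX_BINS; i++) {
            m = mIndex[p][i];
            mIndex[p][i] = n;
            n += m;
        }
    }

    for (p = 0; p < RADIX_PASSES; p++) {        /* distribution passes */
        for (i = 0; i < count; i++) {
            const uint64_t u = pSrc[i];
            pDst[mIndex[p][(size_t)(u >> (p * RADIX_BITS)) & (RADIX_BINS - 1u)]++] = u;
        }
        pTmp = pSrc; pSrc = pDst; pDst = pTmp;  /* swap buffers */
    }
    return pSrc;    /* RADIX_PASSES is even here, so this is pData */
}
```

The last digit covers only 9 bits (64 = 5x11 + 9), which the mask handles automatically; whether this beats 8-bit digits depends on how much of the scatter working set fits in L1, as the thread discusses.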

If you generate the indices using a single pass over the data, not all of the index array may fit into L1 cache. I assumed 8×256 bins (16384 bytes on 64-bit) would be near the upper limit for my L1 cache (AMD Athlon II X4 640; 64k of L1 per core) considering that the L1 cache will also cache parts of the dataset, but I was obviously wrong. Rewriting the code linked above to use a count array populated in a single pass, saves about 8% of run time.

(It's been something like five years since I last optimized that code, so do forgive me for tuning it for older CPUs!)

Looking at the numbers, I suspect on your CPU -- all CPUs with more than 64k of L1 cache per core -- using six passes (four 11-bit and two 10-bit) rather than 8 might yield a faster implementation.

On my CPU it is slightly slower, because of the limited L1 cache size. While the active index array for 11-bit bins takes just 8×2^11 = 16384 bytes, most of the L1 cache is needed for handling the scattered stores efficiently. (This is also the reason why the test data needs to be random: current CPUs have very good predictors, and can detect most linear access patterns easily. So, linear congruential generators are not suitable for testing, in my opinion.) Originally Posted by rcgldr Code snippet of what I use for all the test sorts
There is a list of rand() implementations in the Wikipedia article on LCGs. Four bits of every value you grab are from the previous value (seed), four from the newly generated value. If the period is exactly 2^32, you may only get 2^(32-3) = 536870912 unique values, yielding a period of 536870912 -- you only get 2^29 unique 64-bit values out of 2^64 possible. I don't think that's good enough for reliable testing, especially not for the radix sort.
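The regular bit patterns of an LCG are easy to demonstrate (a sketch of mine, using the MSVC-style constants quoted later in this thread; the helper names are my own): with a power-of-two modulus, the low k bits form their own LCG mod 2^k, so the low bit alternates with period 2, the low byte repeats every 256 draws, and so on.

```c
#include <stdint.h>

/* One step of the MSVC-style LCG (modulus 2^32 via unsigned wraparound) */
static uint32_t lcg_next(uint32_t s)
{
    return s * 214013u + 2531011u;
}

/* Steps until the low `bits` bits of the state first return to their
   starting value; for this LCG that is exactly 2^bits. */
static unsigned low_bits_period(uint32_t seed, unsigned bits)
{
    const uint32_t mask = (1u << bits) - 1u;
    uint32_t s = lcg_next(seed);
    unsigned steps = 1u;

    while ((s & mask) != (seed & mask)) {
        s = lcg_next(s);
        steps++;
    }
    return steps;
}
```

So the bottom bits of successive states are extremely regular; rand() hides some of this by returning only bits 16..30, but this regularity is why LCG output can flatter a radix sort benchmark.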

Xorshift (Wikipedia) is faster, and passes the diehard tests. Both Xorshift and the diehard tests were developed by George Marsaglia, and the diehard tests are considered pretty reliable indicators of the quality of a pseudorandom number generator.

If you need a period longer than 2^128, I recommend the Mersenne Twister. (I prefer Xorshift even for simulations, though; 2^128 is plenty for me.) Originally Posted by rcgldr I ran this sort with 10 million (10x1024x1024) integers, and the run time was 0.422 seconds as opposed to 0.785 seconds with the conventional merge sort program.
Ah, that means the radix sort is faster than traditional sorts at millions of elements, like I've assumed thus far. Do you happen to have a quicksort implementation you could test also, for comparison?

Do you have a version that keeps a 64-bit payload with each key, similar to my implementation? i.e. sorts (key, payload) pairs based on the key? Sorting numbers only is rarely needed, but having a pointer payload is quite common.

By the way, none of this is important to me; I only find it interesting, nothing more. Do not bother, unless you find this interesting yourself, too.

(The implementations I use for MD simulations et cetera tend to be very specialized, and run on specific hardware, so they are not directly comparable.)

5. Originally Posted by Nominal Animal ... random number generator
The repeat cycle for rand is 2^32:

Code:
```c
long rand(void)
{
    static long seed;   /* initialized to 1 (one per thread in Windows) */
    return (((seed = seed * 214013L + 2531011L) >> 16) & 0x7fff);
}
/*  214013 =     17 * 12589 */
/* 2531011 = 7 * 17 * 21269 */
```
Multiply should be pretty fast on modern CPUs. The main issue is that rand() limits output to 15 bits. Since I call it 8 times per 64-bit pseudo-random integer, the repeat cycle is 2^29. If I have a test program that needs a longer repeat cycle, or if the data generation time is an issue, I'll switch to one of the other pseudo-random number generator functions. Originally Posted by Nominal Animal radix sort is faster than traditional sorts at millions of elements, like I've assumed thus far.
I would assume the break even point would occur at a much smaller number of elements, even using the simple version I wrote. Originally Posted by Nominal Animal Do you happen to have a quick sort implementation you could test also, for comparison?
I didn't write one myself. C++ vector::sort() is a quick sort, but there's some overhead for the generic template form of this. Originally Posted by Nominal Animal Do you have a version that keeps a 64-bit payload with each key, similar to my implementation?
Essentially that would be a 16-byte record, with the first 8 bytes used for the compare. In your case, it's two arrays of 8-byte elements. I don't have one of these.

6. Originally Posted by rcgldr Multiply should be pretty fast on modern CPUs.
It is, but not compared to bit shifts and binary exclusive-or, which are often just one or two clock cycles on many architectures. I think the three shifts, four XORs, and the four moves, are faster than integer multiplication + addition.

I should have said "not slower, possibly faster" rather than "faster", as I haven't compared Xorshift to Visual Studio rand()! Originally Posted by rcgldr the repeat cycle is 2^29
Because there are repeating bit patterns in LCGs, the sequences generated may affect radix sort run times. In particular, the run times measured are likely less than they are for truly random data.

For example, if you take double values between 0 and 1, or -1 and +1, and sort them using my radix sort, you'll find that it is faster than the same number of truly random 64-bit integers. This is because the doubles have essentially the same bit pattern in the highest ten bits or so (after the operation to convert them to uint64_t's). The "speed boost" due to the last loop in the radix sort using essentially a single bin, is greater than the extra two passes over the data done to convert the doubles to uint64_t's and back.
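The order-preserving conversion from double to uint64_t referred to here isn't spelled out in the thread; a common sketch (my function names, assuming IEEE 754 binary64 doubles) flips the sign bit of non-negative values and all bits of negative ones, so that unsigned integer order matches numeric order:

```c
#include <stdint.h>
#include <string.h>

#define SIGN_BIT 0x8000000000000000ULL

/* Map a double to a uint64_t whose unsigned ordering matches the
   numeric ordering of the doubles (finite values and infinities). */
static uint64_t double_to_sortable(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);               /* raw bit pattern, no conversion */
    /* negative: flip all bits; non-negative: set the sign bit */
    return (u & SIGN_BIT) ? ~u : (u | SIGN_BIT);
}

/* Exact inverse of double_to_sortable(). */
static double sortable_to_double(uint64_t u)
{
    double d;
    u = (u & SIGN_BIT) ? (u ^ SIGN_BIT) : ~u;
    memcpy(&d, &u, sizeof d);
    return d;
}
```

After this mapping, doubles in [0, 1) all share the same top bits (sign bit set, exponents clustered just below 0x3FF), which is exactly the near-constant high-byte pattern described above.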

It's really not that important, unless you need precise measurements for example for publication. Originally Posted by rcgldr I would assume the break even point would occur at a much smaller number of elements, even using the simple version I wrote.
Right.

Thing is, radix sort performs much more poorly for smaller datasets than other algorithms, whereas other algorithms perform only slightly worse than radix sort for sizes within a decade or so of the break-even point. Therefore, it's better to overestimate the break-even point than to underestimate it. Originally Posted by rcgldr I didn't write one myself. C++ vector::sort() is a quick sort, but there's some overhead for the generic template form of this.
That's okay, I was only interested because you have a CPU I haven't yet had access to.

By the way, data ordering -- whether to use a single array of structures, or separate arrays for each field -- is a slightly related, but even more complex topic.

Just as an example, consider 3D coordinates: whether to use one array of triplets (or quads for alignment), or three separate arrays.

If you need the distances between (sets of) coordinates, using separate arrays on current Intel/AMD architectures is much, much better. This is because the operations vectorize easily (to SSE2/SSE3/AVX), doubling or quadrupling the speed at which the distances (and delta vectors) can be calculated. (I'm actually thinking about trying to write a paper about them and their use in MD simulations, "augmented neighbour lists".)

7. Originally Posted by Nominal Animal I think the three shifts, four XORs, and the four moves, are faster than integer multiplication + addition.
I was considering that the number of variables involved, and the number of possible memory operations, could have more effect than the instruction timing differences: one variable using multiply + add, versus five variables using shifts and XORs. Originally Posted by Nominal Animal random data
The test src.txt file in sortp.zip is based on the binary output of a pi generator program, masking each byte with 0x7f and adding 0x20 to get an extended-ASCII range of "text", so it should be pretty close to random data. I forgot to mention that the msortp.cpp in that zip file is set up for natural merge sort, and carries the same overhead of using an array of group counts (the counts are used to track group or "run" boundaries): the basic merge sort starts off with an array of n/2 counts, while natural mode starts off with at most n/2 counts, depending on the natural (ascending or descending) runs. Originally Posted by Nominal Animal Thing is, radix sort performs much more poorly for smaller dataset than other algorithms
Seems the size of the compared data (the "key") is an important factor, since that determines the number of passes. Radix sort on byte values would be bad when sorting records with 64-byte keys (64 passes) as opposed to 8-byte keys (8 passes), not including the histogram pass, which adds another pass.

8. Originally Posted by rcgldr I was considering the number of variables involved
If the state is in a single cacheline, the access cost really is negligible. For every generator, the state must be in L1. Whether it is just one integer or a few, as long as it fits in a cache line (the smallest unit read into L1), it takes the same amount of time. Since current CPUs are typically pipelined, a good compiler may produce code that interleaves the moves and bit shifts, keeping all pipelines active; in such a case, the moves may be completely free. (Because otherwise the CPU would just spend the same time waiting for the pending operations to complete.)

(Once again, keeping information flowing at all times on all levels yields maximum performance. Or, as in this case, doing so may let the CPU do "extra" work for "free".)

In practice, the function call overhead (~66 clocks with GCC 4.6 on x86-64, if I remember correctly) is at least an order of magnitude larger than what the Xorshift needs (if inlined), so if you need to generate huge amounts of data, make sure your xorshift generator -- or even LCG, if using one -- is inlineable by the compiler. I typically have it implemented in a header file. Originally Posted by rcgldr Seems the size of the compared data (the "key") is an important factor
Oh, absolutely!

If you consider my implementation in the other thread (this post), it is well worth the added code to replace radix_sort_i64_ptr(), and even worth the added extra runtime for radix_sort_u64_ptr(), to do an extra pass over the keys to see if they vary in a smaller range. For example, if the difference between minimum and maximum key is less than 2^56, then subtracting the minimum (during radix indexing), doing only 7 radix rounds, and then adding the minimum back, yields the same results but quicker. Similarly for 2^48, 2^40, 2^32, 2^24, 2^16, and 2^8, of course. (You do need eight variants of the same function, though, which is a bit tedious for a programmer.)
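A sketch of what that pre-scan could look like (the function name is mine; this is not the linked implementation): one pass finds the key range, from which the number of 8-bit rounds follows directly.

```c
#include <stddef.h>
#include <stdint.h>

/* One pass over the keys: find the minimum key and how many 8-bit
   radix rounds cover (max - min). Sort (key - min) with that many
   rounds, then add min back. count must be at least 1. */
static unsigned radix_rounds_needed(const uint64_t *key, size_t count,
                                    uint64_t *min_out)
{
    uint64_t lo = key[0], hi = key[0], range;
    size_t i;
    unsigned rounds = 0u;

    for (i = 1; i < count; i++) {
        if (key[i] < lo) lo = key[i];
        if (key[i] > hi) hi = key[i];
    }
    *min_out = lo;
    range = hi - lo;
    do {                        /* count the bytes needed for the range */
        rounds++;
        range >>= 8;
    } while (range != 0u);
    return rounds;              /* 1..8 */
}
```

For keys clustered in a narrow band this drops most of the radix rounds; for keys spread over the full 64-bit range it simply reports 8 and the sort proceeds as usual.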

For doubles (and 32-bit floats for 32-bit radix sort), the minimum-maximum is obviously considered after the transformation to integer values, otherwise the logic is the same.

The extra cost is one pass over the data for unsigned keys; for signed keys and floating-point keys, the extra cost is folded into the indexing pass. So, here, the added code is certainly worth it!

Note that while locating the min and max and subtracting/adding a constant may look like a slowdown, they are all done in a phase which will be memory-bandwidth-limited in all practical cases (datasets large enough for radix sorting to make sense). There, doing a bit of extra arithmetic that does not require memory accesses does not increase the runtime measurably in practice. In essence, instead of waiting for RAM-to-L1 cache transfers to complete, the CPU sometimes does a bit of arithmetic first, then waits.
_ _ _

Surprisingly, the size of the payload affects the time taken in similar fashion.

That is because the payload is also scatter-stored to the target array, requiring L1 cache space to be efficient.

The 64-bit key and 64-bit payload is, in my opinion, typical for the tasks I'd use it for. In some cases I might need a 64-bit key and a 128-bit payload -- the payload being an (offset, length) pair. (Such cases are easy to test by replacing void * with a custom type of suitable size, in the implementation I linked to above.)
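As an illustration of a (key, payload) pair sort (my sketch, not the linked implementation): the structure below carries the payload along during each scatter, and the stability of the LSD passes keeps equal keys in their original order.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t key;
    void    *payload;           /* e.g. a pointer to the full record */
} pair64;

/* Byte-at-a-time LSD radix sort of pairs by key, in the style of the
   first post; pTemp must hold count elements. Returns the buffer
   holding the sorted result (always pData here: eight swaps). */
pair64 *RadixSortPairs(pair64 *pData, pair64 *pTemp, size_t count)
{
    size_t mIndex[8][256] = {{0}};
    pair64 *pSrc = pData, *pDst = pTemp, *pTmp;
    size_t i, j, m, n;

    for (i = 0; i < count; i++)             /* histograms, one pass */
        for (j = 0; j < 8; j++)
            mIndex[j][(size_t)(pData[i].key >> (8u * j)) & 0xff]++;

    for (j = 0; j < 8; j++) {               /* convert to indices */
        for (n = 0, i = 0; i < 256; i++) {
            m = mIndex[j][i];
            mIndex[j][i] = n;
            n += m;
        }
    }

    for (j = 0; j < 8; j++) {               /* radix sort passes */
        for (i = 0; i < count; i++)
            pDst[mIndex[j][(size_t)(pSrc[i].key >> (8u * j)) & 0xff]++] = pSrc[i];
        pTmp = pSrc; pSrc = pDst; pDst = pTmp;
    }
    return pSrc;
}
```

Each scatter now moves 16 bytes instead of 8, doubling both the bandwidth used and the L1 pressure -- which is exactly the payload cost discussed above.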

This "finickiness", I guess, is one of the reasons radix sort gets overlooked so often. After all, having a dataset with millions of entries is no longer rare, even on desktop machines.

9. Originally Posted by Nominal Animal If the state is in a single cacheline
Wouldn't the probability of the state remaining in the cache depend on the activity of other variables and data transfers performed by the program using the pseudo-random number generator? In my case, I'm just using the pseudo-random number generators to generate test data for my sorts, and performance isn't an issue for the size of the arrays I'm generating (32MB in most cases). However, I have now used xorshift in one of my new test programs. If I'm generating a test file, so far I've been using the output from a pi or e generator program, which is probably overkill in an attempt to get nearly true random numbers.

10. The 66 cycles of function call overhead I mentioned earlier is obviously wrong; it's really just 6 cycles or so in the typical case (code already cached). The rest of that 60 cycles is actually measurement overhead (rdtsc, and the manipulation of its 64-bit result, storing it) of the functions I typically use -- I just forgot that the 66-cycle figure includes both measurement and function call overheads. Originally Posted by rcgldr Wouldn't the probability of the state remaining in the cache depend on the activity of other variables and data transfers performed by the program using a pseudo-random number generator?
Well, no. (Edit: I mean, no, if you compare Xorshift (using four words of state), and an LCG (using only one word of state).)

You see, the L1 cacheline is the smallest unit that is cached. Usually there is very little to no difference whether a function accesses one word of data, or all of a cacheline, because the first access must anyway bring the entire cacheline to L1.

(Edit: So, what I mean is that if the other code causes the PRNG state to drop from the L1 cache, it will happen at the same cases and at the same frequency, regardless of whether that state is just one word or a full cacheline.

Obviously it depends completely on what the other code does, whether the PRNG state stays in L1 cache or not. But, when the data is already in L1 cache, accessing it is often free or almost free; costing only fractional clock cycles. The exploration below illustrates this. But how often that happens, is completely independent of whether the PRNG state is one word or a full cacheline.)

Just to keep things interesting, I ran some tests on my AMD Athlon II X4 640, using GCC-4.6, all 64-bit (kernel, libraries, generated code).

Consider these three functions:
Code:
```c
#include <stdint.h> /* For the uint32_t type only; just define it as unsigned int on Windows */

static uint32_t xorshift_state[4] = {
    123456789U,
    362436069U,
    521288629U,
    88675123U
};

uint32_t xorshift_dummy(void)
{
    return 0;
}

uint32_t xorshift_partial(void)
{
    const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
    return xorshift_state[0] ^= (temp >> 8U) ^ temp ^ (xorshift_state[0] >> 19U);
}

uint32_t xorshift_normal(void)
{
    const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
    xorshift_state[0] = xorshift_state[1];
    xorshift_state[1] = xorshift_state[2];
    xorshift_state[2] = xorshift_state[3];
    return xorshift_state[3] ^= (temp >> 8U) ^ temp ^ (xorshift_state[3] >> 19U);
}
```
GCC-4.6.3 compiles those (given -O3 -fomit-frame-pointer) to essentially
Code:
```
        .text
        .p2align 4,,15

xorshift_dummy:
        xorl    %eax, %eax
        ret

xorshift_partial:
        movl    xorshift_state(%rip), %edx
        movl    %edx, %ecx
        movl    %edx, %eax
        sall    $11, %ecx
        shrl    $19, %eax
        xorl    %ecx, %edx
        xorl    %ecx, %eax
        shrl    $8, %edx
        xorl    %edx, %eax
        movl    %eax, xorshift_state(%rip)
        ret

xorshift_normal:
        movl    xorshift_state(%rip), %eax
        movl    xorshift_state+12(%rip), %ecx
        movl    %eax, %edx
        sall    $11, %edx
        xorl    %eax, %edx
        movl    xorshift_state+4(%rip), %eax
        movl    %eax, xorshift_state(%rip)
        movl    xorshift_state+8(%rip), %eax
        movl    %ecx, xorshift_state+8(%rip)
        movl    %eax, xorshift_state+4(%rip)
        movl    %ecx, %eax
        shrl    $19, %eax
        xorl    %ecx, %eax
        xorl    %edx, %eax
        shrl    $8, %edx
        xorl    %edx, %eax
        movl    %eax, xorshift_state+12(%rip)
        ret

        .data
        .align 16
xorshift_state:
        .long    123456789
        .long    362436069
        .long    521288629
        .long    88675123
```
When benchmarking code, I always compile the functions in a separate compilation unit, to make sure GCC will not optimize away any calls. Using a simple test harness that calls the above functions in a loop, summing the result (for verification), and measuring the number of CPU cycles taken by the loop via RDTSC yields some surprising results:
Code:
```
xorshift_dummy():
    6.0 cycles on average per call

xorshift_partial():
    11.4 cycles on average per call

xorshift_normal():
    11.1 cycles on average per call
```
All of the above include both the function call, summation, and loop overheads. So, the xorshift_dummy() is the baseline to compare to.

The surprising bit: The version that accesses all four state values, is FASTER than the one that does the same arithmetic operations, but to the same word. (The difference is small, and well within random fluctuations, but it is perfectly reproducible when many calls are averaged, so I do believe it is real.)

I do believe this is yet another example of how important it is to keep data flowing on all levels to get maximum performance. In xorshift_partial(), the operations modify the same word, so it cannot take full advantage of the pipelines on x86-64 CPUs. By adding the moves (and especially storing the result to a different word) -- also note how the data rotates in xorshift_normal(), that's why I included the assembly -- GCC can use the pipelines much more efficiently, ultimately yielding a slight decrease in run time, even if more work is done.

It's all about the continuous flow, baby. (Sorry, couldn't resist. I did expect the xorshift_normal() to be a clock cycle or two slower, myself, and I always get a bit excited when I find out new stuff.) Originally Posted by rcgldr In my case, I'm just using the pseudo random number generators to generate test data for my sorts, and performance isn't an issue for the size of the arrays I'm generating (32MB in most cases).
Right, and like I said, radix sorting is a special case, because regular bit patterns -- which are typical for all LCGs -- cause it to run faster (due to CPU cache effects) than truly random bit patterns do. Comparison sorts aren't sensitive to this at all -- except possibly if you have more than one period of data, and even then it's usually a negligible effect. Originally Posted by rcgldr If I'm generating a test file, so far I've been using the output from a pi or e generator program, which is probably overkill in an attempt to get nearly true random numbers.
Yep, I'd consider that overkill, too. I used to use Mersenne Twister a lot, but the properties and simplicity of the Xorshift makes it much nicer.

I just wish somebody had told me about Xorshift earlier, myself. That's the reason I keep bringing Xorshift up whenever PRNGs are discussed here.