The 66 cycles of function call overhead I mentioned earlier is obviously wrong; it's really only about 6 cycles in the typical case (code already cached). The other 60 cycles are actually measurement overhead (executing rdtsc, manipulating its 64-bit result, and storing it) in the functions I typically use -- I just forgot that the 66-cycle figure includes both the measurement and the function call overheads.
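For reference, here is a sketch of the kind of cycle-count reader I mean; this is GCC inline assembly, x86-64 only, and a real measurement would typically add a serializing instruction (cpuid or rdtscp), so treat it as illustrative:
Code:
#include <stdint.h>

/* Read the CPU timestamp counter. rdtsc returns the low 32 bits in EAX
   and the high 32 bits in EDX; combining and storing this 64-bit result
   is exactly where the measurement overhead comes from. */
static inline uint64_t cycles_now(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}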
Originally Posted by rcgldr
Wouldn't the probability of the state remaining in the cache depend on the activity of other variables and data transfers performed by the program using a pseudo-random number generator?
Well, no. (Edit: I mean, no, not if you are comparing Xorshift (which uses four words of state) against an LCG (which uses only one word of state).)
You see, the L1 cacheline is the smallest unit that is cached. There is usually little to no difference between a function accessing one word of data and accessing a full cacheline, because the first access brings the entire cacheline into L1 anyway.
(Edit: So, what I mean is that if the other code causes the PRNG state to drop from the L1 cache, that will happen in the same cases and at the same frequency regardless of whether the state is one word or a full cacheline.
Obviously, whether the PRNG state stays in the L1 cache depends completely on what the other code does. But when the data is already in L1 cache, accessing it is often free or almost free, costing only fractional clock cycles; the exploration below illustrates this. How often the state stays cached is completely independent of whether it is one word or a full cacheline.)
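To make the sizes concrete, here is a trivial illustration (assuming the 64-byte cacheline typical of x86/x86-64; other CPUs may differ):
Code:
#include <stdint.h>
#include <stdio.h>

/* Both states fit comfortably inside a single 64-byte cacheline,
   so a cache miss costs the same for either generator. */
static uint32_t lcg_state = 1U;           /* one word of state */
static uint32_t xorshift_state[4] = {     /* four words of state */
    123456789U, 362436069U, 521288629U, 88675123U
};

int main(void)
{
    printf("LCG state:      %zu bytes\n", sizeof lcg_state);      /*  4 */
    printf("Xorshift state: %zu bytes\n", sizeof xorshift_state); /* 16 */
    printf("Typical L1 cacheline: 64 bytes\n");
    return 0;
}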
Just to keep things interesting, I ran some tests on my AMD Athlon II X4 640, using GCC-4.6, all 64-bit (kernel, libraries, generated code).
Consider these three functions:
Code:
#include <stdint.h>  /* For uint32_t type only; just define to unsigned int on Windows */

static uint32_t xorshift_state[4] = {
    123456789U,
    362436069U,
    521288629U,
     88675123U
};

uint32_t xorshift_dummy(void)
{
    return 0;
}

uint32_t xorshift_partial(void)
{
    const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
    return xorshift_state[0] ^= (temp >> 8U) ^ temp ^ (xorshift_state[0] >> 19U);
}

uint32_t xorshift_normal(void)
{
    const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
    xorshift_state[0] = xorshift_state[1];
    xorshift_state[1] = xorshift_state[2];
    xorshift_state[2] = xorshift_state[3];
    return xorshift_state[3] ^= (temp >> 8U) ^ temp ^ (xorshift_state[3] >> 19U);
}
GCC-4.6.3 compiles those (given -O3 -fomit-frame-pointer) to essentially the following:
Code:
        .text
        .p2align 4,,15
xorshift_dummy:
        xorl    %eax, %eax
        ret

xorshift_partial:
        movl    xorshift_state(%rip), %edx
        movl    %edx, %ecx
        movl    %edx, %eax
        sall    $11, %ecx
        shrl    $19, %eax
        xorl    %ecx, %edx
        xorl    %ecx, %eax
        shrl    $8, %edx
        xorl    %edx, %eax
        movl    %eax, xorshift_state(%rip)
        ret

xorshift_normal:
        movl    xorshift_state(%rip), %eax
        movl    xorshift_state+12(%rip), %ecx
        movl    %eax, %edx
        sall    $11, %edx
        xorl    %eax, %edx
        movl    xorshift_state+4(%rip), %eax
        movl    %eax, xorshift_state(%rip)
        movl    xorshift_state+8(%rip), %eax
        movl    %ecx, xorshift_state+8(%rip)
        movl    %eax, xorshift_state+4(%rip)
        movl    %ecx, %eax
        shrl    $19, %eax
        xorl    %ecx, %eax
        xorl    %edx, %eax
        shrl    $8, %edx
        xorl    %edx, %eax
        movl    %eax, xorshift_state+12(%rip)
        ret

        .data
        .align  16
xorshift_state:
        .long   123456789
        .long   362436069
        .long   521288629
        .long   88675123
When benchmarking code, I always compile the functions in a separate compilation unit, to make sure GCC does not optimize away any calls. A simple test harness that calls the above functions in a loop, sums the results (for verification), and measures the number of CPU cycles taken by the loop via RDTSC yields some surprising results:
Code:
xorshift_dummy():    6.0 cycles on average per call
xorshift_partial():  11.4 cycles on average per call
xorshift_normal():   11.1 cycles on average per call
All of the above figures include the function call, summation, and loop overheads, so xorshift_dummy() is the baseline to compare against.
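In case someone wants to replicate the numbers, here is a minimal sketch of the harness structure; the loop count is just a placeholder, and my actual harness repeats the measurement many times and averages:
Code:
#include <stdint.h>
#include <stdio.h>

/* The generators live in a separate compilation unit. */
extern uint32_t xorshift_dummy(void);
extern uint32_t xorshift_partial(void);
extern uint32_t xorshift_normal(void);

/* Same rdtsc reader as sketched earlier. */
static inline uint64_t cycles_now(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

int main(void)
{
    const uint64_t count = 100000000U; /* placeholder loop count */
    uint32_t sum = 0U;                 /* summed so the calls stay live */
    uint64_t i, started, finished;

    started = cycles_now();
    for (i = 0U; i < count; i++)
        sum += xorshift_normal();
    finished = cycles_now();

    printf("Sum: %u (for verification)\n", (unsigned)sum);
    printf("%.1f cycles on average per call\n",
           (double)(finished - started) / (double)count);
    return 0;
}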
The surprising bit: the version that accesses all four state words is FASTER than the one that does the same arithmetic operations on a single word. (The difference is small, within the range of random fluctuation for any single run, but it is perfectly reproducible when many calls are averaged, so I do believe it is real.)
I do believe this is yet another example of how important it is to keep data flowing at all levels to get maximum performance. In xorshift_partial(), every operation reads and writes the same word, forming a single serial dependency chain, so the function cannot take full advantage of the pipelines on x86-64 CPUs. By adding the moves (and especially by storing the result to a different word) -- note how the data rotates through the array in xorshift_normal(); that's why I included the assembly -- GCC can keep the pipelines much busier, ultimately yielding a slight decrease in run time even though more work is done.
It's all about the continuous flow, baby. (Sorry, couldn't resist. I did expect the xorshift_normal() to be a clock cycle or two slower, myself, and I always get a bit excited when I find out new stuff.)
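If you want to push the same idea further, here is a hypothetical variant I have NOT benchmarked: interleaving two independent generator states, so the CPU gets two independent dependency chains to overlap (the state_b seed values below are made up):
Code:
#include <stdint.h>

/* Hypothetical, unbenchmarked: two independent xorshift128 states
   advanced in lockstep. Each output depends only on its own state,
   so the CPU can overlap the two dependency chains. */
static uint32_t state_a[4] = { 123456789U, 362436069U, 521288629U,  88675123U };
static uint32_t state_b[4] = { 987654321U, 963852741U, 113377555U, 918273645U };

void xorshift_pair(uint32_t *const out_a, uint32_t *const out_b)
{
    const uint32_t temp_a = state_a[0] ^ (state_a[0] << 11U);
    const uint32_t temp_b = state_b[0] ^ (state_b[0] << 11U);
    state_a[0] = state_a[1]; state_a[1] = state_a[2]; state_a[2] = state_a[3];
    state_b[0] = state_b[1]; state_b[1] = state_b[2]; state_b[2] = state_b[3];
    *out_a = state_a[3] ^= (temp_a >> 8U) ^ temp_a ^ (state_a[3] >> 19U);
    *out_b = state_b[3] ^= (temp_b >> 8U) ^ temp_b ^ (state_b[3] >> 19U);
}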
Originally Posted by rcgldr
In my case, I'm just using the pseudo random number generators to generate test data for my sorts, and performance isn't an issue for the size of the arrays I'm generating (32MB in most cases).
Right, and like I said, radix sorting is a special case: the regular bit patterns typical of all LCGs make it run faster (due to CPU cache effects) than truly random bit patterns do. Comparison sorts aren't sensitive to such patterns at all -- except possibly if you have more than one full period of data, and even then the effect is usually negligible.
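To see the kind of regularity I mean, look at the lowest bit of a classic power-of-two-modulus LCG (the constants below are the familiar rand()-style ones, used here purely for illustration):
Code:
#include <stdint.h>
#include <stdio.h>

/* With a power-of-two modulus, an odd increment, and a multiplier
   congruent to 1 (mod 4), the lowest bit of an LCG has period 2:
   it simply alternates. A radix sort pass keyed on the low bits
   therefore sees a perfectly regular bucket distribution. */
int main(void)
{
    uint32_t x = 12345U;
    int i;

    for (i = 0; i < 16; i++) {
        x = x * 1103515245U + 12345U; /* classic rand()-style LCG */
        printf("%u", x & 1U);         /* prints 0101010101010101  */
    }
    printf("\n");
    return 0;
}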
Originally Posted by rcgldr
If I'm generating a test file, so far I've been using the output from a pi or e generator program, which is probably overkill in an attempt to get nearly true random numbers.
Yep, I'd consider that overkill, too. I used to use the Mersenne Twister a lot, but the properties and simplicity of Xorshift make it much nicer.
I just wish somebody had told me about Xorshift earlier, myself. That's the reason I keep bringing Xorshift up whenever PRNGs are discussed here.