Using the bswap instruction, I get about 10x performance improvement:
Code:
C:\tmp>gcc -O3 -DCSWAP bswap.c
bswap.c: In function 'swap_end':
bswap.c:9: warning: initialization from incompatible pointer type
bswap.c:10: warning: initialization from incompatible pointer type
C:\tmp>a
Time: 29 Speed: 353 MB/s
Sum: 576716800
C:\tmp>a
Time: 30 Speed: 341 MB/s
Sum: 576716800
C:\tmp>gcc -O3 bswap.c
C:\tmp>a
Time: 3 Speed: 3413 MB/s
Sum: 576716800
This is using gcc-mingw 3.4.5.
Change to the code is:
Code:
#ifdef CSWAP
int swap_end(int i) {
    int rtn;
    unsigned char *rtn_ptr = &rtn;
    unsigned char *i_ptr = &i;
    rtn_ptr[0] = i_ptr[3];
    rtn_ptr[1] = i_ptr[2];
    rtn_ptr[2] = i_ptr[1];
    rtn_ptr[3] = i_ptr[0];
    return rtn;
}
#else
int swap_end(int i)
{
    __asm__ __volatile__("bswap %0": "+r"(i));
    return i;
}
#endif
And I commented out the srand part, so that it gets the SAME random numbers and the sum is the same each run - that way I can say with some certainty that both versions are doing the same work.
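If you want a more direct check than comparing the sums, something along these lines could work. This is only a sketch and not part of bswap.c: swap_bytes, swap_shift and count_mismatches are names I made up for illustration, and the shift version just stands in for "the other implementation". It relies on the same trick as above: without an srand call, rand() produces the same sequence on every run.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Byte-by-byte swap, the same idea as the CSWAP version. */
static uint32_t swap_bytes(uint32_t v)
{
    unsigned char in[4], out[4];
    uint32_t r;

    memcpy(in, &v, 4);
    out[0] = in[3];
    out[1] = in[2];
    out[2] = in[1];
    out[3] = in[0];
    memcpy(&r, out, 4);
    return r;
}

/* Shift-and-mask swap, standing in for the second implementation. */
static uint32_t swap_shift(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Hypothetical helper: counts the inputs on which the two versions disagree. */
static unsigned count_mismatches(unsigned n)
{
    unsigned i, bad = 0;

    /* No srand() here, so rand() yields the same sequence every run. */
    for (i = 0; i < n; i++) {
        uint32_t v = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        if (swap_bytes(v) != swap_shift(v))
            bad++;
    }
    return bad;
}
```

Calling count_mismatches(1000) from main and checking that it returns 0 would be enough for a quick sanity test before timing anything.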
The actual code generated changes from:
Code:
movb -17(%ebp), %al
movb %al, -24(%ebp)
movb -18(%ebp), %al
movb %al, -23(%ebp)
movb (%ebx), %al
movb %al, -22(%ebp)
movb -20(%ebp), %al
movb %al, (%ecx)
new code:
Using a few more registers in the C version would probably allow a bit more overlap between the reads and writes, but it would probably still be slower than the bswap instruction. And it would certainly not help the rest of the code around it...
One slight problem is that different compilers will need different inline assembler syntax, so it won't be portable.
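One possible way around the portability problem is to hide the compiler-specific part behind a small wrapper. The following is a sketch under assumptions, not something I have timed: swap_end32 is a made-up name, __builtin_bswap32 only exists in GCC 4.3 and later (so gcc 3.4.5 would take the plain C fallback), and _byteswap_ulong is the MSVC intrinsic declared in stdlib.h.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <stdlib.h>   /* _byteswap_ulong */
#endif

/* Hypothetical wrapper: picks a byte-swap primitive per compiler. */
static uint32_t swap_end32(uint32_t v)
{
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
    /* GCC 4.3+ builtin; normally compiles to a single bswap on x86. */
    return __builtin_bswap32(v);
#elif defined(_MSC_VER)
    /* MSVC intrinsic. */
    return _byteswap_ulong(v);
#else
    /* Plain C fallback: works in registers instead of going through memory. */
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
#endif
}
```

The fallback branch is ordinary C, so it should build anywhere; whether a given compiler turns it into a bswap is something you would have to check in the generated assembler.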
--
Mats