I am doing a little experiment with cache/memory behaviour and I think I need someone to explain what the results mean.
I have this code
Code:
#include <cstdio>
#include <ctime>

const int inc = 4096;

int main() {
    int n = 2048LL * 1024 * 1024 / sizeof(int); // number of ints in 2GB
    int *g = new int[n];
    int sum = 0; // must be initialized, otherwise reading it is undefined behaviour
    clock_t start = clock();
    for (int i = 0; i < n; i += inc) {
        sum += g[i];
        __builtin_prefetch(g + inc * 10); // have also tried 100
    }
    clock_t end = clock();
    printf("%d\n", sum); // so sum won't be optimized out
    printf("Accesses per ms: %.3lf\n", (n / inc) / ((end - start) / 1000.0));
}
Basically, it allocates 2GB of memory and adds up every inc-th element in a loop.
At the end, it prints the average number of accesses made per ms.
I ran it with different values of inc and got this:
inc - accesses/ms
1 - 181375
2 - 130308
4 - 90079
8 - 51622
16 - 28679
32 - 14339
64 - 7294
128 - 3779
256 - 1733
512 - 919
1024 - 444
2048 - 468
4096 - 504
It seems to follow "y = 181375 / x" until inc = 1024. Can someone please explain why?
That is not what I was expecting. I expected the value to be reciprocal only until one increment covers a whole cache line, after which every access would be a cache miss. The data seems to suggest that my CPU has a cache-line size of 4KB (1024 * sizeof(int)). How can that be true? I thought x86 CPUs have ~64-byte cache lines.
Thanks
*edit*
I am also trying to use GCC's __builtin_prefetch to improve the time (edited in red above), but that apparently isn't working (no improvement in accesses/ms)... any idea why?
I am compiling it with
Code:
g++ -march=native -O3 prefetch_test.cpp
GCC 4.3.2, 64-bit Linux, Core 2 Duo
*/edit*
*edit2*
time unit wrong
*/edit2*