cache behaviour

**cyberfish** · 12-21-2008

I am doing a little experiment with cache/memory behaviour and I think I need someone to explain what the results mean.

I have this code

Code:

#include <cstdio>
#include <ctime>

const int inc = 4096;

int main() {

        int n = 2048LL*1024*1024 / sizeof(int);

        int *g = new int[n];
        int sum;

        unsigned int start = clock();

        for (int i = 0; i < n; i += inc) {
                sum += g[i];
                __builtin_prefetch(g+inc*10); //have also tried 100
        }

        unsigned int end = clock();

        printf("%d\n", sum); //so sum won't be optimized out
        printf("Accesses per ms: %.3lf\n", (n / inc) / ((end - start) / 1000.0));
}

Basically, it allocates 2GB of memory, and add up every "inc"th element in a loop.

At the end, it prints the number of accesses made per ms on avg.

I ran it with different values of inc, and got this

inc - accesses/ms
1 - 181375
2 - 130308
4 - 90079
8 - 51622
16 - 28679
32 - 14339
64 - 7294
128 - 3779
256 - 1733
512 - 919
1024 - 444
2048 - 468
4096 - 504

It seems to be following "y = 181375 / x", until 1024, can someone please explain why?

That is not what I was expecting. I was expecting the value to only be reciprocal until one increment covers the whole cache line (so every access will be a cache miss)? It seems to be suggesting that my CPU has a cache-line size of 4KB (1024*sizeof(int)), how can that be true? I thought x86 have ~64 bytes cache-lines?

Thanks

*edit*
And I am trying to use gcc's __builtin_prefetch to improve the time (editted in red above), but that's apparently not working (no improvement in access/ms)... any idea why?

I am compiling it with

Code:

g++ -march=native -O3 prefetch_test.cpp

GCC 4.3.2, 64-bit Linux, Core 2 Duo
*/edit*

*edit2*
time unit wrong
*/edit2*

**zigga15** · 12-21-2008

Have you tired a different language? I know it is hard to test against different languages, but C might be too high level in such an experiment. There would be more compiler/pre-compiler driven instructions that would taint the results to a degree.

Start by trying different compilers and see if I am correct. Also look into LLVM, it decomposes programs to readable compiler instructions. Keep me posted, it looks interesting

~ Daniel

**Codeplug** · 12-21-2008

I would keep the number of iterations ("accesses") constant for all sizes of inc.

You can use cpuz to see what kind/size of cache your processor has: http://www.cpuid.com/cpuz.php

gg

**cyberfish** · 12-22-2008

Unfortunately C is the lowest level language I know, so it's not possible to do it in a lower-level language.

Why LLVM? For a short program like this gcc assembly output should be okay, too, I would post it if it helps.

I would keep the number of iterations ("accesses") constant for all sizes of inc.

You can use cpuz to see what kind/size of cache your processor has: http://www.cpuid.com/cpuz.php

Well, it's either keeping the number of iterations constant or keeping memory usage constant. I didn't think either would matter, so I arbitrarily chose to keep memory usage constant. Is that not the case?

Yeah, my CPU has 32KB+32KB L1 and 2MB L2, but they don't ring any bells with the results I am getting here.

**cyberfish** · 12-24-2008

3-days bump.

haven't figured it out yet.

**abachler** · 12-24-2008

Most hardware benchmarking is done in assembly, so you can direcly control the instruction stream. Also consider that there are multiple cahces in modern processors. Level 1 (L1), Level 2 (L2) and sometimes Level 3 (L3), there is also the prefetch queue which is a bit like L0 but not really.

**matsp** · 12-24-2008

Your loop is too tight for prefetch to make any difference - the processor catches up with the prefetch instructions, and from then on, it's simply waiting for the data fetch either way. You need to prefetc something that will take a few tens of nanoseconds to get to (on an Athlon64, average memory access takes about 45 ns - which is about 90+ instructions at 2GHz)

By the way, in the smaller cases, you would definitely benefit from unrolling the loop and only prefetching every cache-line (64 or 128 bytes), as the prefetch instruction will still take up a clock-cycle or two, even if it's prefetching what has already been fetched.

For the larger steps, my guess would be that the step doesn't matter at all - you are simply measuring the time it takes to make a single memory access [which may involve 2-3 memory reads because of the page-table reads involved that are also needed to be done].

--
Mats

**cyberfish** · 12-24-2008

Thanks.

For loop unrolling, I am not sure I got what you mean. Prefetching every cache-line?

As for how many steps ahead to prefetch for, I have tried 10, 50, 100, 200, 500, 1000, and they don't make any difference at all.

I know that for big steps, the time should be constant, but what I don't understand, is how it ISN'T constant until 1 step = 4KB. I thought every memory access would be a cache-miss (hence should take the same amount of time) long before that.

**matsp** · 12-24-2008

Originally Posted by cyberfish

Thanks.

For loop unrolling, I am relying on -O3 to do it, which may or may not be a good idea.

As for how many steps ahead to prefetch for, I have tried 10, 50, 100, 200, 500, 1000, and they don't make any difference at all.

I know that for big steps, the time should be constant, but what I don't understand, is how it ISN'T constant until 1 step = 4KB. I thought every memory access would be a cache-miss (hence should take the same amount of time) long before that.

It probably isn't exactly constant at 4KB either, but 4KB and above, you are not only accessing once per cache-line, but once per MMU page (4KB in size), which means that you get the full penalty of at leas the lowest of the CPU page-table being read. I guess that if you increase your step up to 2MB or 1GB, you will get even bigger steps - but that gets a bit hard to try out in a small memory machine - you need 16 or 32GB of ram for the latter [and a 64-bit OS].

Also, 4KB is most often the DRAM page-size - which means that above that, you get a full page-close, RAS and CAS set of cycles to the DRAM. [This is a completely different "page-size" than that of virtual memory - it just happens to be the same size].

-O3 will probably unroll the loop a bit, but to make good use of the prefetch, you should really manually unroll the loop and make sure you put the prefetch instruction at the beginning of a block [assuming the inc is small number - for large values, you will no matter what, end up catching up with the prefetch operations and still wait for each memory access to be completed].

--
Mats

**cyberfish** · 12-24-2008

Ah I see. So I guess it has nothing to do with the CPU caches, since the effect of memory page "misses" (?) is much greater, and masks the CPU cache-miss penalty.

I only have 4GB RAM so I guess I can't do the bigger tests. I couldn't do the 2MB test either (timer not precise enough), I will switch to gettimeofday and try it out.

assuming the inc is small number - for large values, you will no matter what, end up catching up with the prefetch operations and still wait for each memory access to be completed

I am not sure I got that. If I prefetch, for instance, the data at 100 steps after the current step, why would the loop catch up?

**matsp** · 12-25-2008

Originally Posted by cyberfish

Ah I see. So I guess it has nothing to do with the CPU caches, since the effect of memory page "misses" (?) is much greater, and masks the CPU cache-miss penalty.

I only have 4GB RAM so I guess I can't do the bigger tests. I couldn't do the 2MB test either (timer not precise enough), I will switch to gettimeofday and try it out.

I am not sure I got that. If I prefetch, for instance, the data at 100 steps after the current step, why would the loop catch up?

Because to run 100 iterations of the loop will take much less than fetching each step - you need to prefetch enough ahead that you have enough instructions between the prefetch and the next item being fetched [or at least a significant proportion of that time], otherwise the prefetch will be meaningless.

--
Mats

Thread: cache behaviour

Thread Tools

Search Thread

Display

cache behaviour

Similar Threads

need help with cache simulator!

ideas requested for a CACHE.

Resource manager tree

added start menu crashes game

cache miss/hit