cache behaviour

This is a discussion on cache behaviour within the C++ Programming forums, part of the General Programming Boards category.

  1. #1
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,145

    cache behaviour

    I am doing a little experiment with cache/memory behaviour and I think I need someone to explain what the results mean.

    I have this code
    Code:
    #include <cstdio>
    #include <ctime>
    
    const int inc = 4096;
    
    int main() {
    
            int n = 2048LL*1024*1024 / sizeof(int);
    
            int *g = new int[n];
            int sum = 0; //must be initialized, or the printed sum is garbage
    
            clock_t start = clock();
    
            for (int i = 0; i < n; i += inc) {
                    sum += g[i];
                    __builtin_prefetch(g + i + inc*10); //prefetch relative to i; have also tried 100
            }
    
            clock_t end = clock();
    
            printf("%d\n", sum); //so sum won't be optimized out
            printf("Accesses per ms: %.3f\n",
                   (n / inc) / ((end - start) * 1000.0 / CLOCKS_PER_SEC));
    
            delete[] g;
    }
    Basically, it allocates 2GB of memory and adds up every "inc"-th element in a loop.

    At the end, it prints the average number of accesses made per ms.

    I ran it with different values of inc, and got this

    inc - accesses/ms
    1 - 181375
    2 - 130308
    4 - 90079
    8 - 51622
    16 - 28679
    32 - 14339
    64 - 7294
    128 - 3779
    256 - 1733
    512 - 919
    1024 - 444
    2048 - 468
    4096 - 504

    It seems to follow "y = 181375 / x" until inc = 1024. Can someone please explain why?

    That is not what I was expecting. I expected the value to be reciprocal only until one increment covers a whole cache line (at which point every access becomes a cache miss, so the time per access should stop changing). The results seem to suggest that my CPU has a cache-line size of 4KB (1024*sizeof(int)); how can that be true? I thought x86 has ~64-byte cache lines.

    Thanks

    *edit*
    And I am trying to use gcc's __builtin_prefetch to improve the time (edited in red above), but that apparently isn't working (no improvement in accesses/ms)... any idea why?

    I am compiling it with
    Code:
    g++ -march=native -O3 prefetch_test.cpp
    GCC 4.3.2, 64-bit Linux, Core 2 Duo
    */edit*

    *edit2*
    time unit wrong
    */edit2*
    Last edited by cyberfish; 12-21-2008 at 09:21 PM.

  2. #2
    Registered User
    Join Date
    Dec 2008
    Posts
    9
    Have you tried a different language? I know it is hard to test across languages, but C might be too high-level for such an experiment. There would be more compiler/pre-compiler driven instructions that would taint the results to a degree.

    Start by trying different compilers and see if I am correct. Also look into LLVM; it decomposes programs into readable compiler instructions. Keep me posted, it looks interesting.

    ~ Daniel

  3. #3
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,607
    I would keep the number of iterations ("accesses") constant for all sizes of inc.

    You can use cpuz to see what kind/size of cache your processor has: http://www.cpuid.com/cpuz.php

    gg

  4. #4
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,145
    Unfortunately C is the lowest level language I know, so it's not possible to do it in a lower-level language.

    Why LLVM? For a short program like this, gcc's assembly output should be okay too; I can post it if that helps.

    I would keep the number of iterations ("accesses") constant for all sizes of inc.

    You can use cpuz to see what kind/size of cache your processor has: http://www.cpuid.com/cpuz.php
    Well, it's either keeping the number of iterations constant or keeping memory usage constant. I didn't think either would matter, so I arbitrarily chose to keep memory usage constant. Is that not the case?

    Yeah, my CPU has 32KB+32KB L1 and 2MB L2, but they don't ring any bells with the results I am getting here.

  5. #5
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,145
    3-days bump.

    haven't figured it out yet.

  6. #6
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,189
    Most hardware benchmarking is done in assembly, so you can directly control the instruction stream. Also consider that there are multiple caches in modern processors: Level 1 (L1), Level 2 (L2), and sometimes Level 3 (L3); there is also the prefetch queue, which is a bit like L0 but not really.
    Until you can build a working general purpose reprogrammable computer out of basic components from radio shack, you are not fit to call yourself a programmer in my presence. This is cwhizard, signing off.

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Your loop is too tight for prefetch to make any difference - the processor catches up with the prefetch instructions, and from then on it's simply waiting for the data fetch either way. You need to prefetch something that will take a few tens of nanoseconds to get to (on an Athlon64, an average memory access takes about 45 ns, which is about 90+ instructions at 2GHz).

    By the way, in the smaller cases you would definitely benefit from unrolling the loop and prefetching only once per cache-line (64 or 128 bytes), as the prefetch instruction still takes up a clock-cycle or two even when it's prefetching what has already been fetched.

    For the larger steps, my guess would be that the step doesn't matter at all - you are simply measuring the time it takes to make a single memory access [which may involve 2-3 memory reads, because page-table reads also need to be done].
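    The unroll-and-prefetch-per-line idea might be sketched like this, assuming a 64-byte line (16 ints) and inc == 1; the lookahead of 8 lines is a guess to be tuned per CPU, and sum_unrolled is a hypothetical helper name:

```cpp
#include <cstddef>

// Sketch: one prefetch per 64-byte cache line, with the 16 adds inside that
// line unrolled into an inner loop. GCC-specific (__builtin_prefetch).
long sum_unrolled(const int *g, size_t n) {
    const size_t kIntsPerLine = 64 / sizeof(int);  // 16 ints per 64-byte line
    const size_t kPrefetchAhead = 8;               // lines of lookahead; tune per CPU
    long sum = 0;
    size_t i = 0;
    for (; i + kIntsPerLine <= n; i += kIntsPerLine) {
        // one prefetch per line, well ahead of where we are reading
        __builtin_prefetch(g + i + kPrefetchAhead * kIntsPerLine);
        for (size_t j = 0; j < kIntsPerLine; ++j)
            sum += g[i + j];
    }
    for (; i < n; ++i)  // tail elements that don't fill a whole line
        sum += g[i];
    return sum;
}
```

    Prefetching past the end of the array is harmless here - prefetch instructions don't fault on bad addresses, they are simply dropped.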

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #8
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,145
    Thanks.

    For loop unrolling, I am not sure I got what you mean. Prefetching every cache-line?

    As for how many steps ahead to prefetch for, I have tried 10, 50, 100, 200, 500, 1000, and they don't make any difference at all.

    I know that for big steps the time should be constant, but what I don't understand is why it ISN'T constant until 1 step = 4KB. I thought every memory access would be a cache miss (and hence take the same amount of time) long before that.
    Last edited by cyberfish; 12-24-2008 at 05:53 PM.

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by cyberfish View Post
    Thanks.

    For loop unrolling, I am relying on -O3 to do it, which may or may not be a good idea.

    As for how many steps ahead to prefetch for, I have tried 10, 50, 100, 200, 500, 1000, and they don't make any difference at all.

    I know that for big steps the time should be constant, but what I don't understand is why it ISN'T constant until 1 step = 4KB. I thought every memory access would be a cache miss (and hence take the same amount of time) long before that.
    It probably isn't exactly constant at 4KB either, but at 4KB and above you are not only accessing once per cache-line but once per MMU page (4KB in size), which means you get the full penalty of at least the lowest level of the CPU's page-table walk. I guess that if you increase your step to 2MB or 1GB you will see even bigger jumps - but that gets a bit hard to try out on a machine with little memory - you need 16 or 32GB of RAM for the latter [and a 64-bit OS].

    Also, 4KB is most often the DRAM page-size - which means that above that, you get a full page-close, RAS, and CAS set of cycles to the DRAM. [This is a completely different "page-size" from that of virtual memory - it just happens to be the same size.]

    -O3 will probably unroll the loop a bit, but to make good use of the prefetch you should really unroll the loop manually and put the prefetch instruction at the beginning of each block [assuming inc is a small number - for large values you will, no matter what, end up catching up with the prefetch operations and still wait for each memory access to complete].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #10
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,145
    Ah I see. So I guess it has nothing to do with the CPU caches, since the effect of memory page "misses" (?) is much greater, and masks the CPU cache-miss penalty.

    I only have 4GB of RAM, so I guess I can't do the bigger tests. I couldn't do the 2MB test either (timer not precise enough); I will switch to gettimeofday and try it out.

    assuming the inc is small number - for large values, you will no matter what, end up catching up with the prefetch operations and still wait for each memory access to be completed
    I am not sure I got that. If I prefetch, for instance, the data at 100 steps after the current step, why would the loop catch up?

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by cyberfish View Post
    Ah I see. So I guess it has nothing to do with the CPU caches, since the effect of memory page "misses" (?) is much greater, and masks the CPU cache-miss penalty.

    I only have 4GB RAM so I guess I can't do the bigger tests. I couldn't do the 2MB test either (timer not precise enough), I will switch to gettimeofday and try it out.


    I am not sure I got that. If I prefetch, for instance, the data at 100 steps after the current step, why would the loop catch up?
    Because running 100 iterations of the loop takes much less time than fetching each step - you need to prefetch far enough ahead that there are enough instructions between the prefetch and the point where the item is actually used [or at least a significant proportion of that time], otherwise the prefetch will be meaningless.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.


