Thread: how expensive are memory, L1, L2 accesses?

  1. #1
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229

    how expensive are memory, L1, L2 accesses?

    Hi,
    I was wondering, how expensive (CPU cycles) are memory, L2 cache, L1 cache accesses on a modern x86 computer? I cannot seem to find an answer by googling. I am asking this because in some programs written by others, I am seeing something like:
    Code:
    Uint64 Bits[64];
    for (unsigned int i = 0; i != 64; ++i) {
    	Bits[i] = 1LL << i;
    }
    and later on, the programmer uses:
    Code:
    Bits[bitnumber];
    rather than
    Code:
    1LL << bitnumber;
    Is this done purely for readability? Because I would assume that the memory load operation takes more cycles than the bitwise shift, since, if I remembered right, bitwise shifts are done in 1 cycle?

    Thank you very much

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    L1 cache hits are "free" - they can be resolved without delay to the calculation.

    L2 cache hits take about 10-15 cycles, give or take a bit depending on the processor architecture.

    Miss in both L1 and L2 means a memory read, which is on average 40+ nanoseconds, which translates to about 80+ cycles at 2GHz.

    On a 64-bit machine, a 64-bit shift can be done in one cycle with a constant shift count, but on a 32-bit machine it's a SHLD operation, so it would take about 4 cycles. A non-constant (register) shift count takes 4 cycles (6 cycles for a SHLD). [Constant: 1LL << 16, non-constant 1LL << n].

    It's probably not a lot in it either way, but I suspect on older architectures, the memory array [which fits in ANY cache of a modern processor] is most likely faster than the shift variety.

    --
    Mats

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    It seems to me like one of those random "improvements" for some architecture long ago (say 8 or 16 bit machines) when memory always ran at the same speed as the processor, and processors could only rotate 1 bit per tick.

    Now, it just looks anachronistic and a case of "premature optimisation disease".

    > Miss in both L1 and L2 means a memory read,
    Hit a page fault, and I'll see you in the spring.
    A PF might take several mS to resolve, so you don't need that many to completely screw your average case.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Salem View Post
    It seems to me like one of those random "improvements" for some architecture long ago (say 8 or 16 bit machines) when memory always ran at the same speed as the processor, and processors could only rotate 1 bit per tick.

    Now, it just looks anachronistic and a case of "premature optimisation disease".

    > Miss in both L1 and L2 means a memory read,
    Hit a page fault, and I'll see you in the spring.
    A PF might take several mS to resolve, so you don't need that many to completely screw your average case.
    Sure - but if we assume that the following is true:
    1. There's enough memory to cope with teh application as such.
    2. The initialization of the bits array is done shortly before it's being used.
    3. There's no other (big) applications running that swipe our memory from under our feet.

    Then the access time of memory is in the order or tens of nanoseconds. And at a 64 *8 byte array that is frequently accessed shouldn't end up paged out unless the system is severely low on memory - and if the system is that low on memory, everything will run VERY slow.

    Pagefaults happen for three reasons:
    1. Page-in is needed to complete memory access - this happens when either when the application first starts (first access of a particular page) or the OS has decided to remove the apps memory to use the memory for someone else - the next time this app needs the memory there's a pagefault. Once the page is actually in memory, it shouldn't normally be thrown out unless the machine is running low on memory. But when it happens, we're looking at (tens of) milliseconds to wait for the data to be fetched from hard-disk.
    2. Invalid memory access - this causes an application exception, and shouldn't happen at all in an application (in OS drivers and system call code it must be "allowed" to happen, as applications
    3. OS internal reasons, e.g. statistics, copy-on-write or "if no-one writes to this page, we'll throw it out next" type things. These are usually reasonably fast, but of course more overhead than just accessing the page.

    Whilst we are on the subject of page-faults, another aspect is of course that the page-table read overhead will affect applications using large amounts of memory, particularly if the memory is accessed "at random". This is, at worst, one read per page-table level (2 to 4) per 4KB page. This is particularly noticable for big databases and for example hash-tables [since the hash-key if it's good is "random", two accesses of sequential data will lead to completley different places in the hash-table]. There is a cache (TLB) that keeps track of recently accessed pages, which helps for frequent accessses, and modern processors will speculatively load page-table entries for sequential accesses that "look like it's going to hit the next page".

    --
    Mats

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Well, I started thinking about it, and it turns out that there's quite a difference, at least with Microsoft's compiler (I don't have a GCC on my machine - I have Linux box sitting around waiting for me to connect my KVM switch :-().

    But the result is about 10x worse with shifts than with the array solution. The reason for that is mainly that the 64-bit shift is done as a function call. Of course, it doesn't exaclty help the shift's case that VC 7 unrolls the loop for the array case, and doesn't for the shift case (probably due to the shift).

    The reason it's hard to do inline is that a 64-bit shift is actually not as simple as a single SHLD instructon - it is for values where you shift by less than 31 bits (on a 32-bit processor).

    I tried writing some assembler to do a specific 64-bit integer shift, but it didn't work out right - too late to debug assembler code now - I may return to it tomorrow if I can find some time.

    --
    Mats

  6. #6
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Thank you for your explanation, I think I get it now. One more question, so is the bitwise shift faster if the program is ran on a 64-bit system?

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    All 64-bit operations (that means your "and" or "or" operations on the 64-bit number, as well as shifts) will be faster on a 64-bit OS/Application combination - this is simply because it's ONE single operation to do 64 bit ops.

    Nex thing you'll ask is "how much", and I'll have to come back on that (I'll just hack my code to use Uint32 instead of Uint64, but I'm doing something else right now). Should get done this evening,

    --
    Mats

  8. #8
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by cyberfish View Post
    Hi,
    I was wondering, how expensive (CPU cycles) are memory, L2 cache, L1 cache accesses on a modern x86 computer? I cannot seem to find an answer by googling
    There's no single answer. Even if you restrict your inquiry to x86, the cache performance depends on the particular model and production of chip, as well as the actual amount of cache.

    and later on, the programmer uses:
    Code:
    Bits[bitnumber];
    rather than
    Code:
    1LL << bitnumber;
    Is this done purely for readability?
    Impossible to guess without a LOT more context. It may be very well justified, or it could be stupid. I doubt it is an optimization, however. It probably makes some other part of the code easier to work with (or again, maybe the programmer is just clueless)

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by brewbuck View Post
    Impossible to guess without a LOT more context. It may be very well justified, or it could be stupid. I doubt it is an optimization, however. It probably makes some other part of the code easier to work with (or again, maybe the programmer is just clueless)

    Whilst I agree, see my comments above on the subject of 64-bit shifts in Microsofts compiler (and I actually don't expect GCC to be different).

    On a 64-bit machine (with 64-bit OS), obviously, 64-bit shifts would be relatively cheap, and there's not much difference between the two solutions then.

    --
    Mats

  10. #10
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Ok, so I've now tested both 64 and 32-bit values, and whilst the 32-bit version is overall faster, there's still a big difference in speed between the "shift a constant one every time" and the "array of bits pre-shifted" - here's my test-code (not supposed to be pretty!):
    Code:
    // test.cpp : Defines the entry point for the DLL application.
    //
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    typedef unsigned __int64 u64;
    typedef unsigned int u32;
    
    u64 bits[64];
    u32 bits32[32];
    
    int c1;
    clock_t t;
    
    inline void a(u64 aa, int n) 
    {
    	if ((aa & (1I64 << n)) && !(aa & (1I64 << (n+1)))) c1++;
    }
    
    inline void b(u64 aa, int n)
    {
    	if (aa & bits[n] && !(aa & bits[n + 1])) c1++;
    }
    
    
    inline void d(unsigned aa, int n)
    {
    	if ((aa & (1 << n)) && !(aa & (1 << (n+1)))) c1++;
    }
    
    
    inline void e(unsigned aa, int n)
    {
    	if ((aa & bits32[n]) && !(aa & bits32[n+1])) c1++;
    }
    
    void init(void)
    {
        c1 = 0;
    	t = clock();
    }
    
    void report(char *name)
    {
    	t = clock() - t;
    	printf("Function %s\n", name);
    	printf("c1 = %d\n", c1);
    	printf("t = %f\n", (double)t / CLOCKS_PER_SEC);
    }
    
    int main(void)
    {
    	int i, j, n;
    
    	for(i = 0; i < 64; i++)
    		bits[i] = 1I64 << i;
    
    	for(i = 0; i < 32; i++)
    		bits32[i] = 1 << i;
    
    	init();
    	for(j = 0; j < 1000; j++)
    		for(n = 0; n < 60; n++) 
    			for(i = 0; i < 10000; i++)
    				a(j, n);
    	report("a");
    
    	init();
    	for(j = 0; j < 1000; j++) 
    		for(n = 0; n < 60; n++) 
    			for(i = 0; i < 10000; i++) 
    				b(j, n);
    	report("b");
    
    	init();
    	for(j = 0; j < 1000; j++) 
    		for(n = 0; n < 30; n++) 
    			for(i = 0; i < 20000; i++) 
    				d(j, n);
    	report("d");
    
    	init();
    	for(j = 0; j < 1000; j++) 
    		for(n = 0; n < 30; n++) 
    			for(i = 0; i < 20000; i++) 
    				e(j, n);
    	report("e");	
    
    	return 0;
    }
    Results:
    Code:
    Function a
    c1 = 27680000
    t = 2.203000
    Function b
    c1 = 27680000
    t = 0.265000
    Function d
    c1 = 55360000
    t = 1.703000
    Function e
    c1 = 55360000
    t = 0.172000
    --
    Mats

  11. #11
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Have you checked in the assembly that all functions actually get inlined?
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  12. #12
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee View Post
    Have you checked in the assembly that all functions actually get inlined?
    Code:
    $L1093:
    
    ; 65   : 		for(n = 0; n < 60; n++) 
    ; 66   : 			for(i = 0; i < 10000; i++)
    ; 67   : 				a(j, n);
    
    	mov	eax, DWORD PTR _j$[esp+32]
    	cdq
    	xor	esi, esi
    	mov	ebp, eax
    	mov	DWORD PTR $T1180[esp+36], edx
    $L1096:
    	mov	eax, 1
    	xor	edx, edx
    	mov	ecx, esi
    	call	__allshl
    	mov	edi, eax
    	mov	eax, DWORD PTR $T1180[esp+36]
    	mov	ebx, edx
    	and	edi, ebp
    	and	ebx, eax
    	mov	DWORD PTR tv473[esp+32], 10000		; 00002710H
    $L1099:
    	mov	ecx, edi
    	or	ecx, ebx
    	je	SHORT $L1100
    	lea	ecx, DWORD PTR [esi+1]
    	mov	eax, 1
    	xor	edx, edx
    	call	__allshl
    	mov	ecx, DWORD PTR $T1180[esp+36]
    	and	eax, ebp
    	and	edx, ecx
    	or	eax, edx
    	jne	SHORT $L1100
    	inc	DWORD PTR ?c1@@3HA			; c1
    $L1100:
    	dec	DWORD PTR tv473[esp+32]
    	jne	SHORT $L1099
    	inc	esi
    	cmp	esi, 60					; 0000003cH
    	jl	SHORT $L1096
    	mov	eax, DWORD PTR _j$[esp+32]
    	inc	eax
    	cmp	eax, 1000				; 000003e8H
    	mov	DWORD PTR _j$[esp+32], eax
    	jl	SHORT $L1093
    
    ; 68   : 	report("a");
    
    	push	OFFSET FLAT:??_C@_01MCMALHOG@a?$AA@
    	call	?report@@YAXPAD@Z			; report
    	add	esp, 4
    This is the code for "calling" a(), as you can see by the included C comments. As discussed before, the long shift is not inlined - if you know of a way to make that happen, let me know. [One problem with "inlining" the 64-bit shift is that it's actually an external assembler function, so the compiler doesn't natively know how to do a 64-bit shift for some reason]. -- and before anyone says so, yes, I've tried "Max Optimization" as well as "Optimize for speed" - which are the two options most likely to make it work.

    I also tried to write my own inline version of "shift64", but that was slower - because of the parameters passed to the inline function get stored in a temporary memory location and then retrieved from the memory location in the assembly code - short of writing the WHOLE function in assembler, there's nothing you can do about that - I've tried before!

    [If the functions are NOT inlined, the time is roughly 8 seconds per function, and the calling overhead is much larger than the benefit/loss of any particular method of checking the bits].

    --
    Mats

  13. #13
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    thank you for your explanation

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Help with insert/delete binary search tree
    By Nazgulled in forum C Programming
    Replies: 39
    Last Post: 03-25-2009, 04:24 PM
  2. Problems with shared memory shmdt() shmctl()
    By Jcarroll in forum C Programming
    Replies: 1
    Last Post: 03-17-2009, 10:48 PM
  3. Suggestions on this C style code
    By Joelito in forum C Programming
    Replies: 11
    Last Post: 06-07-2007, 03:22 AM
  4. Is it necessary to write a specific memory manager ?
    By Morglum in forum Game Programming
    Replies: 18
    Last Post: 07-01-2002, 01:41 PM
  5. What's the best memory (RAM) type?
    By Unregistered in forum A Brief History of Cprogramming.com
    Replies: 17
    Last Post: 12-15-2001, 12:37 AM