Intel 3770K ~35% time difference in tight loop depending on location

Intel 3770K ~35% time difference in tight loop depending on location - other cpus?

The code below is X86 64 bit assembly code to calculate the sum of Fib(0) to Fib(93). It includes a conditional to use either a tight loop or an unfolded loop with a computed jump to enter the unfolded loop similar to C switch / case. Originally, it was meant to see the performance improvement of unfolding a loop, but during the testing, it turns out that the tight loop code is very sensitive to where the code is located. The table with multiples of 37 % 93 is used mostly to vary the jumps on the unfolded code.

The code is run 2^20 = 1048576 times for a bench mark, so this is actually a nested loop where the inner loop runs from 0 to 93 times, for an average of ~46 (since 0 and 1 inputs skip the inner loop). Even if the outer and inner loop have a cache conflict, since the inner loop is run an average of ~46 times, it shouldn't have as much impact on the timing that I'm seeing.

The times are noted below, but for the tight loop, depending on code location, the tight loop runs as fast as 1.465 seconds, or as slow as 2.000 seconds. The unfolded loop only has slight variation, 1.042 to 1.048 seconds.

I removed the printf statements, and switched to using a debugger to check results, and the performance was unaffected (since it's assembly code). I'm using Visual Studio 2015 on Windows 7 Pro 64 bit to do the builds and do the runs. Switching to AT&T syntax for Linux / Posix could take a while, unless there's something MASM syntax compatible for Linux / Posix.

I'm wondering if other X86 cpu's have similar performance differences. I'm posting this as a follow up in another thread that included performance of a 4 way merge sort that uses gotos.

Sometimes "goto" is useful

Code:

includelib msvcrtd includelib oldnames .data ; multiples of 37 mod 93 + 93 at the end a dq 0,37,74,18,55,92,36,73,17,54 dq 91,35,72,16,53,90,34,71,15,52 dq 89,33,70,14,51,88,32,69,13,50 dq 87,31,68,12,49,86,30,67,11,48 dq 85,29,66,10,47,84,28,65, 9,46 dq 83,27,64, 8,45,82,26,63, 7,44 dq 81,25,62, 6,43,80,24,61, 5,42 dq 79,23,60, 4,41,78,22,59, 3,40 dq 77,21,58, 2,39,76,20,57, 1,38 dq 75,19,56,93 .data? .code ; parameters rcx,rdx,r8,r9 ; not saved rax,rcx,rdx,r8,r9,r10,r11 ; code starts on 16 byte boundary main proc push r15 push r14 push r13 push r12 push rbp mov rbp,rsp and rsp,0fffffffffffffff0h sub rsp,64 mov r15,offset a xor r14,r14 mov r11,0100000h ; nop padding effect on loop version (with 0 padding in padx below) ; 0 puts main2 on odd 16 byte boundary clk = 0131876622h => 1.465 seconds ; 9 puts main1 on odd 16 byte boundary clk = 01573FE951h => 1.645 seconds rept 0 nop endm rdtsc mov r12,rdx shl r12,32 or r12,rax main0: xor r10,r10 main1: mov rcx,[r10+r15] call fib main2: add r14,rax add r10,8 cmp r10,8*94 jne main1 dec r11 jnz main0 rdtsc mov r13,rdx shl r13,32 or r13,rax sub r13,r12 mov rdx,r14 xor rax,rax mov rsp,rbp pop rbp pop r12 pop r13 pop r14 pop r15 ret main endp align 16 padx proc ; nop padding effect on loop version with 0 padding above ; 0 puts fib on odd 16 byte boundary clk = 0131876622h => 1.465 seconds ; 16 puts fib on even 16 byte boundary clk = 01A13C8CB8h => 2.000 seconds ; nop padding effect on computed jump version with 9 padding above ; 0 puts fib on odd 16 byte boundary clk = 00D979792Dh => 1.042 seconds ; 16 puts fib on even 16 byte boundary clk = 00DA93E04Dh => 1.048 seconds rept 0 nop endm padx endp if 0 ;0 = loop version, 1 = computed jump version fib proc ;rcx == n mov r8,rcx ;set jmp adr mov r9,offset fib0+279 lea r8,[r8+r8*2] neg r8 add r8,r9 mov rax,rcx ;set rax,rdx mov rdx,1 and rax,rdx sub rdx,rax jmp r8 fib0: ; assumes add xxx,xxx takes 3 bytes rept 46 add rax,rdx add rdx,rax endm add rax,rdx ret fib endp else fib proc ;rcx == n mov rax,rcx ;br if < 2 cmp rax,2 jb fib1 mov rdx,1 ;set rax, rdx and rax,rdx sub rdx,rax shr rcx,1 fib0: add rdx,rax add rax,rdx dec rcx jnz fib0 fib1: ret fib endp endif end

The CPU caches are aligned, meaning that if the loop is big enough, it's likely that some of it will be outside the current cache-line, making it slower. I don't think that's what's going on here though. More likely, the operating system switches the thread around (thus introducing randomness to the branch predictors).

Quote:

Originally Posted by GReaper

The CPU caches are aligned, meaning that if the loop is big enough, it's likely that some of it will be outside the current cache-line, making it slower. I don't think that's what's going on here though. More likely, the operating system switches the thread around (thus introducing randomness to the branch predictors).

Possibly, but Windows non-I/O thread switches only occur on tick boundaries, and at 64 hz, 2 seconds, that is 128 possible thread switches out of 1048576 calls to fib(), or 8192 calls per possible thread switch. The mystery so far is why the sensitivity to code location. I first noticed this with some C programs, but those cases weren't showing as much sensitivity as this case. I switched to all assembly to better control the code location. This could be some type of quirk with the Intel 3770K processor.

Some random thoughts.

intel - How to read performance counters on i5, i7 CPUs - Stack Overflow
Use the performance counters to examine things like number of cache misses, and number of instructions executed.

Processor affinity - Wikipedia
Consider whether Windows is bumping the thread from one processor to another over the time interval, thus losing the benefit of your current cache.

Your process could still be interrupted by hardware interrupts.
Advanced Programmable Interrupt Controller - Wikipedia
Perhaps in conjunction with affinity, choose a core which is less likely to be bothered by the outside world.

Quote:

Originally Posted by Salem

Some random thoughts.

Those are possible explanations for variations in run time, but they don't explain a consistent difference in run time (1.465 seconds versus 2.000 seconds) due to location of code.

Quote:

Originally Posted by rcgldr

Those are possible explanations for variations in run time, but they don't explain a consistent difference in run time (1.465 seconds versus 2.000 seconds) due to location of code.

just have to check what innermost cacheline is in size and align 64 (or whats the size of cacheline) right before innerloop,so all innerloop ends up inside cacheline instead of cpu loads cacheline very often which brings down speed