Thread: Intel 3770K ~35% time difference in tight loop depending on location - other cpus?

  1. #1
    Registered User
    Join Date
    Apr 2013
    Posts
    1,645

    Intel 3770K ~35% time difference in tight loop depending on location - other cpus?

    The code below is X86 64 bit assembly code to calculate the sum of Fib(0) to Fib(93). It includes a conditional to use either a tight loop or an unfolded loop with a computed jump to enter the unfolded loop similar to C switch / case. Originally, it was meant to see the performance improvement of unfolding a loop, but during the testing, it turns out that the tight loop code is very sensitive to where the code is located. The table with multiples of 37 % 93 is used mostly to vary the jumps on the unfolded code.

    The code is run 2^20 = 1048576 times for a bench mark, so this is actually a nested loop where the inner loop runs from 0 to 93 times, for an average of ~46 (since 0 and 1 inputs skip the inner loop). Even if the outer and inner loop have a cache conflict, since the inner loop is run an average of ~46 times, it shouldn't have as much impact on the timing that I'm seeing.

    The times are noted below, but for the tight loop, depending on code location, the tight loop runs as fast as 1.465 seconds, or as slow as 2.000 seconds. The unfolded loop only has slight variation, 1.042 to 1.048 seconds.

    I removed the printf statements, and switched to using a debugger to check results, and the performance was unaffected (since it's assembly code). I'm using Visual Studio 2015 on Windows 7 Pro 64 bit to do the builds and do the runs. Switching to AT&T syntax for Linux / Posix could take a while, unless there's something MASM syntax compatible for Linux / Posix.

    I'm wondering if other X86 cpu's have similar performance differences. I'm posting this as a follow up in another thread that included performance of a 4 way merge sort that uses gotos.

    Sometimes "goto" is useful

    Code:
            includelib msvcrtd
            includelib oldnames
    
            .data
    ; multiples of 37 mod 93 + 93 at the end
    a       dq      0,37,74,18,55,92,36,73,17,54
            dq     91,35,72,16,53,90,34,71,15,52
            dq     89,33,70,14,51,88,32,69,13,50
            dq     87,31,68,12,49,86,30,67,11,48
            dq     85,29,66,10,47,84,28,65, 9,46
            dq     83,27,64, 8,45,82,26,63, 7,44
            dq     81,25,62, 6,43,80,24,61, 5,42
            dq     79,23,60, 4,41,78,22,59, 3,40
            dq     77,21,58, 2,39,76,20,57, 1,38
            dq     75,19,56,93
            .data?
            .code
    ;       parameters      rcx,rdx,r8,r9
    ;       not saved       rax,rcx,rdx,r8,r9,r10,r11
    ;       code starts on 16 byte boundary
    main    proc
            push    r15
            push    r14
            push    r13
            push    r12
            push    rbp
            mov     rbp,rsp
            and     rsp,0fffffffffffffff0h
            sub     rsp,64
            mov     r15,offset a
            xor     r14,r14
            mov     r11,0100000h
    ;       nop padding effect on loop version (with 0 padding in padx below)
    ;        0 puts main2 on  odd 16 byte boundary  clk = 0131876622h => 1.465 seconds
    ;        9 puts main1 on  odd 16 byte boundary  clk = 01573FE951h => 1.645 seconds
            rept    0
            nop
            endm
            rdtsc
            mov     r12,rdx
            shl     r12,32
            or      r12,rax
    main0:  xor     r10,r10
    main1:  mov     rcx,[r10+r15]
            call    fib
    main2:  add     r14,rax
            add     r10,8
            cmp     r10,8*94
            jne     main1
            dec     r11
            jnz     main0
            rdtsc
            mov     r13,rdx
            shl     r13,32
            or      r13,rax
            sub     r13,r12
            mov     rdx,r14
            xor     rax,rax
            mov     rsp,rbp
            pop     rbp
            pop     r12
            pop     r13
            pop     r14
            pop     r15
            ret
    main    endp
    
            align   16
    padx    proc
    ;       nop padding effect on loop version with 0 padding above
    ;        0 puts fib on  odd 16 byte boundary    clk = 0131876622h => 1.465 seconds
    ;       16 puts fib on even 16 byte boundary    clk = 01A13C8CB8h => 2.000 seconds
    ;       nop padding effect on computed jump version with 9 padding above
    ;        0 puts fib on  odd 16 byte boundary    clk = 00D979792Dh => 1.042 seconds
    ;       16 puts fib on even 16 byte boundary    clk = 00DA93E04Dh => 1.048 seconds
            rept    0
            nop
            endm
    padx    endp
    
            if      0       ;0 = loop version, 1 = computed jump version
    
    fib     proc                            ;rcx == n
            mov     r8,rcx                  ;set jmp adr
            mov     r9,offset fib0+279
            lea     r8,[r8+r8*2]
            neg     r8
            add     r8,r9
            mov     rax,rcx                 ;set rax,rdx
            mov     rdx,1
            and     rax,rdx
            sub     rdx,rax
            jmp     r8
    fib0:   ; assumes add xxx,xxx takes 3 bytes
            rept    46
            add     rax,rdx
            add     rdx,rax
            endm
            add     rax,rdx
            ret
    fib     endp
    
            else
    
    fib     proc                            ;rcx == n
            mov     rax,rcx                 ;br if < 2
            cmp     rax,2
            jb      fib1
            mov     rdx,1                   ;set rax, rdx
            and     rax,rdx
            sub     rdx,rax
            shr     rcx,1
    fib0:   add     rdx,rax
            add     rax,rdx
            dec     rcx
            jnz     fib0
    fib1:   ret     
    fib     endp
    
            endif
            end
    Last edited by rcgldr; 08-05-2019 at 03:52 AM.

  2. #2
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,722
    The CPU caches are aligned, meaning that if the loop is big enough, it's likely that some of it will be outside the current cache-line, making it slower. I don't think that's what's going on here though. More likely, the operating system switches the thread around (thus introducing randomness to the branch predictors).
    Devoted my life to programming...

  3. #3
    Registered User
    Join Date
    Apr 2013
    Posts
    1,645
    Quote Originally Posted by GReaper View Post
    The CPU caches are aligned, meaning that if the loop is big enough, it's likely that some of it will be outside the current cache-line, making it slower. I don't think that's what's going on here though. More likely, the operating system switches the thread around (thus introducing randomness to the branch predictors).
    Possibly, but Windows non-I/O thread switches only occur on tick boundaries, and at 64 hz, 2 seconds, that is 128 possible thread switches out of 1048576 calls to fib(), or 8192 calls per possible thread switch. The mystery so far is why the sensitivity to code location. I first noticed this with some C programs, but those cases weren't showing as much sensitivity as this case. I switched to all assembly to better control the code location. This could be some type of quirk with the Intel 3770K processor.

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    37,378
    Some random thoughts.

    intel - How to read performance counters on i5, i7 CPUs - Stack Overflow
    Use the performance counters to examine things like number of cache misses, and number of instructions executed.

    Processor affinity - Wikipedia
    Consider whether Windows is bumping the thread from one processor to another over the time interval, thus losing the benefit of your current cache.

    Your process could still be interrupted by hardware interrupts.
    Advanced Programmable Interrupt Controller - Wikipedia
    Perhaps in conjunction with affinity, choose a core which is less likely to be bothered by the outside world.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    Apr 2013
    Posts
    1,645
    Quote Originally Posted by Salem View Post
    Some random thoughts.
    Those are possible explanations for variations in run time, but they don't explain a consistent difference in run time (1.465 seconds versus 2.000 seconds) due to location of code.

  6. #6
    Registered User
    Join Date
    Apr 2019
    Posts
    69
    Quote Originally Posted by rcgldr View Post
    Those are possible explanations for variations in run time, but they don't explain a consistent difference in run time (1.465 seconds versus 2.000 seconds) due to location of code.
    just have to check what innermost cacheline is in size and align 64 (or whats the size of cacheline) right before innerloop,so all innerloop ends up inside cacheline instead of cpu loads cacheline very often which brings down speed
    How do you know when you are a Guru?
    When you can (code) C even blindfolded

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. segmentation fault depending on location of file
    By Layla_E in forum C Programming
    Replies: 16
    Last Post: 03-07-2019, 12:46 PM
  2. what difference between address and memory location?
    By Shang Roy in forum C Programming
    Replies: 1
    Last Post: 10-03-2012, 05:45 PM
  3. Replies: 5
    Last Post: 03-23-2012, 12:48 PM
  4. Bidirectional loop? (Increase/decrease depending on values)
    By BlueGooGames in forum C Programming
    Replies: 4
    Last Post: 01-08-2011, 12:08 PM
  5. Replies: 3
    Last Post: 11-15-2003, 11:20 AM

Tags for this Thread