Thread: Optimization issues - minimize branches

  1. #16
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Delay slot is used, in our course at least, a lot. One requirement of out assignment is to use forwarding and select between delayed slot and BTB (branch target buffer). The second is actually a predictor for whether we are going to take the branch or not and a predictor for the address target of the branch.

    However, since we have managed to avoid branches, we use only two. One for the loop and one for the check, so the loss is not big. All this, in my current code, that does 933 cycles. (the one with the loop unrolling). I have BTB enabled for now. Delay slot gives a bit more cycles, but could give a bit less I guess if I make some re-arrangements.

    But, since the number of branches is just two, I feel that the loss we have now is, that we use a count array and then we should load the values in r17-r21. Nomimal thought something, which uses slti, that can get us around this. At this thought is where we should focus now, I would say.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  2. #17
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by std10093 View Post
    For the 2147484:
    How did you found the upper bound?
    There is a formula for it, but I didn't bother rediscovering it or looking it up: I just checked it in a loop (comparing data/1000 to the value obtained using the multiply by reciprocal method) and found the first value, 6100999, that differs.

    Quote Originally Posted by std10093 View Post
    As for the slti you said, in order to get rid of the array of counters and use the required registers, I did not understand you. Maybe the fact that it is 5:50 at the morning has some impact.
    Do you want me to post the code I mentioned in post #13? Here's some further hints:

    If you remember that
    Code:
    slti rA, rB, C
    is equivalent to
    Code:
    rA = (rB < C) ? 1 : 0
    the basic idea of my loop is
    Code:
    r2 = value
    r16 = r16 + (value < 0) ? 1 : 0
    r17 = r17 + (value < 1000) ? 1 : 0
    r18 = r18 + (value < 2000) ? 1 : 0
    r19 = r19 + (value < 3000) ? 1 : 0
    r20 = r20 + (value < 4000) ? 1 : 0
    r21 = r21 + (value < 5000) ? 1 : 0
    r22 = r22 + 1
    Repeat until r22 is equal to number of values
    After the above loops,
    Code:
    r16 = number of values, where value < 0
    r17 = number of values, where value < 1000
    r18 = number of values, where value < 2000
    r19 = number of values, where value < 3000
    r20 = number of values, where value < 4000
    r21 = number of values, where value < 5000
    r22 = number of values, total
    At this point, you need to realize that when min < max,
    Code:
    Count(MIN <= data < MAX) = Count(data < MAX) - Count(data - MIN)
    because the count for the upper limit includes all those included in the count for the lower limit. Therefore, doing
    Code:
    r22 = r22 - r21
    r21 = r21 - r20
    r20 = r20 - r19
    r19 = r19 - r18
    r18 = r18 - r17
    r17 = r17 - r16
    yields
    Code:
    r16 = number of values, where value < 0
    r17 = number of values, where 0 <= value < 1000
    r18 = number of values, where 1000 <= value < 2000
    r19 = number of values, where 2000 <= value < 3000
    r20 = number of values, where 3000 <= value < 4000
    r21 = number of values, where 4000 <= value < 5000
    r22 = number of values, where 5000 <= value
    Note that you need the number of values, in addition to the offset to data array. By interleaving the additions and the slti's and using different registers you can achieve no-stalls WinMIPS64 code.

    I'm assuming WinMIPS64 delay slots are similar to speculative execution: they are decoded and executed, except that the results (including any flags affected) may be discarded if the previous branch is taken. According to the documentation, WinMIPS64 looks like a very ..limited?.. subset of the MIPS architecture, so I was unsure on how much of the MIPS documentation applies to WinMIPS64. I used only the WinMIPS64 documentation, assuming standard MIPS documentation might not apply to WinMIPS64.

    (I had heard of Windows MIPS emulators before. It's not something I would voluntarily use for teaching.)

    (Oh, and if it matters, I obviously did check my code, and the results it yields, in the WinMPS64 emulator. I do use Linux, but WinMIPS64 runs under Wine just fine, aside from some visual glitches.)

  3. #18
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    I am aware of slt. I have used it widely.Well, since Qtspim was the simulator we used in Architecture 1 and this is the Architecture 2, I used both documentations and so far did not have problems.

    I see what you did with the number. Pretty good.

    No don't post code and I think you feel why
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  4. #19
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    I got 1629 cycles (with the classical 2 branch taken and 2 branch misprediction stalls :P )
    The problem is that in my mind, I can not predict that if I unroll the loop, I will get better performance. The best performance now is 933 cycles. There, we pass 5 elements at a time and because mul takes 7 stages to be executed, we make a big improvement by unrolling the loop.
    With this approach? Does it worth it?
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  5. #20
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    How big is the level 1 instruction cache?
    If you unroll too much, the cache might not keep up with a long run of branchless code, and the branch (when it does come) will be another cache miss as well.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #21
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by std10093 View Post
    No don't post code and I think you feel why
    Absolutely. I only wanted you to know the code does exist, and runs with the timings I said in my post, using WinMIPS64. (As you probably know, I prefer to explain or show the basis of my claims, rather than just claim something and assume others take my word for it.)
    Quote Originally Posted by std10093 View Post
    The problem is that in my mind, I can not predict that if I unroll the loop, I will get better performance.
    In practice, this is solved by having two (or more) versions of the same function: One for small amounts of data (no loop unrolling, no expensive initialization and set up), and one for large amounts of data (loop unrolling, possibly more expensive initialization to speed up individual iterations). You pick the one to use based on the amount of data.

    The tipping point is best found experimentally, using real-world data, and real-world architectures, and checking for any possible pathological cases. (For example, if this was a function that gets called often without any data at all, it might be worth it to check right at the bat whether there is any data at all to work on or not. If there is, then check if there is enough to warrant the use of the loop-unrolled version instead of the normal version.)

    If you write a loop that unrolls say eight iterations into one, then you also need to check if the number of data elements is divisible by eight. You need to handle those excess elements (0, 1, 2, .., up to 7 excess elements) before or after the unrolled loop.

    If your unrolled loop consists of the exact same code repeated for each input element, with just conditional checks and branches removed (and possibly permuting the registers used), you can use Duff's device to jump into the middle of the first iteration to handle those excess elements.

    (While MIPS can do computed jumps using the jr instruction, WinMIPS64 does not seem to have a way to use label addresses as array data initializers -- on other words, so that you could store a list of labels (addresses) in a data array, then use that array to pick a jump target address. It is supported by many if not most assemblers, and is a common technique to implement C switch() equivalent code. On WinMIPS64 you can obtaining the addresses of the labels and save them into an array at run time, say on the first run, but it's just not worth the effort.)

    A very common way to handle the excess elements not handled by the unrolled loop is to let the one-element-per-iteration variant handle those excess elements. With a bit of planning (wrt. register use and so on), they mesh together usually very well.

  7. #22
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Very true what salem says, but at this point, we have no cache.

    Well, nomimal, you are a member I can trust, so if you say "I got these results by some code based on the idea I described before", then I believe you.

    I decided not to unroll the loop for the slti method, because it can be better, but not enough to beat the 933 cycles (which are going to be reduced, at least, at 931, thus maybe they are going to decrease a bit more) of the way that Salem described, but used the method Nomimal said with the reciprocicals.

    You see, Salem and Nomimal are two great people (well not so great as the Great Alexander, but you get my point ;p ) and when they share ideas, then really beautiful ideas come to play. I really learned a lot from this. Tonight, I am going to produce my pdf with Latex and submit the assignment. Bravo to everybody
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  8. #23
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    well not so great as the Great Alexander
    >_<

    You've never seen Salem dance!

    Soma (doot deet do de doo de doot doo)

  9. #24
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    haha, I admit I did not have in mind that factor... Hmmm... As a penalty I am going to fully unroll the loop in the code with the 933-- cycles..
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  10. #25
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Did not worth it. I would take 900 cycles (for N = 98 = 7 * 14). So I would have to add some more lines of code for the two remaining elements of the array.

    So, I think that we can close at 930 cycles
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  11. #26
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Finally, the code runs in 826 cycles. Thank you Salem and Nomimal
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  12. #27
    Registered User
    Join Date
    May 2013
    Posts
    1
    Deadline for homework is behind us so I can post code can't I ?

    The solution discussed here is :

    Code:
    .data
    matrix:    .word 1000,4246,4850,54,3945,3192,3436,2586,2189,925,3814
        .word 2363,2464,101,1362,2341,1596,4172,2929,2677,2435
        .word 4033,2225,2598,4070,704,2052,64,4682,2964,1332
        .word 2879,772,3829,170,3475,1261,153,1005,497,1721
        .word 162,4664,4493,4786,1693,2876,2530,3177,2706,1229
        .word 208,2332,1061,3585,4145,4795,1071,890,504,4109
        .word 2761,3877,2894,4591,807,3651,3186,3732,2085,557
        .word 4173,145,4681,4060,599,3391,4820,166,2273,1003
        .word 2230,1309,3008,850,4509,505,4826,4726,2898,1718
        .word 3497,1358,2272,3752,2433,4577,3545,4147,3512
    ; s1:14
    ; s2:f
    ; s3:19
    ; s4:12
    ; s5:16
    hits:    .word 0,0,0,0,0,0,0,0
    CONTROL:    .word32 0x10000
    DATA:    .word32 0x10008
    DIVIDE:    .word 0x20C49C ; if I move this below the next line it won't load it (???)
    mes:    .asciiz "Invalid number \:\n"    ; error message
    
    .code
        lwu $t4,DIVIDE($zero)    ; t4 == 2^31 / 1000
        daddi $t1, $zero, matrix    ; t1 == start of the array
        daddi $s7, $t1, 800    ; s7 == end of the array
        daddi $t3, $zero, hits    ; t3 == start of the hits array
        first_iteration:
            lwu $t0,($t1)    ; load value to t0
            daddi $t1, $t1, 8    ; increment address
            dmulu $t2, $t0, $t4 ; multiply by 2147484 - ! unsigned
            dsra $t8, $t2, 31 ; right shift
            ; up till now we have divided by 1000 - valid for the range we care
        loop:
            lwu $t0,($t1)    ; load value to t0
            slti $t7, $t8, 5    ; meanwhile check for error
            dmulu $t2, $t0, $t4 ; multiply by 2147484 - ! unsigned
            ; meanwhile store the previous result !
            dsll $t6, $t8, 3    ; multiply by 8 to get the offset
            lwu $t5, hits($t6) ; load the value from hits
            beqz $t7, error    ; if t2 was more than 4 error
            daddi $t1, $t1, 8    ; increment address
            daddi $t5, $t5, 1    ; increment it ! STRUCTURAL HAZARD
            dsra $t8, $t2, 31 ; right shift
            ; up till now we have divided by 1000
            sd $t5, hits($t6)    ; store it back
            ; result stored !
        condition:
            bne $t1,$s7,loop
        ;store the last result and check for error !
        slti $t7, $t8, 5    ; meanwhile check for error
        dsll $t6, $t8, 3    ; multiply by 8 to get the offset
        beqz $t7, error    ; if t2 was more than 4 error !
        lwu $t5, hits($t6) ; load the value from hits
        daddi $t5, $t5, 1    ; increment it
        sd $t5, hits($t6)    ; store it back
        ; load the registers
        lwu $s1, ($t3)
        lwu $s2, 8($t3)
        lwu $s3, 16($t3)
        lwu $s4, 24($t3)
        lwu $s5, 32($t3)
        halt
        error:
            ; needed for printing
            lwu $s0,DATA($zero)    ; $s0 = address of DATA register
            lwu $s6,CONTROL($zero)    ; $s6 = address of CONTROL register
            ; end printing
            daddi $v0,$zero,4       ; set for ascii output
            daddi $t0,$zero,mes
            sd $t0,0($s0)           ; write address of message to DATA register
            sd $v0,0($s6)           ; make it happen
            daddi $t1, $t1, -8    ; decrement address
            ld $t0,($t1)    ; load value to t0
            daddi $v0,$zero,2    ; set for signed integer output
            sd $t0,0($s0)           ; write integer to DATA register
            sd $v0,0($s6)           ; make it happen
        halt
    Runs in 1223 cycles with branch buffer ON - but notice I abort on error, which adds error checking while in the loop ! The checks for < 0 are not needed as in the unsigned arithmetics negative numbers are much larger than 5000. The code looks a bit messed up cause I moved instructions around to avoid RAW, WAW etc hazards. Namely I store the results and error check _on the next iteration of the loop_ so I am kept busy during multiplication.

    My question is - can I avoid the structural hazard in the loop body ?

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. How to count Branches
    By husslela2 in forum C Programming
    Replies: 35
    Last Post: 04-28-2010, 12:33 AM
  2. redraw after minimize
    By Prads in forum Windows Programming
    Replies: 6
    Last Post: 03-21-2010, 11:16 AM
  3. minimize and maximize
    By Wick in forum Windows Programming
    Replies: 2
    Last Post: 08-20-2003, 10:37 PM
  4. Minimize itself?
    By MassKid in forum C Programming
    Replies: 4
    Last Post: 02-06-2002, 06:47 AM