Optimization issues - minimize branches

**std10093** · 04-21-2013

Delay slot is used, in our course at least, a lot. One requirement of out assignment is to use forwarding and select between delayed slot and BTB (branch target buffer). The second is actually a predictor for whether we are going to take the branch or not and a predictor for the address target of the branch.

However, since we have managed to avoid branches, we use only two. One for the loop and one for the check, so the loss is not big. All this, in my current code, that does 933 cycles. (the one with the loop unrolling). I have BTB enabled for now. Delay slot gives a bit more cycles, but could give a bit less I guess if I make some re-arrangements.

But, since the number of branches is just two, I feel that the loss we have now is, that we use a count array and then we should load the values in r17-r21. Nomimal thought something, which uses slti, that can get us around this. At this thought is where we should focus now, I would say.

**Nominal Animal** · 04-21-2013

Originally Posted by std10093

For the 2147484:
How did you found the upper bound?

There is a formula for it, but I didn't bother rediscovering it or looking it up: I just checked it in a loop (comparing data/1000 to the value obtained using the multiply by reciprocal method) and found the first value, 6100999, that differs.

Originally Posted by std10093

As for the slti you said, in order to get rid of the array of counters and use the required registers, I did not understand you. Maybe the fact that it is 5:50 at the morning has some impact.

Do you want me to post the code I mentioned in post #13? Here's some further hints:

If you remember that

Code:

slti rA, rB, C

is equivalent to

Code:

rA = (rB < C) ? 1 : 0

the basic idea of my loop is

Code:

r2 = value
r16 = r16 + (value < 0) ? 1 : 0
r17 = r17 + (value < 1000) ? 1 : 0
r18 = r18 + (value < 2000) ? 1 : 0
r19 = r19 + (value < 3000) ? 1 : 0
r20 = r20 + (value < 4000) ? 1 : 0
r21 = r21 + (value < 5000) ? 1 : 0
r22 = r22 + 1
Repeat until r22 is equal to number of values

After the above loops,

Code:

r16 = number of values, where value < 0
r17 = number of values, where value < 1000
r18 = number of values, where value < 2000
r19 = number of values, where value < 3000
r20 = number of values, where value < 4000
r21 = number of values, where value < 5000
r22 = number of values, total

At this point, you need to realize that when min < max,

Code:

Count(MIN <= data < MAX) = Count(data < MAX) - Count(data - MIN)

because the count for the upper limit includes all those included in the count for the lower limit. Therefore, doing

Code:

r22 = r22 - r21
r21 = r21 - r20
r20 = r20 - r19
r19 = r19 - r18
r18 = r18 - r17
r17 = r17 - r16

yields

Code:

r16 = number of values, where value < 0
r17 = number of values, where 0 <= value < 1000
r18 = number of values, where 1000 <= value < 2000
r19 = number of values, where 2000 <= value < 3000
r20 = number of values, where 3000 <= value < 4000
r21 = number of values, where 4000 <= value < 5000
r22 = number of values, where 5000 <= value

Note that you need the number of values, in addition to the offset to data array. By interleaving the additions and the slti's and using different registers you can achieve no-stalls WinMIPS64 code.

I'm assuming WinMIPS64 delay slots are similar to speculative execution: they are decoded and executed, except that the results (including any flags affected) may be discarded if the previous branch is taken. According to the documentation, WinMIPS64 looks like a very ..limited?.. subset of the MIPS architecture, so I was unsure on how much of the MIPS documentation applies to WinMIPS64. I used only the WinMIPS64 documentation, assuming standard MIPS documentation might not apply to WinMIPS64.

(I had heard of Windows MIPS emulators before. It's not something I would voluntarily use for teaching.)

(Oh, and if it matters, I obviously did check my code, and the results it yields, in the WinMPS64 emulator. I do use Linux, but WinMIPS64 runs under Wine just fine, aside from some visual glitches.)

**std10093** · 04-21-2013

I am aware of slt. I have used it widely.Well, since Qtspim was the simulator we used in Architecture 1 and this is the Architecture 2, I used both documentations and so far did not have problems.

I see what you did with the number. Pretty good.

No don't post code and I think you feel why

**std10093** · 04-21-2013

I got 1629 cycles (with the classical 2 branch taken and 2 branch misprediction stalls :P )
The problem is that in my mind, I can not predict that if I unroll the loop, I will get better performance. The best performance now is 933 cycles. There, we pass 5 elements at a time and because mul takes 7 stages to be executed, we make a big improvement by unrolling the loop.
With this approach? Does it worth it?

**Salem** · 04-21-2013

How big is the level 1 instruction cache?
If you unroll too much, the cache might not keep up with a long run of branchless code, and the branch (when it does come) will be another cache miss as well.

**Nominal Animal** · 04-21-2013

Originally Posted by std10093

No don't post code and I think you feel why

Absolutely. I only wanted you to know the code does exist, and runs with the timings I said in my post, using WinMIPS64. (As you probably know, I prefer to explain or show the basis of my claims, rather than just claim something and assume others take my word for it.)

Originally Posted by std10093

The problem is that in my mind, I can not predict that if I unroll the loop, I will get better performance.

In practice, this is solved by having two (or more) versions of the same function: One for small amounts of data (no loop unrolling, no expensive initialization and set up), and one for large amounts of data (loop unrolling, possibly more expensive initialization to speed up individual iterations). You pick the one to use based on the amount of data.

The tipping point is best found experimentally, using real-world data, and real-world architectures, and checking for any possible pathological cases. (For example, if this was a function that gets called often without any data at all, it might be worth it to check right at the bat whether there is any data at all to work on or not. If there is, then check if there is enough to warrant the use of the loop-unrolled version instead of the normal version.)

If you write a loop that unrolls say eight iterations into one, then you also need to check if the number of data elements is divisible by eight. You need to handle those excess elements (0, 1, 2, .., up to 7 excess elements) before or after the unrolled loop.

If your unrolled loop consists of the exact same code repeated for each input element, with just conditional checks and branches removed (and possibly permuting the registers used), you can use Duff's device to jump into the middle of the first iteration to handle those excess elements.

(While MIPS can do computed jumps using the jr instruction, WinMIPS64 does not seem to have a way to use label addresses as array data initializers -- on other words, so that you could store a list of labels (addresses) in a data array, then use that array to pick a jump target address. It is supported by many if not most assemblers, and is a common technique to implement C switch() equivalent code. On WinMIPS64 you can obtaining the addresses of the labels and save them into an array at run time, say on the first run, but it's just not worth the effort.)

A very common way to handle the excess elements not handled by the unrolled loop is to let the one-element-per-iteration variant handle those excess elements. With a bit of planning (wrt. register use and so on), they mesh together usually very well.

**std10093** · 04-22-2013

Very true what salem says, but at this point, we have no cache.

Well, nomimal, you are a member I can trust, so if you say "I got these results by some code based on the idea I described before", then I believe you.

I decided not to unroll the loop for the slti method, because it can be better, but not enough to beat the 933 cycles (which are going to be reduced, at least, at 931, thus maybe they are going to decrease a bit more) of the way that Salem described, but used the method Nomimal said with the reciprocicals.

You see, Salem and Nomimal are two great people (well not so great as the Great Alexander, but you get my point ;p ) and when they share ideas, then really beautiful ideas come to play. I really learned a lot from this. Tonight, I am going to produce my pdf with Latex and submit the assignment. Bravo to everybody

**phantomotap** · 04-22-2013

well not so great as the Great Alexander

>_<

You've never seen Salem dance!

Soma (doot deet do de doo de doot doo)

**std10093** · 04-22-2013

haha, I admit I did not have in mind that factor... Hmmm... As a penalty I am going to fully unroll the loop in the code with the 933-- cycles..

**std10093** · 04-22-2013

Did not worth it. I would take 900 cycles (for N = 98 = 7 * 14). So I would have to add some more lines of code for the two remaining elements of the array.

So, I think that we can close at 930 cycles

**std10093** · 04-26-2013

Finally, the code runs in 826 cycles. Thank you Salem and Nomimal

**Mr_and_Mrs_D** · 05-13-2013

Deadline for homework is behind us so I can post code can't I ?

The solution discussed here is :

Code:

.data
matrix:    .word 1000,4246,4850,54,3945,3192,3436,2586,2189,925,3814
    .word 2363,2464,101,1362,2341,1596,4172,2929,2677,2435
    .word 4033,2225,2598,4070,704,2052,64,4682,2964,1332
    .word 2879,772,3829,170,3475,1261,153,1005,497,1721
    .word 162,4664,4493,4786,1693,2876,2530,3177,2706,1229
    .word 208,2332,1061,3585,4145,4795,1071,890,504,4109
    .word 2761,3877,2894,4591,807,3651,3186,3732,2085,557
    .word 4173,145,4681,4060,599,3391,4820,166,2273,1003
    .word 2230,1309,3008,850,4509,505,4826,4726,2898,1718
    .word 3497,1358,2272,3752,2433,4577,3545,4147,3512
; s1:14
; s2:f
; s3:19
; s4:12
; s5:16
hits:    .word 0,0,0,0,0,0,0,0
CONTROL:    .word32 0x10000
DATA:    .word32 0x10008
DIVIDE:    .word 0x20C49C ; if I move this below the next line it won't load it (???)
mes:    .asciiz "Invalid number \:\n"    ; error message

.code
    lwu $t4,DIVIDE($zero)    ; t4 == 2^31 / 1000
    daddi $t1, $zero, matrix    ; t1 == start of the array
    daddi $s7, $t1, 800    ; s7 == end of the array
    daddi $t3, $zero, hits    ; t3 == start of the hits array
    first_iteration:
        lwu $t0,($t1)    ; load value to t0
        daddi $t1, $t1, 8    ; increment address
        dmulu $t2, $t0, $t4 ; multiply by 2147484 - ! unsigned
        dsra $t8, $t2, 31 ; right shift
        ; up till now we have divided by 1000 - valid for the range we care
    loop:
        lwu $t0,($t1)    ; load value to t0
        slti $t7, $t8, 5    ; meanwhile check for error
        dmulu $t2, $t0, $t4 ; multiply by 2147484 - ! unsigned
        ; meanwhile store the previous result !
        dsll $t6, $t8, 3    ; multiply by 8 to get the offset
        lwu $t5, hits($t6) ; load the value from hits
        beqz $t7, error    ; if t2 was more than 4 error
        daddi $t1, $t1, 8    ; increment address
        daddi $t5, $t5, 1    ; increment it ! STRUCTURAL HAZARD
        dsra $t8, $t2, 31 ; right shift
        ; up till now we have divided by 1000
        sd $t5, hits($t6)    ; store it back
        ; result stored !
    condition:
        bne $t1,$s7,loop
    ;store the last result and check for error !
    slti $t7, $t8, 5    ; meanwhile check for error
    dsll $t6, $t8, 3    ; multiply by 8 to get the offset
    beqz $t7, error    ; if t2 was more than 4 error !
    lwu $t5, hits($t6) ; load the value from hits
    daddi $t5, $t5, 1    ; increment it
    sd $t5, hits($t6)    ; store it back
    ; load the registers
    lwu $s1, ($t3)
    lwu $s2, 8($t3)
    lwu $s3, 16($t3)
    lwu $s4, 24($t3)
    lwu $s5, 32($t3)
    halt
    error:
        ; needed for printing
        lwu $s0,DATA($zero)    ; $s0 = address of DATA register
        lwu $s6,CONTROL($zero)    ; $s6 = address of CONTROL register
        ; end printing
        daddi $v0,$zero,4       ; set for ascii output
        daddi $t0,$zero,mes
        sd $t0,0($s0)           ; write address of message to DATA register
        sd $v0,0($s6)           ; make it happen
        daddi $t1, $t1, -8    ; decrement address
        ld $t0,($t1)    ; load value to t0
        daddi $v0,$zero,2    ; set for signed integer output
        sd $t0,0($s0)           ; write integer to DATA register
        sd $v0,0($s6)           ; make it happen
    halt

Runs in 1223 cycles with branch buffer ON - but notice I abort on error, which adds error checking while in the loop ! The checks for < 0 are not needed as in the unsigned arithmetics negative numbers are much larger than 5000. The code looks a bit messed up cause I moved instructions around to avoid RAW, WAW etc hazards. Namely I store the results and error check _on the next iteration of the loop_ so I am kept busy during multiplication.

My question is - can I avoid the structural hazard in the loop body ?

Thread: Optimization issues - minimize branches

Thread Tools

Search Thread

Display

Similar Threads

How to count Branches

redraw after minimize

minimize and maximize

Minimize itself?