# Optimization issues - minimize branches

This is a discussion on Optimization issues - minimize branches within the Tech Board forums, part of the Community Boards category; Delay slot is used, in our course at least, a lot. One requirement of out assignment is to use forwarding ...

1. Delay slot is used, in our course at least, a lot. One requirement of out assignment is to use forwarding and select between delayed slot and BTB (branch target buffer). The second is actually a predictor for whether we are going to take the branch or not and a predictor for the address target of the branch.

However, since we have managed to avoid branches, we use only two. One for the loop and one for the check, so the loss is not big. All this, in my current code, that does 933 cycles. (the one with the loop unrolling). I have BTB enabled for now. Delay slot gives a bit more cycles, but could give a bit less I guess if I make some re-arrangements.

But, since the number of branches is just two, I feel that the loss we have now is, that we use a count array and then we should load the values in r17-r21. Nomimal thought something, which uses slti, that can get us around this. At this thought is where we should focus now, I would say.

2. Originally Posted by std10093
For the 2147484:
How did you found the upper bound?
There is a formula for it, but I didn't bother rediscovering it or looking it up: I just checked it in a loop (comparing data/1000 to the value obtained using the multiply by reciprocal method) and found the first value, 6100999, that differs.

Originally Posted by std10093
As for the slti you said, in order to get rid of the array of counters and use the required registers, I did not understand you. Maybe the fact that it is 5:50 at the morning has some impact.
Do you want me to post the code I mentioned in post #13? Here's some further hints:

If you remember that
Code:
`slti rA, rB, C`
is equivalent to
Code:
`rA = (rB < C) ? 1 : 0`
the basic idea of my loop is
Code:
```r2 = value
r16 = r16 + (value < 0) ? 1 : 0
r17 = r17 + (value < 1000) ? 1 : 0
r18 = r18 + (value < 2000) ? 1 : 0
r19 = r19 + (value < 3000) ? 1 : 0
r20 = r20 + (value < 4000) ? 1 : 0
r21 = r21 + (value < 5000) ? 1 : 0
r22 = r22 + 1
Repeat until r22 is equal to number of values```
After the above loops,
Code:
```r16 = number of values, where value < 0
r17 = number of values, where value < 1000
r18 = number of values, where value < 2000
r19 = number of values, where value < 3000
r20 = number of values, where value < 4000
r21 = number of values, where value < 5000
r22 = number of values, total```
At this point, you need to realize that when min < max,
Code:
`Count(MIN <= data < MAX) = Count(data < MAX) - Count(data - MIN)`
because the count for the upper limit includes all those included in the count for the lower limit. Therefore, doing
Code:
```r22 = r22 - r21
r21 = r21 - r20
r20 = r20 - r19
r19 = r19 - r18
r18 = r18 - r17
r17 = r17 - r16```
yields
Code:
```r16 = number of values, where value < 0
r17 = number of values, where 0 <= value < 1000
r18 = number of values, where 1000 <= value < 2000
r19 = number of values, where 2000 <= value < 3000
r20 = number of values, where 3000 <= value < 4000
r21 = number of values, where 4000 <= value < 5000
r22 = number of values, where 5000 <= value```
Note that you need the number of values, in addition to the offset to data array. By interleaving the additions and the slti's and using different registers you can achieve no-stalls WinMIPS64 code.

I'm assuming WinMIPS64 delay slots are similar to speculative execution: they are decoded and executed, except that the results (including any flags affected) may be discarded if the previous branch is taken. According to the documentation, WinMIPS64 looks like a very ..limited?.. subset of the MIPS architecture, so I was unsure on how much of the MIPS documentation applies to WinMIPS64. I used only the WinMIPS64 documentation, assuming standard MIPS documentation might not apply to WinMIPS64.

(I had heard of Windows MIPS emulators before. It's not something I would voluntarily use for teaching.)

(Oh, and if it matters, I obviously did check my code, and the results it yields, in the WinMPS64 emulator. I do use Linux, but WinMIPS64 runs under Wine just fine, aside from some visual glitches.)

3. I am aware of slt. I have used it widely.Well, since Qtspim was the simulator we used in Architecture 1 and this is the Architecture 2, I used both documentations and so far did not have problems.

I see what you did with the number. Pretty good.

No don't post code and I think you feel why

4. I got 1629 cycles (with the classical 2 branch taken and 2 branch misprediction stalls :P )
The problem is that in my mind, I can not predict that if I unroll the loop, I will get better performance. The best performance now is 933 cycles. There, we pass 5 elements at a time and because mul takes 7 stages to be executed, we make a big improvement by unrolling the loop.
With this approach? Does it worth it?

5. How big is the level 1 instruction cache?
If you unroll too much, the cache might not keep up with a long run of branchless code, and the branch (when it does come) will be another cache miss as well.

6. Originally Posted by std10093
No don't post code and I think you feel why
Absolutely. I only wanted you to know the code does exist, and runs with the timings I said in my post, using WinMIPS64. (As you probably know, I prefer to explain or show the basis of my claims, rather than just claim something and assume others take my word for it.)
Originally Posted by std10093
The problem is that in my mind, I can not predict that if I unroll the loop, I will get better performance.
In practice, this is solved by having two (or more) versions of the same function: One for small amounts of data (no loop unrolling, no expensive initialization and set up), and one for large amounts of data (loop unrolling, possibly more expensive initialization to speed up individual iterations). You pick the one to use based on the amount of data.

The tipping point is best found experimentally, using real-world data, and real-world architectures, and checking for any possible pathological cases. (For example, if this was a function that gets called often without any data at all, it might be worth it to check right at the bat whether there is any data at all to work on or not. If there is, then check if there is enough to warrant the use of the loop-unrolled version instead of the normal version.)

If you write a loop that unrolls say eight iterations into one, then you also need to check if the number of data elements is divisible by eight. You need to handle those excess elements (0, 1, 2, .., up to 7 excess elements) before or after the unrolled loop.

If your unrolled loop consists of the exact same code repeated for each input element, with just conditional checks and branches removed (and possibly permuting the registers used), you can use Duff's device to jump into the middle of the first iteration to handle those excess elements.

(While MIPS can do computed jumps using the jr instruction, WinMIPS64 does not seem to have a way to use label addresses as array data initializers -- on other words, so that you could store a list of labels (addresses) in a data array, then use that array to pick a jump target address. It is supported by many if not most assemblers, and is a common technique to implement C switch() equivalent code. On WinMIPS64 you can obtaining the addresses of the labels and save them into an array at run time, say on the first run, but it's just not worth the effort.)

A very common way to handle the excess elements not handled by the unrolled loop is to let the one-element-per-iteration variant handle those excess elements. With a bit of planning (wrt. register use and so on), they mesh together usually very well.

7. Very true what salem says, but at this point, we have no cache.

Well, nomimal, you are a member I can trust, so if you say "I got these results by some code based on the idea I described before", then I believe you.

I decided not to unroll the loop for the slti method, because it can be better, but not enough to beat the 933 cycles (which are going to be reduced, at least, at 931, thus maybe they are going to decrease a bit more) of the way that Salem described, but used the method Nomimal said with the reciprocicals.

You see, Salem and Nomimal are two great people (well not so great as the Great Alexander, but you get my point ;p ) and when they share ideas, then really beautiful ideas come to play. I really learned a lot from this. Tonight, I am going to produce my pdf with Latex and submit the assignment. Bravo to everybody

8. well not so great as the Great Alexander
>_<

You've never seen Salem dance!

Soma (doot deet do de doo de doot doo)

9. haha, I admit I did not have in mind that factor... Hmmm... As a penalty I am going to fully unroll the loop in the code with the 933-- cycles..

10. Did not worth it. I would take 900 cycles (for N = 98 = 7 * 14). So I would have to add some more lines of code for the two remaining elements of the array.

So, I think that we can close at 930 cycles

11. Finally, the code runs in 826 cycles. Thank you Salem and Nomimal

12. Deadline for homework is behind us so I can post code can't I ?

The solution discussed here is :

Code:
```.data
matrix:    .word 1000,4246,4850,54,3945,3192,3436,2586,2189,925,3814
.word 2363,2464,101,1362,2341,1596,4172,2929,2677,2435
.word 4033,2225,2598,4070,704,2052,64,4682,2964,1332
.word 2879,772,3829,170,3475,1261,153,1005,497,1721
.word 162,4664,4493,4786,1693,2876,2530,3177,2706,1229
.word 208,2332,1061,3585,4145,4795,1071,890,504,4109
.word 2761,3877,2894,4591,807,3651,3186,3732,2085,557
.word 4173,145,4681,4060,599,3391,4820,166,2273,1003
.word 2230,1309,3008,850,4509,505,4826,4726,2898,1718
.word 3497,1358,2272,3752,2433,4577,3545,4147,3512
; s1:14
; s2:f
; s3:19
; s4:12
; s5:16
hits:    .word 0,0,0,0,0,0,0,0
CONTROL:    .word32 0x10000
DATA:    .word32 0x10008
DIVIDE:    .word 0x20C49C ; if I move this below the next line it won't load it (???)
mes:    .asciiz "Invalid number \:\n"    ; error message

.code
lwu \$t4,DIVIDE(\$zero)    ; t4 == 2^31 / 1000
daddi \$t1, \$zero, matrix    ; t1 == start of the array
daddi \$s7, \$t1, 800    ; s7 == end of the array
daddi \$t3, \$zero, hits    ; t3 == start of the hits array
first_iteration:
lwu \$t0,(\$t1)    ; load value to t0
dmulu \$t2, \$t0, \$t4 ; multiply by 2147484 - ! unsigned
dsra \$t8, \$t2, 31 ; right shift
; up till now we have divided by 1000 - valid for the range we care
loop:
lwu \$t0,(\$t1)    ; load value to t0
slti \$t7, \$t8, 5    ; meanwhile check for error
dmulu \$t2, \$t0, \$t4 ; multiply by 2147484 - ! unsigned
; meanwhile store the previous result !
dsll \$t6, \$t8, 3    ; multiply by 8 to get the offset
lwu \$t5, hits(\$t6) ; load the value from hits
beqz \$t7, error    ; if t2 was more than 4 error
daddi \$t5, \$t5, 1    ; increment it ! STRUCTURAL HAZARD
dsra \$t8, \$t2, 31 ; right shift
; up till now we have divided by 1000
sd \$t5, hits(\$t6)    ; store it back
; result stored !
condition:
bne \$t1,\$s7,loop
;store the last result and check for error !
slti \$t7, \$t8, 5    ; meanwhile check for error
dsll \$t6, \$t8, 3    ; multiply by 8 to get the offset
beqz \$t7, error    ; if t2 was more than 4 error !
lwu \$t5, hits(\$t6) ; load the value from hits
daddi \$t5, \$t5, 1    ; increment it
sd \$t5, hits(\$t6)    ; store it back
lwu \$s1, (\$t3)
lwu \$s2, 8(\$t3)
lwu \$s3, 16(\$t3)
lwu \$s4, 24(\$t3)
lwu \$s5, 32(\$t3)
halt
error:
; needed for printing
lwu \$s0,DATA(\$zero)    ; \$s0 = address of DATA register
lwu \$s6,CONTROL(\$zero)    ; \$s6 = address of CONTROL register
; end printing
daddi \$v0,\$zero,4       ; set for ascii output
sd \$t0,0(\$s0)           ; write address of message to DATA register
sd \$v0,0(\$s6)           ; make it happen
ld \$t0,(\$t1)    ; load value to t0
daddi \$v0,\$zero,2    ; set for signed integer output
sd \$t0,0(\$s0)           ; write integer to DATA register
sd \$v0,0(\$s6)           ; make it happen
halt```
Runs in 1223 cycles with branch buffer ON - but notice I abort on error, which adds error checking while in the loop ! The checks for < 0 are not needed as in the unsigned arithmetics negative numbers are much larger than 5000. The code looks a bit messed up cause I moved instructions around to avoid RAW, WAW etc hazards. Namely I store the results and error check _on the next iteration of the loop_ so I am kept busy during multiplication.

My question is - can I avoid the structural hazard in the loop body ?

Page 2 of 2 First 12