# Optimization issues - minimize branches

Show 80 post(s) from this thread on one page
Page 1 of 2 12 Last
• 04-17-2013
std10093
Optimization issues - minimize branches
My goal is to minimize the number of branches (in assembly), which are very costly in our architecture. As a consequence, if I minimize the branches, then I would say that I can minimize the number of cycles, which is the real goal.

The problem:
An array of 100 elements, which are to be in range [0,4999]. I have to count how many numbers are in [0,999], [1000, 1999], [2000, 2999], [3000, 3999], [4000, 4999]. If the an element is out of bound, print error message.

My current idea is to loop over every element of the array and do this
• If A[i] < 1000 and A[i] >= 0, counter1++ and go to the next element
• If A[i] < 2000, counter2++ and go to the next
...
• If A[i] < 5000, counter5++ and go to the next

Then instead of checking if an element is >4999, which would be executed as many times as the size of the array, to add the counters and if the result is equal to the size of the array, then we are ok, if not print error message (at least one element was out of range).

What do you say? Any ideas?
• 04-17-2013
jimblumberg
Depending on how many numbers fall outside your desired range you may save time by first insuring the values are within your desired range. If they're outside the correct ranges you don need to do any of the other checks.

Jim
• 04-17-2013
Salem
How expensive is division on your machine?

For example, this should have no branches at all (save for the one in the for loop), but may be expensive on the /
counter[ A[i]/1000 ] ++;

Is the distribution of numbers random, or is something useful known about it?

Use a binary partition, although you don't get far with only 5 bins to choose from.
Code:

```if ( A[i] < 3000 ) {   if ( A[i] < 1000 ) {     counter1++;   } else if ( A[i] < 2000 ) {     counter2++;   } else {     counter3++;   } } else {   if ( A[i] < 4000 ) {     counter4++;   } else {     counter5++;   } }```
This has at most 3 comparisons (and branches), as opposed to 5 when done linearly, but it always costs at least 2 comparisons.
If your distribution is heavily biased to one particular range (which you check first), it may be worse than linear which is often out on the first comparison.
• 04-17-2013
Nominal Animal
I'm on roughly the same path with Salem's first suggestion, but I'd go a lot further.

Because the divisor is a constant, you don't need real division; you can multiply by the reciprocal instead.

If the architecture is 32-bit, I would use
Code:

```if (value >= 0 and value < 5000)   counter[ (4294968 * value) >> 32 ]++```
Note that 232 / 1000 = 4294967.296, so the reciprocal multiply gives a correct result only for 0 <= value < 6100999 without additional corrections. Since your interested range is just 0 to 4999, that'll do fine; just remember to add a comment in the source code about that!

Most architectures have a multiply instruction that returns only the high word -- "multiply high" --, so the full-word shift is implicit.

You can avoid branches completely, by reserving a slot with an index just under a power of two; i.e. having all bits set in the index. Because here the next power of two is 8, and index 7 is unused, we can use that (counter[7]) to collect all the unwanted entries -- we get the tally of "other" entries for free!

Here's my rough pseudo-assembly outline for the check:
Code:

```r1 = value r2 = 4999 r2 = r2 - r1      # will be negative (high bit set) if r1 > 4999 r2 = r2 or r1    # will become negative if r1 < 0 r2 = r2 sar 31    # shift arithmetic right, duplicates high bit to all bits # Divide by 1000 by multiplying by 4294968/232 = 0.00100000016391277313232421875. # Only yields valid result for 0 <= r1 <= 4999 < 6100999. r1 = (signed or unsigned) multiply high r1 and 4294968 r1 = r1 or r2    # if value was <0 or >4999, r1 ends up all ones here anyway. r1 = r1 and 7    # use only lowest 3 bits of r1 as index into count[] increment number at count + r1```
Note that most architectures have just one shift left, but two shift rights. Signed or arithmetic one copies the highest bit (sign bit) to new high bits, and unsigned or normal shift sets new high bits zero. Use the signed or arithmetic version, above.

Because negative r1 and r1>=5000 end up not affecting the result at all, it does not matter whether you use signed or unsigned multiply high.

Finally, if you're very tight on memory, too, the above only accesses counter[0], counter[1], counter[2], counter[3], counter[4], and counter[7]; you can use counter[5] and counter[6] for something else.

I hope you get new workable ideas from this.
• 04-17-2013
std10093
Jim I totally agree, but the prof said that he is going to test if, when inputting out of range number, he gets an "error" message on the screen, but he is going to evaluate our work, with normal input. So, let's focus on normal input :)

The input is going to be balanced, that is 20 numbers per range (I do not know if they are going to come consecutive or not).
Division takes 24 cycles to be executed in our simulator (winmips64). That, I think, is not ok, so we forget about it.

Are you sure that counter1 should be incremented in the line you placed it? Will the balanced input make a difference? Should I go for it, or the linear one? What do you suggest? :)
• 04-17-2013
std10093
@nomimal, long time to see you posting (well I have a long time to post too I have to admit). Nice to see you again :)

winmips64 has this command: lui reg,imm - load upper half of register immediate (which can be somehow useful, right?)

We have these commands for shifting operations,
dsra reg,reg,imm - shift right arithmetic
I think dsar is ok for what you say.

Well actually, what you say seems to be clever˛, but I am trying to get it :) Samaras does not feel comfortable enough with assembly (yet) :)
Also, notice that the registers are of 64 bits in our simulator and using .word32 for the array inputs (for example), would store two values in a register, so the hint was to use .word and not .word32.
Does your idea stands true for 64bit?
• 04-17-2013
Nominal Animal
Oh! It's the Windows 64-bit MIPS simulator! I kind-of assumed you used a real-world 32-bit microcontroller. First, let me explain what my example code does, and why:

When a value is negative, the highest bit will be set. When it is not, it will be zero. This is true on all architectures that use two's complement integers (and all the normal ones do), no matter what the word size is. In particular, if MAXIMUM <= MINIMUM and (MAXIMUM - MINIMUM) fits in a signed integer register, the expression
Code:

`(value - MINIMUM) or (MAXIMUM - value)`
will have a zero sign bit when MINIMUM<=value<=MAXIMUM, and one otherwise.

The 'shift arithmetic right by (word size - 1)' turns that into a mask, where all bits are zero if MINIMUM<=value<=MAXIMUM, and one otherwise. (Note that for WinMIPS the maximum allowed shift using a constant is 31, so you'd need to do three shifts to shift by 63 bits, or use the dsrav instruction with 63 stored in a register, for a variable shift by 63 bits.)

Next, you calculate the index. You only need to care that the index is valid if MINIMUM<=value<=MAXIMUM. In the other cases the next step will override the value you calculated anyway.

Finally, you OR the mask and the index together, and AND with a suitable mask having low bits set to ones. (The mask defines the size of the array, and is a power of two, minus one. Here, it is 7, since 8 is the next power of two large enough to contain the interesting slots, and the high "garbage" or "other" slot.) If the value was within MINIMUM and MAXIMUM, inclusive, the proper slot is used, because the OR mask was zero, and the result of the index calculation is used as-is. Otherwise, all the bits in the OR mask are ones, and the result of the index calculation does not matter. If all the bits were ones, the AND mask turns that into the last index, the same as the AND mask. (But remember, the AND mask must be a power of two, less one, for a power of two count array size.)

It's a very useful trick. Only use it if you understand how it works. However, WinMIPS64 has instructions that make it even easier. On to simpler solutions!

The simulator supports some very useful instructions. Here are the interesting ones (I wouldn't use ALL of them, though -- just making you think more.. :))
Code:

```movz RTARGET, RSOURCE, RTEST     If the test register is zero, source is copied to target.     If the test register is nonzero, nothing happens. movnz RTARGET, RSOURCE, RTEST     If the test register is zero, nothing happens.     If the test register is nonzero, source is copied to target. dsra RTARGET, RSOURCE, BITS     As you wrote, this shifts source BITS bits right,     copying the sign bit to the new high bits.     WinMIPS only allows 0 <= BITS <= 31. dsra RTARGET, RSOURCE, RN     This shifts source down by the number of bits stored in the N register,     and stores the result in the target register.     I couldn't find the limitations, but I suspect the value in N register     must be 0..63, inclusive. slt RTARGET, RA, RB     If register A is less than register B using signed 64-bit integers,     then target register is set to 1.     If register A is equal or greater to register B using 64-bit signed integers,     then target register is set to 0. dsub RTARGET, RSOURCE, RVALUE     target = source - value```
To avoid all branching, your basic approach should have a count array with a power of two entries (8, in this case), and use the masking logic, so that the last slot in the array is used if the value is outside the acceptable limits.

The above instructions allow two different ways to compute the mask very efficiently.

First one is to use the slt instruction to compare the value to the limits, and then the conditional moves to copy either a zero or -1 (which is all bits 1 in binary) to a mask register, then OR the masks together. In the inner loop, this should be doable in five instructions (assuming you have loaded the constants from data before the start of the loop, and don't modify them). Remember, WinMIPS64 has lots of registers you can use; keep the most needed constants in registers.

The second one is closer to my previous example. The basic idea is based on the fact that for two's complement numbers,
Code:

```Decimal  Binary  0 = 00000000...00000000 -1 = 11111111...11111111 -2 = 11111111...11111110```
Here, you again use the slt instruction to compare the data value to the upper and lower boundaries. Instead of using the conditional moves, you subtract the 0 or 1 the slt yielded from a register initially set to zero. If the value is within the bounds, the initially zero register will have all bits zero if the value was within the bounds, and all bits except the least significant bit ones otherwise. The lowest bit may be zero or one if the value was outside the bounds, but an arithmetic shift right by one bit will take care of that. This approach should also be doable in five instructions, except you probably need a sixth one every iteration at the start, just to clear out the mask register.

I didn't find any mention of WinMIPS64 instruction timings, so I have no idea which approach (or perhaps even a hybrid one?) would be the fastest.

As to the division, the multiplication by a reciprocal always works (within the rounding errors, which you have to check -- I already checked those I've shown myself), if you choose a suitable power of two divisor. Since WinMIPS64 does not have a multiply-high instruction, you need to explicitly code the shift, too. Since the shift is limited to 31 on WinMIPS64, it makes sense to use a reciprocal to 231 instead; i.e. 2147484/231 = 0.00100000016391277313232421875. (On many architectures the variable-length shifts tend to be slower than fixed-size shifts; I suspect the same of WinMIPS64.)
So, if you use
Code:

```rA = data rB = 2147484 rC = rA * rB rA = rC shifted right 31```
you'll get rA=data/1000 for 0<=data<=6100998.

If you first read the data word, then do the limit part, then the index calculation part, and finally the mask OR'ing and AND'ing, you only need to read the data word once, and you should have pretty tight code.

Note that on WinMIPS64, immediate constants are limited to 16 bits. You should set your limits (4999, 2147484, and so on) as words in the data segment, then load them into registers at the start of your loop, and not modify them within the loop. Remember, the WinMIPS64 has plenty of registers, you can put some of them to good use even if they hold just constants!

If there's something I should clarify or forgot to explain, just let me know.
• 04-18-2013
std10093
Firstly, I would like to thank all of you for answering. All of you are respectful members of the forum to me :)

You explained pretty good. I went for Salem's way (which used division).

Notice, that in winMips64, we have one ALU, which needs 7 cycles for a mul, 4 for a FP adder, 24 for a division and then, for all the others, only one unit (EX), to execute the operation in need.

I got these results(For an array of 9 elements): (remember that cycles is the goal!)
Division
Cycles = 381, I = 108, CPI = 3.528, 468 Raw stalls, 9 structural stalls, 8 branch taken stalls.
Raw stalls are what makes this solution to loose, because, division takes more than 20 cycles and the next commands need its result, thus they all wait for the division to be completed, in every loop!
The structural hazard, comes from the fact, that almost all arithmetic operations have to be executed by one ALU, thus conflicts occur.
I used the branch target buffer(BTF) and cycles reduced to 377. (some stalls modified)

Now, the way Salem said, but by using what nomimal said, mul and shift. (again with the BTB active).
Reciprocal
Cycles = 233, I = 117, CPI = 1.1991, 144 Raw stalls, 18 structural stalls, 2 branch taken stalls, 2 branch misprediction. (the branch stalls are the first time and the last time of the branch for the loop, thus they will remain constant, even when the elements of the array are 100, so let's not worry about them).
Ok, now, I present you the code and then I am going to say why we have stalls and then eagerly wait for your answers and ideas :D
Well, now I see that the measurements where taken with printing the counter array, which is not a requirement (the requirement is to store them in specific registers).
Code:

```.text ld r4, sizeArray(r12) ld r7, oneThousand(r12) ld r11, arch(r12) ld r13, operand(r12) loop: ld        r3, numInput(r8) #ddiv r3, r3,r7 slt r5, r8, r4 dmul r14, r3, r13 dsra r3, r14, 31 dmul r9, r3, r11 daddi r8, r8, 8 ld r12, counter(r9) daddi r12, r12, 1 sd r12, counter(r9) bnez r5, loop #out of loop, r5 = 0 ld r17, counter(r5) daddi r5, r5, 8 ld r18, counter(r5) daddi r5, r5, 8 ld r19, counter(r5) daddi r5, r5, 8 ld r20, counter(r5) daddi r5, r5, 8 ld r21, counter(r5) halt        #ciao... .data numInput: .word 2555, 2556, 2557, 2665, 2666, 2667, 2668, 2669, 4999 sizeArray: .word 64 # 8*(N - 1) oneThousand: .word 1000 arch: .word 8 operand: .word 2147484 counter: .word 0, 0, 0, 0, 0```
The RAW stalls happen because of the others have to wait for mul's result (dsra has to wait for it, but then the next dmul, waits for the result of dsra!).
Then the first dmul, keep the ID stage busy, which makes daddi r8, to wait until the dmul finishes, plus a cycle, in order the result to traverse the other commands among them.
Then, while dmul r14 is on MEM, dsra on EX and dmul r9 on ID (waiting for the result of dsra), a structural stall occurs in ALU (two commands are trying to access EX simultaneously(not sure which two-or more).
Of course, daddi r8 also waits, because ID is busy by dmul r9.

Then pretty much, the same things happen. I am trying to reorder the code, but did not have any ideas that reduced the number of cycles.

By the way, winmips64 saw all these, in a nice simulation. Download from here if you wish.

I am waiting for your comments on this code and tomorrow I am planning to write the code with branches, in order to compare results.

EDIT: If we could replace array[i]*2147484 (which is executed in every loop), with something that does not use mul, then this would help!!)

IDEA: loop unrolling :)
• 04-19-2013
std10093
For N=21, I got 497 cycles with the code above.
I replaced line 14, with this
Code:

`dsll r9, r3, 3`
which resulted in 391 cycles!!! The case is that shifts take only one stage in ALU to be executed, while mul takes 7!!!

Now, if only I could replace the array[i]*2147484 with some other command(s), except division then, this would benefit me a lot.

EDIT: 2147484=2^21+2^15+2^14+2^10+2^7+2^4+2^3+2^2, but I do not know if this can help.. It has 8 terms, thus 8 shifts...
• 04-19-2013
Nominal Animal
Well, instead of multiplying by the reciprocal, you can use slt or slti to evaluate
Code:

`counter_offset = ( 5 - (value < 5000) ? 1 : 0) - (value < 4000) ? 1 : 0) - (value < 3000) ? 1 : 0) - (value < 2000) ? 1 : 0) - (value < 1000) ? 1 : 0) - (value < 0) ? 1 : 0) ) << 3`
which yields -1 if value < 0, 0 if 0<=value<1000, 1 if 1000<=value<2000, ..., 5 if 5000<=value. In other words, you reserve one word before the five counter elements for underflow, and one word to catch the overflow.

This runs in 206 cycles for nine elements, with 18 RAW stalls, at 1.17 cycles per instruction:
Code:

```            .text             andi    r0, r4, 0      ; r0=0, constant             andi    r1, r4, 0      ; r1=0, offset to Input             ld      r2, Length(r0)  ; r2=Length, limit to offset             ori    r3, r0, 5      ; r3=5, number of Counter slots loop:             ld      r4, Input(r1)  ; Load next value from Input             slti    r15, r4, 5000  ; r15 = (r4 < 5000) ? 1 : 0             slti    r14, r4, 4000  ; r14 = (r4 < 4000) ? 1 : 0             dsub    r5, r3, r15    ; r5 = r3 - r15             slti    r13, r4, 3000  ; r13 = (r4 < 3000) ? 1 : 0             dsub    r5, r5, r14    ; r5 = r5 - r14             slti    r12, r4, 2000  ; r12 = (r4 < 2000) ? 1 : 0             dsub    r5, r5, r13    ; r5 = r5 - r13             slti    r11, r4, 1000  ; r11 = (r4 < 1000) ? 1 : 0             dsub    r5, r5, r12    ; r5 = r5 - r12             slt    r10, r4, r0    ; r10 = (r4 <    0) ? 1 : 0             dsub    r5, r5, r11    ; r5 = r5 - r11             daddi  r1, r1, 8      ; Increase offset to next value             dsub    r5, r5, r10    ; r5 = r5 - r10             dsll    r5, r5, 3      ; r5 = 8*r5: -8, 0, 8, 16, 24, 32, 40.             ld      r6, Counter(r5)             daddi  r6, r6, 1             sd      r6, Counter(r5)                      bne    r1, r2, loop    ; Loop until all done             halt             .data             ; Number of values * 8 Length:    .word 72             ; Values to classify Input:      .word 2555, 2556, 2557, 2665, 2666, 2667, 2668, 2669, 4999             ; Counts for [0,999], ... [4000,4999].             .word 0                ; Underflow slot Counter:    .word 0, 0, 0, 0, 0    ; Data slots             .word 0                ; Overflow slot```
For 21 input values, it takes 470 cycles. Your code, with the modification, takes 395 cycles, using the same inputs.

However, mine handles inputs outside the expected [0, 4999] range too. If I remove that, it runs in 386 cycles. (There are probably a few cycles to be saved by rearranging the instructions, or using more registers to avoid the 42 raw stalls I get with 21 input values -- two raw stalls per loop.)

As it happens, using several slti + dsub pairs to calculate the index directly seems slightly faster -- for five target slots -- than the multiplication by reciprocal. Obviously, if you had fewer slots, it'd be even faster. But, if you had more slots, even just 6, it'd be slower than the multiplication method! (It's pretty interesting the exercise has hit the exact spot for this.)

Because doing a histogram -- and that's what this is -- usually involves more than five slots, I'd say your code, the one that uses multiplications, is the better one, even if the one shown here is a tiny bit faster. The reason is that maintaining and adapting that code to changing needs is much easier, and changing the number of slots does not really affect the running time. In this code, changing the number of slots is pretty hairy, and will directly affect the running time.

I wanted to show this full code because it is yet another approach to the same problem, and while it may be a tiny bit faster, I would not really recommend using it. (Thus, just telling you and others about it might give the wrong impression, and waste time in developing a solution that works, but is extremely fragile wrt. changing requirements.) For an exercise it might be okay as an answer -- I'd personally submit both, in case the instructor only looks at the timings, but so that you could pipe up at the exercise session and explain why the approach is not a good idea in the general case. (As an instructor, I'd personally give an extra point for that sort of important observation. I don't know if anybody else would, though.)

I personally would not use this approach in real-world code, unless I was absolutely sure the number of slots would never change, and it was timing-critical code on a MIPS-based microcontroller. While the multiplication-via-reciprocal may be slower for two-to-five slots, it's much more robust, and in my opinion, the better solution.

I hope you found this interesting!
• 04-19-2013
std10093
As always, the case is full of trade offs! I found them interesting˛.. I am working on it.
For N = 100, the code that does one mul goes better.
Question:2147484 <- How did this come to play? ;p

What also bothers me a bit, is that I have to load the counters in specific registers (r17-r21), but we have it -at both codes- in an array, thus, right before halting the program, I have to use 5 loads and 4 additions.. I can't figure out how I can run away of them!
• 04-19-2013
Nominal Animal
Quote:

Originally Posted by std10093
Question:2147484 <- How did this come to play?

It's the reciprocal of 1000 using 31-bit integers:

231 / 1000 = 2147483648 / 1000 = 2147483.648 ≃ 2147484

The value is rounded up, because that gives the correct integer result in a larger range of values.

When you multiply by that value, then shift right 31 bits, you really multiply by
2147484 / 231 = 0.00100000016391277313232421875, which is just over 1/1000 = 0.001.

(If you rounded down, that factor would be less than 0.001, and you'd get incorrect integer result at 1000. That's why you round up, earlier, to get the factor at or above what you need, so that the result is correct in a wider range of inputs.)

Quote:

Originally Posted by std10093
What also bothers me a bit, is that I have to load the counters in specific registers (r17-r21), but we have it -at both codes- in an array, thus, right before halting the program, I have to use 5 loads and 4 additions.. I can't figure out how I can run away of them!

Hm. You could always use the slti method to count
r17 = number of occurrences < 1000
r18 = number of occurrences < 2000
r19 = number of occurrences < 3000
r20 = number of occurrences < 4000
r21 = number of occurrences < 5000
directly, without storing in any kind of an array, then just before halt, subtract the sum of the lower counts from the upper counts, i.e.
Code:

```r21 = r21 - r20 r20 = r20 - r19 r19 = r19 - r18 r18 = r18 - r17```
which will yield the correct counters in the correct registers. Note that the order is important, because
Code:

```count of 4000 .. 4999 = (count of less than 5000) - (count of less than 4000) count of 3000 .. 3999 = (count of less than 4000) - (count of less than 3000) count of 2000 .. 2999 = (count of less than 3000) - (count of less than 2000) and so on```
For the life of me, I can't think of why I didn't do that in my previous snippet. ;)
• 04-20-2013
Nominal Animal
I just wrote and checked the code I described above in #12. It runs with just one "branch taken stall" per iteration, no other stalls (in particular, zero RAW stalls), and runs in about 17 × values + 20 clock cycles.

Running some test runs, the histogram calculations take
9 values: 173 cycles
21 values: 377 cycles
100 values: 1720 cycles

That code is not optimal, although I did manage to interleave all the slti and dadd instructions to avoid all stalls. Since the code handles both underflows (values < 0) and overflows (values >= 5000), accumulating them in registers r16 and r22, respectively, it could be a lot faster if values are known to always be within 0..4999, inclusive. The code size is 132 bytes, and the code runs in 1.064 cycles per instruction on average.

But, remember, there is room for improvement yet..
• 04-20-2013
std10093
I unrolled the loop(5 at a time, despite the fact that mul takes 7 stages) and made input check too.
For N = 100
#Stats: C = 933, I = 862, CPI = 1.082
#Stalls: 1 RAW, 60 structural, 4 branch taken, 2 branch misprediction

For the 2147484:
>you'll get rA=data/1000 for 0<=data<=6100998 (post #7)
How did you found the upper bound?

As for the slti you said, in order to get rid of the array of counters and use the required registers, I did not understand you. Maybe the fact that it is 5:50 at the morning has some impact.

//The fact that we have a number of cycles < 1000, should make us all feel great :D
• 04-21-2013
Salem
Re Nominal Animal's post #10
Code:

```            dsub    r5, r5, r10    ; r5 = r5 - r10             dsll    r5, r5, 3      ; r5 = 8*r5: -8, 0, 8, 16, 24, 32, 40.             ld      r6, Counter(r5)             daddi  r6, r6, 1             sd      r6, Counter(r5)                      bne    r1, r2, loop    ; Loop until all done               halt```
Search "delay slot" in http://www.weblearn.hs-bremen.de/ris...S/mips-isa.pdf
On first reading, I thought "how is the halt not executed in the delay slot?" On RTFM, it seems there are two kinds of branch instructions, one of which does not make use of the delay slot.

So the first tweak might be to use the delay slot executing branch by writing
bne r1, r2, loop ; Loop until all done (use the other kind of bne!!)
sd r6, Counter(r5)

However, this might not be such a good idea for a couple of reasons.
1. Apparently, the delayed instruction isn't used(?) if the branch is not taken.
2. Some descriptions indicate that branch delay slot instructions are deprecated, so it might not even be available.

The other issue is load delay slots, where load followed by use causes a stall. An example of this is the ld followed by addi
With a little reordering, perhaps this.
Code:

```            .text             andi    r0, r4, 0      ; r0=0, constant             andi    r1, r4, 0      ; r1=0, offset to Input             ld      r2, Length(r0)  ; r2=Length, limit to offset             ori    r3, r0, 5      ; r3=5, number of Counter slots             ld      r4, Input(r1)  ; Load FIRST value from Input loop:             slti    r15, r4, 5000  ; r15 = (r4 < 5000) ? 1 : 0             slti    r14, r4, 4000  ; r14 = (r4 < 4000) ? 1 : 0             dsub    r5, r3, r15    ; r5 = r3 - r15             slti    r13, r4, 3000  ; r13 = (r4 < 3000) ? 1 : 0             dsub    r5, r5, r14    ; r5 = r5 - r14             slti    r12, r4, 2000  ; r12 = (r4 < 2000) ? 1 : 0             dsub    r5, r5, r13    ; r5 = r5 - r13             slti    r11, r4, 1000  ; r11 = (r4 < 1000) ? 1 : 0             dsub    r5, r5, r12    ; r5 = r5 - r12             slt    r10, r4, r0    ; r10 = (r4 <    0) ? 1 : 0             dsub    r5, r5, r11    ; r5 = r5 - r11             daddi  r1, r1, 8      ; Increase offset to next value             dsub    r5, r5, r10    ; r5 = r5 - r10             dsll    r5, r5, 3      ; r5 = 8*r5: -8, 0, 8, 16, 24, 32, 40.             ld      r6, Counter(r5)             ld      r4, Input(r1)  ; Load NEXT value from Input             daddi  r6, r6, 1             sd      r6, Counter(r5)                      bne    r1, r2, loop    ; Loop until all done             halt             .data             ; Number of values * 8 Length:    .word 72             ; Values to classify Input:      .word 2555, 2556, 2557, 2665, 2666, 2667, 2668, 2669, 4999             ; Counts for [0,999], ... [4000,4999].             .word 0                ; Underflow slot Counter:    .word 0, 0, 0, 0, 0    ; Data slots             .word 0                ; Overflow slot```
Putting the load of r4 just after the load of r6 serves two purposes
- it kills off the load delay of r6 when it comes to the daddi
- it also kills off the load delay of r4 when it comes to the slti (for all loop iterations)
Show 80 post(s) from this thread on one page
Page 1 of 2 12 Last