Like Tree2Likes

Optimization issues - minimize branches

This is a discussion on Optimization issues - minimize branches within the Tech Board forums, part of the Community Boards category; My goal is to minimize the number of branches (in assembly), which are very costly in our architecture. As a ...

  1. #1
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,681

    Optimization issues - minimize branches

    My goal is to minimize the number of branches (in assembly), which are very costly in our architecture. As a consequence, if I minimize the branches, then I would say that I can minimize the number of cycles, which is the real goal.

    The problem:
    An array of 100 elements, which are to be in range [0,4999]. I have to count how many numbers are in [0,999], [1000, 1999], [2000, 2999], [3000, 3999], [4000, 4999]. If the an element is out of bound, print error message.

    My current idea is to loop over every element of the array and do this
    • If A[i] < 1000 and A[i] >= 0, counter1++ and go to the next element
    • If A[i] < 2000, counter2++ and go to the next
      ...
    • If A[i] < 5000, counter5++ and go to the next

    Then instead of checking if an element is >4999, which would be executed as many times as the size of the array, to add the counters and if the result is equal to the size of the array, then we are ok, if not print error message (at least one element was out of range).

    What do you say? Any ideas?
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  2. #2
    Registered User
    Join Date
    May 2010
    Posts
    2,760
    Depending on how many numbers fall outside your desired range you may save time by first insuring the values are within your desired range. If they're outside the correct ranges you don need to do any of the other checks.

    Jim

  3. #3
    and the hat of wrongness Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,558
    How expensive is division on your machine?

    For example, this should have no branches at all (save for the one in the for loop), but may be expensive on the /
    counter[ A[i]/1000 ] ++;

    Is the distribution of numbers random, or is something useful known about it?

    Use a binary partition, although you don't get far with only 5 bins to choose from.
    Code:
    if ( A[i] < 3000 ) {
      if ( A[i] < 1000 ) {
        counter1++;
      } else if ( A[i] < 2000 ) {
        counter2++;
      } else {
        counter3++;
      }
    } else {
      if ( A[i] < 4000 ) {
        counter4++;
      } else {
        counter5++;
      }
    }
    This has at most 3 comparisons (and branches), as opposed to 5 when done linearly, but it always costs at least 2 comparisons.
    If your distribution is heavily biased to one particular range (which you check first), it may be worse than linear which is often out on the first comparison.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
    I support http://www.ukip.org/ as the first necessary step to a free Europe.

  4. #4
    Registered User
    Join Date
    Oct 2011
    Posts
    847
    I'm on roughly the same path with Salem's first suggestion, but I'd go a lot further.

    Because the divisor is a constant, you don't need real division; you can multiply by the reciprocal instead.

    If the architecture is 32-bit, I would use
    Code:
    if (value >= 0 and value < 5000)
       counter[ (4294968 * value) >> 32 ]++
    Note that 232 / 1000 = 4294967.296, so the reciprocal multiply gives a correct result only for 0 <= value < 6100999 without additional corrections. Since your interested range is just 0 to 4999, that'll do fine; just remember to add a comment in the source code about that!

    Most architectures have a multiply instruction that returns only the high word -- "multiply high" --, so the full-word shift is implicit.

    You can avoid branches completely, by reserving a slot with an index just under a power of two; i.e. having all bits set in the index. Because here the next power of two is 8, and index 7 is unused, we can use that (counter[7]) to collect all the unwanted entries -- we get the tally of "other" entries for free!

    Here's my rough pseudo-assembly outline for the check:
    Code:
    r1 = value
    
    r2 = 4999
    r2 = r2 - r1      # will be negative (high bit set) if r1 > 4999
    r2 = r2 or r1     # will become negative if r1 < 0
    r2 = r2 sar 31    # shift arithmetic right, duplicates high bit to all bits
    
    # Divide by 1000 by multiplying by 4294968/232 = 0.00100000016391277313232421875.
    # Only yields valid result for 0 <= r1 <= 4999 < 6100999.
    r1 = (signed or unsigned) multiply high r1 and 4294968
    r1 = r1 or r2     # if value was <0 or >4999, r1 ends up all ones here anyway.
    r1 = r1 and 7     # use only lowest 3 bits of r1 as index into count[]
    
    increment number at count + r1
    Note that most architectures have just one shift left, but two shift rights. Signed or arithmetic one copies the highest bit (sign bit) to new high bits, and unsigned or normal shift sets new high bits zero. Use the signed or arithmetic version, above.

    Because negative r1 and r1>=5000 end up not affecting the result at all, it does not matter whether you use signed or unsigned multiply high.

    Finally, if you're very tight on memory, too, the above only accesses counter[0], counter[1], counter[2], counter[3], counter[4], and counter[7]; you can use counter[5] and counter[6] for something else.

    I hope you get new workable ideas from this.
    Last edited by Nominal Animal; 04-17-2013 at 04:35 PM.

  5. #5
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,681
    Jim I totally agree, but the prof said that he is going to test if, when inputting out of range number, he gets an "error" message on the screen, but he is going to evaluate our work, with normal input. So, let's focus on normal input

    The input is going to be balanced, that is 20 numbers per range (I do not know if they are going to come consecutive or not).
    Division takes 24 cycles to be executed in our simulator (winmips64). That, I think, is not ok, so we forget about it.

    Are you sure that counter1 should be incremented in the line you placed it? Will the balanced input make a difference? Should I go for it, or the linear one? What do you suggest?
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  6. #6
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,681
    @nomimal, long time to see you posting (well I have a long time to post too I have to admit). Nice to see you again

    winmips64 has this command: lui reg,imm - load upper half of register immediate (which can be somehow useful, right?)

    We have these commands for shifting operations,
    dsra reg,reg,imm - shift right arithmetic
    I think dsar is ok for what you say.

    Well actually, what you say seems to be clever˛, but I am trying to get it Samaras does not feel comfortable enough with assembly (yet)
    Also, notice that the registers are of 64 bits in our simulator and using .word32 for the array inputs (for example), would store two values in a register, so the hint was to use .word and not .word32.
    Does your idea stands true for 64bit?
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  7. #7
    Registered User
    Join Date
    Oct 2011
    Posts
    847
    Oh! It's the Windows 64-bit MIPS simulator! I kind-of assumed you used a real-world 32-bit microcontroller. First, let me explain what my example code does, and why:

    When a value is negative, the highest bit will be set. When it is not, it will be zero. This is true on all architectures that use two's complement integers (and all the normal ones do), no matter what the word size is. In particular, if MAXIMUM <= MINIMUM and (MAXIMUM - MINIMUM) fits in a signed integer register, the expression
    Code:
    (value - MINIMUM) or (MAXIMUM - value)
    will have a zero sign bit when MINIMUM<=value<=MAXIMUM, and one otherwise.

    The 'shift arithmetic right by (word size - 1)' turns that into a mask, where all bits are zero if MINIMUM<=value<=MAXIMUM, and one otherwise. (Note that for WinMIPS the maximum allowed shift using a constant is 31, so you'd need to do three shifts to shift by 63 bits, or use the dsrav instruction with 63 stored in a register, for a variable shift by 63 bits.)

    Next, you calculate the index. You only need to care that the index is valid if MINIMUM<=value<=MAXIMUM. In the other cases the next step will override the value you calculated anyway.

    Finally, you OR the mask and the index together, and AND with a suitable mask having low bits set to ones. (The mask defines the size of the array, and is a power of two, minus one. Here, it is 7, since 8 is the next power of two large enough to contain the interesting slots, and the high "garbage" or "other" slot.) If the value was within MINIMUM and MAXIMUM, inclusive, the proper slot is used, because the OR mask was zero, and the result of the index calculation is used as-is. Otherwise, all the bits in the OR mask are ones, and the result of the index calculation does not matter. If all the bits were ones, the AND mask turns that into the last index, the same as the AND mask. (But remember, the AND mask must be a power of two, less one, for a power of two count array size.)

    It's a very useful trick. Only use it if you understand how it works. However, WinMIPS64 has instructions that make it even easier. On to simpler solutions!


    The simulator supports some very useful instructions. Here are the interesting ones (I wouldn't use ALL of them, though -- just making you think more.. )
    Code:
    movz RTARGET, RSOURCE, RTEST
        If the test register is zero, source is copied to target.
        If the test register is nonzero, nothing happens.
    
    movnz RTARGET, RSOURCE, RTEST
        If the test register is zero, nothing happens.
        If the test register is nonzero, source is copied to target.
    
    dsra RTARGET, RSOURCE, BITS
        As you wrote, this shifts source BITS bits right,
        copying the sign bit to the new high bits.
        WinMIPS only allows 0 <= BITS <= 31.
    
    dsra RTARGET, RSOURCE, RN
        This shifts source down by the number of bits stored in the N register,
        and stores the result in the target register.
        I couldn't find the limitations, but I suspect the value in N register
        must be 0..63, inclusive.
    
    slt RTARGET, RA, RB
        If register A is less than register B using signed 64-bit integers,
        then target register is set to 1.
        If register A is equal or greater to register B using 64-bit signed integers,
        then target register is set to 0.
    
    dsub RTARGET, RSOURCE, RVALUE
        target = source - value
    To avoid all branching, your basic approach should have a count array with a power of two entries (8, in this case), and use the masking logic, so that the last slot in the array is used if the value is outside the acceptable limits.

    The above instructions allow two different ways to compute the mask very efficiently.

    First one is to use the slt instruction to compare the value to the limits, and then the conditional moves to copy either a zero or -1 (which is all bits 1 in binary) to a mask register, then OR the masks together. In the inner loop, this should be doable in five instructions (assuming you have loaded the constants from data before the start of the loop, and don't modify them). Remember, WinMIPS64 has lots of registers you can use; keep the most needed constants in registers.

    The second one is closer to my previous example. The basic idea is based on the fact that for two's complement numbers,
    Code:
    Decimal   Binary
     0 = 00000000...00000000
    -1 = 11111111...11111111
    -2 = 11111111...11111110
    Here, you again use the slt instruction to compare the data value to the upper and lower boundaries. Instead of using the conditional moves, you subtract the 0 or 1 the slt yielded from a register initially set to zero. If the value is within the bounds, the initially zero register will have all bits zero if the value was within the bounds, and all bits except the least significant bit ones otherwise. The lowest bit may be zero or one if the value was outside the bounds, but an arithmetic shift right by one bit will take care of that. This approach should also be doable in five instructions, except you probably need a sixth one every iteration at the start, just to clear out the mask register.

    I didn't find any mention of WinMIPS64 instruction timings, so I have no idea which approach (or perhaps even a hybrid one?) would be the fastest.

    As to the division, the multiplication by a reciprocal always works (within the rounding errors, which you have to check -- I already checked those I've shown myself), if you choose a suitable power of two divisor. Since WinMIPS64 does not have a multiply-high instruction, you need to explicitly code the shift, too. Since the shift is limited to 31 on WinMIPS64, it makes sense to use a reciprocal to 231 instead; i.e. 2147484/231 = 0.00100000016391277313232421875. (On many architectures the variable-length shifts tend to be slower than fixed-size shifts; I suspect the same of WinMIPS64.)
    So, if you use
    Code:
    rA = data
    rB = 2147484
    rC = rA * rB
    rA = rC shifted right 31
    you'll get rA=data/1000 for 0<=data<=6100998.

    If you first read the data word, then do the limit part, then the index calculation part, and finally the mask OR'ing and AND'ing, you only need to read the data word once, and you should have pretty tight code.

    Note that on WinMIPS64, immediate constants are limited to 16 bits. You should set your limits (4999, 2147484, and so on) as words in the data segment, then load them into registers at the start of your loop, and not modify them within the loop. Remember, the WinMIPS64 has plenty of registers, you can put some of them to good use even if they hold just constants!

    If there's something I should clarify or forgot to explain, just let me know.

  8. #8
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,681
    Firstly, I would like to thank all of you for answering. All of you are respectful members of the forum to me

    You explained pretty good. I went for Salem's way (which used division).

    Notice, that in winMips64, we have one ALU, which needs 7 cycles for a mul, 4 for a FP adder, 24 for a division and then, for all the others, only one unit (EX), to execute the operation in need.

    I got these results(For an array of 9 elements): (remember that cycles is the goal!)
    Division
    Cycles = 381, I = 108, CPI = 3.528, 468 Raw stalls, 9 structural stalls, 8 branch taken stalls.
    Raw stalls are what makes this solution to loose, because, division takes more than 20 cycles and the next commands need its result, thus they all wait for the division to be completed, in every loop!
    The structural hazard, comes from the fact, that almost all arithmetic operations have to be executed by one ALU, thus conflicts occur.
    I used the branch target buffer(BTF) and cycles reduced to 377. (some stalls modified)

    Now, the way Salem said, but by using what nomimal said, mul and shift. (again with the BTB active).
    Reciprocal
    Cycles = 233, I = 117, CPI = 1.1991, 144 Raw stalls, 18 structural stalls, 2 branch taken stalls, 2 branch misprediction. (the branch stalls are the first time and the last time of the branch for the loop, thus they will remain constant, even when the elements of the array are 100, so let's not worry about them).
    Ok, now, I present you the code and then I am going to say why we have stalls and then eagerly wait for your answers and ideas
    Well, now I see that the measurements where taken with printing the counter array, which is not a requirement (the requirement is to store them in specific registers).
    Code:
    .text
    
    ld r4, sizeArray(r12)
    ld r7, oneThousand(r12)
    ld r11, arch(r12)
    ld r13, operand(r12)
    
    loop: ld	r3, numInput(r8)
    
    #ddiv r3, r3,r7
    slt r5, r8, r4
    dmul r14, r3, r13
    dsra r3, r14, 31
    dmul r9, r3, r11
    daddi r8, r8, 8
    ld r12, counter(r9)
    daddi r12, r12, 1
    sd r12, counter(r9)
    bnez r5, loop
    
    #out of loop, r5 = 0
    ld r17, counter(r5)
    daddi r5, r5, 8
    ld r18, counter(r5)
    daddi r5, r5, 8
    ld r19, counter(r5)
    daddi r5, r5, 8
    ld r20, counter(r5)
    daddi r5, r5, 8
    ld r21, counter(r5)
    halt 	#ciao...
    
    .data
    numInput: .word 2555, 2556, 2557, 2665, 2666, 2667, 2668, 2669, 4999
    sizeArray: .word 64 # 8*(N - 1)
    oneThousand: .word 1000
    arch: .word 8
    operand: .word 2147484
    counter: .word 0, 0, 0, 0, 0
    The RAW stalls happen because of the others have to wait for mul's result (dsra has to wait for it, but then the next dmul, waits for the result of dsra!).
    Then the first dmul, keep the ID stage busy, which makes daddi r8, to wait until the dmul finishes, plus a cycle, in order the result to traverse the other commands among them.
    Then, while dmul r14 is on MEM, dsra on EX and dmul r9 on ID (waiting for the result of dsra), a structural stall occurs in ALU (two commands are trying to access EX simultaneously(not sure which two-or more).
    Of course, daddi r8 also waits, because ID is busy by dmul r9.

    Then pretty much, the same things happen. I am trying to reorder the code, but did not have any ideas that reduced the number of cycles.

    By the way, winmips64 saw all these, in a nice simulation. Download from here if you wish.

    I am waiting for your comments on this code and tomorrow I am planning to write the code with branches, in order to compare results.

    EDIT: If we could replace array[i]*2147484 (which is executed in every loop), with something that does not use mul, then this would help!!)

    IDEA: loop unrolling
    Last edited by std10093; 04-18-2013 at 07:34 PM.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  9. #9
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,681
    For N=21, I got 497 cycles with the code above.
    I replaced line 14, with this
    Code:
    dsll r9, r3, 3
    which resulted in 391 cycles!!! The case is that shifts take only one stage in ALU to be executed, while mul takes 7!!!

    Now, if only I could replace the array[i]*2147484 with some other command(s), except division then, this would benefit me a lot.

    EDIT: 2147484=2^21+2^15+2^14+2^10+2^7+2^4+2^3+2^2, but I do not know if this can help.. It has 8 terms, thus 8 shifts...
    Last edited by std10093; 04-19-2013 at 10:44 AM.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  10. #10
    Registered User
    Join Date
    Oct 2011
    Posts
    847
    Well, instead of multiplying by the reciprocal, you can use slt or slti to evaluate
    Code:
    counter_offset = ( 5 - (value < 5000) ? 1 : 0) - (value < 4000) ? 1 : 0) - (value < 3000) ? 1 : 0) - (value < 2000) ? 1 : 0) - (value < 1000) ? 1 : 0) - (value < 0) ? 1 : 0) ) << 3
    which yields -1 if value < 0, 0 if 0<=value<1000, 1 if 1000<=value<2000, ..., 5 if 5000<=value. In other words, you reserve one word before the five counter elements for underflow, and one word to catch the overflow.

    This runs in 206 cycles for nine elements, with 18 RAW stalls, at 1.17 cycles per instruction:
    Code:
                .text
    
                andi    r0, r4, 0       ; r0=0, constant
                andi    r1, r4, 0       ; r1=0, offset to Input
                ld      r2, Length(r0)  ; r2=Length, limit to offset
                ori     r3, r0, 5       ; r3=5, number of Counter slots
    
    loop:
                ld      r4, Input(r1)   ; Load next value from Input
                slti    r15, r4, 5000   ; r15 = (r4 < 5000) ? 1 : 0
                slti    r14, r4, 4000   ; r14 = (r4 < 4000) ? 1 : 0
                dsub    r5, r3, r15     ; r5 = r3 - r15
                slti    r13, r4, 3000   ; r13 = (r4 < 3000) ? 1 : 0
                dsub    r5, r5, r14     ; r5 = r5 - r14
                slti    r12, r4, 2000   ; r12 = (r4 < 2000) ? 1 : 0
                dsub    r5, r5, r13     ; r5 = r5 - r13
                slti    r11, r4, 1000   ; r11 = (r4 < 1000) ? 1 : 0
                dsub    r5, r5, r12     ; r5 = r5 - r12
                slt     r10, r4, r0     ; r10 = (r4 <    0) ? 1 : 0
                dsub    r5, r5, r11     ; r5 = r5 - r11
                daddi   r1, r1, 8       ; Increase offset to next value
                dsub    r5, r5, r10     ; r5 = r5 - r10
                dsll    r5, r5, 3       ; r5 = 8*r5: -8, 0, 8, 16, 24, 32, 40.
                ld      r6, Counter(r5)
                daddi   r6, r6, 1
                sd      r6, Counter(r5)           
                bne     r1, r2, loop    ; Loop until all done
    
                halt
    
                .data
    
                ; Number of values * 8
    Length:     .word 72
    
                ; Values to classify
    Input:      .word 2555, 2556, 2557, 2665, 2666, 2667, 2668, 2669, 4999
    
                ; Counts for [0,999], ... [4000,4999].
                .word 0                 ; Underflow slot
    Counter:    .word 0, 0, 0, 0, 0     ; Data slots
                .word 0                 ; Overflow slot
    For 21 input values, it takes 470 cycles. Your code, with the modification, takes 395 cycles, using the same inputs.

    However, mine handles inputs outside the expected [0, 4999] range too. If I remove that, it runs in 386 cycles. (There are probably a few cycles to be saved by rearranging the instructions, or using more registers to avoid the 42 raw stalls I get with 21 input values -- two raw stalls per loop.)

    As it happens, using several slti + dsub pairs to calculate the index directly seems slightly faster -- for five target slots -- than the multiplication by reciprocal. Obviously, if you had fewer slots, it'd be even faster. But, if you had more slots, even just 6, it'd be slower than the multiplication method! (It's pretty interesting the exercise has hit the exact spot for this.)

    Because doing a histogram -- and that's what this is -- usually involves more than five slots, I'd say your code, the one that uses multiplications, is the better one, even if the one shown here is a tiny bit faster. The reason is that maintaining and adapting that code to changing needs is much easier, and changing the number of slots does not really affect the running time. In this code, changing the number of slots is pretty hairy, and will directly affect the running time.

    I wanted to show this full code because it is yet another approach to the same problem, and while it may be a tiny bit faster, I would not really recommend using it. (Thus, just telling you and others about it might give the wrong impression, and waste time in developing a solution that works, but is extremely fragile wrt. changing requirements.) For an exercise it might be okay as an answer -- I'd personally submit both, in case the instructor only looks at the timings, but so that you could pipe up at the exercise session and explain why the approach is not a good idea in the general case. (As an instructor, I'd personally give an extra point for that sort of important observation. I don't know if anybody else would, though.)

    I personally would not use this approach in real-world code, unless I was absolutely sure the number of slots would never change, and it was timing-critical code on a MIPS-based microcontroller. While the multiplication-via-reciprocal may be slower for two-to-five slots, it's much more robust, and in my opinion, the better solution.

    I hope you found this interesting!

  11. #11
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,681
    As always, the case is full of trade offs! I found them interesting˛.. I am working on it.
    For N = 100, the code that does one mul goes better.
    Question:2147484 <- How did this come to play? ;p

    What also bothers me a bit, is that I have to load the counters in specific registers (r17-r21), but we have it -at both codes- in an array, thus, right before halting the program, I have to use 5 loads and 4 additions.. I can't figure out how I can run away of them!
    Last edited by std10093; 04-19-2013 at 02:31 PM.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  12. #12
    Registered User
    Join Date
    Oct 2011
    Posts
    847
    Quote Originally Posted by std10093 View Post
    Question:2147484 <- How did this come to play?
    It's the reciprocal of 1000 using 31-bit integers:

    231 / 1000 = 2147483648 / 1000 = 2147483.648 ≃ 2147484

    The value is rounded up, because that gives the correct integer result in a larger range of values.

    When you multiply by that value, then shift right 31 bits, you really multiply by
    2147484 / 231 = 0.00100000016391277313232421875, which is just over 1/1000 = 0.001.

    (If you rounded down, that factor would be less than 0.001, and you'd get incorrect integer result at 1000. That's why you round up, earlier, to get the factor at or above what you need, so that the result is correct in a wider range of inputs.)

    Quote Originally Posted by std10093 View Post
    What also bothers me a bit, is that I have to load the counters in specific registers (r17-r21), but we have it -at both codes- in an array, thus, right before halting the program, I have to use 5 loads and 4 additions.. I can't figure out how I can run away of them!
    Hm. You could always use the slti method to count
    r17 = number of occurrences < 1000
    r18 = number of occurrences < 2000
    r19 = number of occurrences < 3000
    r20 = number of occurrences < 4000
    r21 = number of occurrences < 5000
    directly, without storing in any kind of an array, then just before halt, subtract the sum of the lower counts from the upper counts, i.e.
    Code:
    r21 = r21 - r20
    r20 = r20 - r19
    r19 = r19 - r18
    r18 = r18 - r17
    which will yield the correct counters in the correct registers. Note that the order is important, because
    Code:
    count of 4000 .. 4999 = (count of less than 5000) - (count of less than 4000)
    count of 3000 .. 3999 = (count of less than 4000) - (count of less than 3000)
    count of 2000 .. 2999 = (count of less than 3000) - (count of less than 2000)
    and so on
    For the life of me, I can't think of why I didn't do that in my previous snippet.

  13. #13
    Registered User
    Join Date
    Oct 2011
    Posts
    847
    I just wrote and checked the code I described above in #12. It runs with just one "branch taken stall" per iteration, no other stalls (in particular, zero RAW stalls), and runs in about 17 × values + 20 clock cycles.

    Running some test runs, the histogram calculations take
    9 values: 173 cycles
    21 values: 377 cycles
    100 values: 1720 cycles

    That code is not optimal, although I did manage to interleave all the slti and dadd instructions to avoid all stalls. Since the code handles both underflows (values < 0) and overflows (values >= 5000), accumulating them in registers r16 and r22, respectively, it could be a lot faster if values are known to always be within 0..4999, inclusive. The code size is 132 bytes, and the code runs in 1.064 cycles per instruction on average.

    But, remember, there is room for improvement yet..

  14. #14
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,681
    I unrolled the loop(5 at a time, despite the fact that mul takes 7 stages) and made input check too.
    For N = 100
    #Stats: C = 933, I = 862, CPI = 1.082
    #Stalls: 1 RAW, 60 structural, 4 branch taken, 2 branch misprediction

    For the 2147484:
    >you'll get rA=data/1000 for 0<=data<=6100998 (post #7)
    How did you found the upper bound?

    As for the slti you said, in order to get rid of the array of counters and use the required registers, I did not understand you. Maybe the fact that it is 5:50 at the morning has some impact.

    //The fact that we have a number of cycles < 1000, should make us all feel great
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  15. #15
    and the hat of wrongness Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,558
    Re Nominal Animal's post #10
    Code:
                dsub    r5, r5, r10     ; r5 = r5 - r10
                dsll    r5, r5, 3       ; r5 = 8*r5: -8, 0, 8, 16, 24, 32, 40.
                ld      r6, Counter(r5)
                daddi   r6, r6, 1
                sd      r6, Counter(r5)          
                bne     r1, r2, loop    ; Loop until all done
     
                halt
    Search "delay slot" in http://www.weblearn.hs-bremen.de/ris...S/mips-isa.pdf
    On first reading, I thought "how is the halt not executed in the delay slot?" On RTFM, it seems there are two kinds of branch instructions, one of which does not make use of the delay slot.

    So the first tweak might be to use the delay slot executing branch by writing
    bne r1, r2, loop ; Loop until all done (use the other kind of bne!!)
    sd r6, Counter(r5)

    However, this might not be such a good idea for a couple of reasons.
    1. Apparently, the delayed instruction isn't used(?) if the branch is not taken.
    2. Some descriptions indicate that branch delay slot instructions are deprecated, so it might not even be available.

    The other issue is load delay slots, where load followed by use causes a stall. An example of this is the ld followed by addi
    With a little reordering, perhaps this.
    Code:
                .text
    
                andi    r0, r4, 0       ; r0=0, constant
                andi    r1, r4, 0       ; r1=0, offset to Input
                ld      r2, Length(r0)  ; r2=Length, limit to offset
                ori     r3, r0, 5       ; r3=5, number of Counter slots
    
                ld      r4, Input(r1)   ; Load FIRST value from Input
    loop:
                slti    r15, r4, 5000   ; r15 = (r4 < 5000) ? 1 : 0
                slti    r14, r4, 4000   ; r14 = (r4 < 4000) ? 1 : 0
                dsub    r5, r3, r15     ; r5 = r3 - r15
                slti    r13, r4, 3000   ; r13 = (r4 < 3000) ? 1 : 0
                dsub    r5, r5, r14     ; r5 = r5 - r14
                slti    r12, r4, 2000   ; r12 = (r4 < 2000) ? 1 : 0
                dsub    r5, r5, r13     ; r5 = r5 - r13
                slti    r11, r4, 1000   ; r11 = (r4 < 1000) ? 1 : 0
                dsub    r5, r5, r12     ; r5 = r5 - r12
                slt     r10, r4, r0     ; r10 = (r4 <    0) ? 1 : 0
                dsub    r5, r5, r11     ; r5 = r5 - r11
                daddi   r1, r1, 8       ; Increase offset to next value
                dsub    r5, r5, r10     ; r5 = r5 - r10
                dsll    r5, r5, 3       ; r5 = 8*r5: -8, 0, 8, 16, 24, 32, 40.
                ld      r6, Counter(r5)
                ld      r4, Input(r1)   ; Load NEXT value from Input
                daddi   r6, r6, 1
                sd      r6, Counter(r5)           
                bne     r1, r2, loop    ; Loop until all done
    
                halt
    
                .data
    
                ; Number of values * 8
    Length:     .word 72
    
                ; Values to classify
    Input:      .word 2555, 2556, 2557, 2665, 2666, 2667, 2668, 2669, 4999
    
                ; Counts for [0,999], ... [4000,4999].
                .word 0                 ; Underflow slot
    Counter:    .word 0, 0, 0, 0, 0     ; Data slots
                .word 0                 ; Overflow slot
    Putting the load of r4 just after the load of r6 serves two purposes
    - it kills off the load delay of r6 when it comes to the daddi
    - it also kills off the load delay of r4 when it comes to the slti (for all loop iterations)
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
    I support http://www.ukip.org/ as the first necessary step to a free Europe.

Page 1 of 2 12 LastLast
Popular pages Recent additions subscribe to a feed

Similar Threads

  1. How to count Branches
    By husslela2 in forum C Programming
    Replies: 35
    Last Post: 04-28-2010, 12:33 AM
  2. redraw after minimize
    By Prads in forum Windows Programming
    Replies: 6
    Last Post: 03-21-2010, 11:16 AM
  3. minimize and maximize
    By Wick in forum Windows Programming
    Replies: 2
    Last Post: 08-20-2003, 10:37 PM
  4. Minimize itself?
    By MassKid in forum C Programming
    Replies: 4
    Last Post: 02-06-2002, 05:47 AM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21