Thread: How many CPU cycles (ARM) for add vs if..then

  1. #1
    Registered User
    Join Date
    Oct 2004
    Posts
    36

    How many CPU cycles (ARM) for add vs if..then

    Hi all,

    Once again the ARM guy is back with another random question

    I am implementing a blitting routine that supports transparency (ON or OFF only, not intermediate levels). I am wondering if anybody knows which of the following techniques is faster in terms of actual cycles, because I have been told to stay away from conditional statements like the plague when it comes to optimisation, but I'm wondering how far to take it (how many cycles does a conditional take anyway?).

    Method 1:
    Before blitting a pixel, simply check whether the colour is the transparent colour and, if it is, don't blit it:
    Code:
    for (i = 0; i < all_pixels; i++){
      colour = source[i];
      if (colour != trans_colour){  dest[i] = colour;  }
    }
    Method 2:
    I have a mask the same size as the image. Each 'pixel' in the mask is either fully off (00000000) or fully on (11111111). I combine the source and destination using the mask and the inverse mask (one of the two is always zero) to get the following code:
    Code:
    for (i = 0; i < all_pixels; i++){
      maskbyte = mask[i];
      dest[i] = (source[i] & maskbyte)+(dest[i] & ~maskbyte); // ~ is bitwise inverse
    }
    So I am replacing a conditional with two ANDs, an addition and a bitwise inverse. Which is faster (i.e. fewer cycles)?

    NOTE: Please ignore the added memory requirements of the mask; that is a different argument.

  2. #2
    Salem
    and the hat of int overfl
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Why don't you time it and find out?

    Just put the code in a loop which runs say 500M times, and the one which takes less time is the winner.
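    For instance, a rough harness sketch along those lines (clock() from <time.h>; the buffer size, repeat count and transparent colour are made-up numbers, and the final printf is only there so the optimiser can't throw the work away):
    Code:
    #include <stdio.h>
    #include <time.h>

    #define NPIXELS   4096
    #define NREPEATS  100000L   /* tune until each test runs for a few seconds */
    #define TRANS     0x1F      /* made-up transparent colour */

    static unsigned char source[NPIXELS], dest[NPIXELS], mask[NPIXELS];

    int main(void)
    {
        clock_t t;
        long r;
        int i;

        for (i = 0; i < NPIXELS; i++) {             /* repeatable test data */
            source[i] = (unsigned char)i;
            mask[i]   = (source[i] == TRANS) ? 0x00 : 0xFF;
        }

        t = clock();                                /* Method 1: conditional */
        for (r = 0; r < NREPEATS; r++)
            for (i = 0; i < NPIXELS; i++) {
                unsigned char colour = source[i];
                if (colour != TRANS) dest[i] = colour;
            }
        printf("if..then: %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();                                /* Method 2: mask */
        for (r = 0; r < NREPEATS; r++)
            for (i = 0; i < NPIXELS; i++) {
                unsigned char m = mask[i];
                dest[i] = (source[i] & m) + (dest[i] & (unsigned char)~m);
            }
        printf("mask    : %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        printf("check: %d\n", dest[NPIXELS / 2]);   /* keep the stores live */
        return 0;
    }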

    My guess is door number 1
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    The C-er
    Join Date
    Mar 2004
    Posts
    192
    That's my guess too. Since the ARM isn't superscalar (i.e. it has only one execution unit), branches shouldn't cause the pipeline to stall much, and IIRC the ARM deals with branches efficiently anyway. Most ARM instructions can also be executed conditionally, so a compiler can often turn a small if body into a predicated store with no branch at all.

  4. #4
    Salem
    and the hat of int overfl
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Or simply do "gcc -S prog.c" on each block of code and see how big it looks.

    Code:
    for (i = 0; i < all_pixels; i++){
      colour = source[i];
      if (colour != trans_colour){  dest[i] = colour;  }
    }
    Which is array_load + compare + array_store (perhaps).
    OK, let's say 1.5 array accesses per loop.

    Code:
    for (i = 0; i < all_pixels; i++){
      maskbyte = mask[i];
      dest[i] = (source[i] & maskbyte)+(dest[i] & ~maskbyte); // ~ is bitwise inverse
    }
    4 array accesses, 3 bit-wise ops, 2 assignments and an addition in a pear tree.
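    If anyone wants to actually try the -S route, a throwaway file along these lines will do (the function names and the 0x1F transparent colour are made up): compile with "gcc -O2 -S blit.c" and read the resulting blit.s.
    Code:
    /* blit.c : the two loops as stand-alone functions, for gcc -S inspection */

    #define TRANS 0x1F   /* made-up transparent colour */

    void blit_if(unsigned char *dest, const unsigned char *source, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            unsigned char colour = source[i];
            if (colour != TRANS)
                dest[i] = colour;
        }
    }

    void blit_mask(unsigned char *dest, const unsigned char *source,
                   const unsigned char *mask, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            unsigned char m = mask[i];
            dest[i] = (source[i] & m) + (dest[i] & (unsigned char)~m);
        }
    }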
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Of course, you can eliminate the array accesses by walking through with a couple of pointers. And I don't know about ARM, but on x86 bitwise ops are generally the fastest ones. For instance, zeroing a register with x ^= x is traditionally at least as fast as (and encodes smaller than) x = 0.
    If you understand what you're doing, you're not learning anything.

  6. #6
    Salem
    and the hat of int overfl
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > you can eliminate the array accesses by walking through with a couple of pointers.
    What do you think walking pointers achieve that arrays don't?
    Both boil down to computing an address and dereferencing that address.

    Examine the code produced by loops containing
    for ( i = 0 ; i < 10 ; i++ ) a[i] = i;
    for ( i = 0 ; i < 10 ; i++ ) *(a+i) = i;
    for ( p = a, i = 0 ; i < 10 ; i++, p++ ) *p = i;
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  7. #7
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    What do you think walking pointers achieve that arrays don't?
    But an increment is faster than an addition plus an assignment (at least as far as I've been told). And using a pointer saves you from multiple additions.

    For example, he originally had:
    Code:
    dest[i] = (source[i] & maskbyte)+(dest[i] & ~maskbyte);
    The dest references break down to:
    Code:
    *(dest+i) = (source[i] & maskbyte)+(*(dest+i) & ~maskbyte);
    With a pointer you would replace both additions with a single increment.
    Code:
    *destptr = (source[i] & maskbyte)+(*destptr & ~maskbyte);
    Clear now?

    EDIT: Come to think of it, you'd save some time on source also. With an array you have to increment i and then add i to source and then dereference, but with a pointer you just have the increment and the dereference, without the addition. So by using pointers you're saving 3 additions per loop. And, Salem, I'd hoped that by now you'd realized that I know how array decay works.
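    Put together, the whole inner loop with all three walking pointers would look something like this (just a sketch; the function name and parameter order are made up):
    Code:
    void blit_mask_ptr(unsigned char *dest, const unsigned char *source,
                       const unsigned char *mask, int all_pixels)
    {
        const unsigned char *end = source + all_pixels;

        while (source != end)
        {
            unsigned char mb = *mask++;   /* one increment per pointer, no index maths */
            *dest = (*source++ & mb) + (*dest & (unsigned char)~mb);
            ++dest;
        }
    }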
    Last edited by itsme86; 01-26-2005 at 12:25 PM.
    If you understand what you're doing, you're not learning anything.

  8. #8
    anonytmouse
    Yes, my avatar is stolen
    Join Date
    Dec 2002
    Posts
    2,544
    I did some fiddling with GCC.
    Code:
    // for ( i = 0 ; i < 10 ; i++ ) a[i] = 0;
    
        cmpl    $9, -12(%ebp)    // compare(9, i)
        jle     L18             // if <= goto L18
        jmp     L16             // goto L16
    L18:
        movl    -12(%ebp), %eax      // %eax = i
        movl    $0, -72(%ebp,%eax,4) // a[i] = 0
        leal    -12(%ebp), %eax      // %eax = &i
        incl    (%eax)               // i++
        jmp     L15                  // goto L15
    Code:
    // for ( p = a ; p != end ; p++ ) *p = 0;
    
        movl    -76(%ebp), %eax // %eax = p
        cmpl    -80(%ebp), %eax // compare(end, p)
        jne     L22             // if != goto L22
        jmp     L14             // goto L14
    L22:
        movl    -76(%ebp), %eax // %eax = p
        movl    $0, (%eax)      // *p = 0
        leal    -76(%ebp), %eax // %eax = &p
        addl    $4, (%eax)      // p++ (pointer arithmetic)
        jmp     L19             // goto L19
    Optimisation (-O2) makes a big difference.
    Code:
    // for ( i = 0 ; i < 10 ; i++ ) a[i] = 0;
    
    L28:
        movl    $0, -56(%ebp,%eax,4) // *(a + i) = 0
        incl    %eax                 // i++
        cmpl    $9, %eax             // compare(9, i)
        jle     L28                  // if <= goto L28
    Code:
    // for ( p = a ; p != end ; p++ ) *p = 0;
    
    L33:
        movl    $0, (%eax) // *p=0
        addl    $4, %eax   // p++
        cmpl    %edx, %eax // compare(end, p)
        jne     L33        // if != goto L33
    There is definitely an additional add in there for the index subscripting, but whether there is a difference in clock cycles, someone else will have to tell.

    And finally the original example:
    Code:
    void test3(void)
    {
        int i;
        unsigned char dest[10], source[10];
        unsigned char maskbyte = 5;
    
        for ( i = 0; i < 10 ; i++ )
        {
            dest[i] = (source[i] & maskbyte)+(dest[i] & ~maskbyte);
        }
    }
    Here is the optimised assembly:
    Code:
        movb    -24(%ebp,%ecx), %dl  // %dl = dest[i]
        movb    -40(%ebp,%ecx), %al  // %al = source[i]
    
        andl    $5, %eax             // %eax = %eax & maskbyte
                                     // %eax = source[i] & maskbyte
                                     // Note: %al is the low byte of %eax
    
        andl    $-6, %edx            // %edx = %edx & ~maskbyte
                                     // %edx = dest[i] & ~maskbyte
                                     // Note: %dl is the low byte of %edx
    
        addl    %edx, %eax           // %eax = %eax + %edx
                                     // (source[i] & maskbyte)+(dest[i] & ~maskbyte)
    
        movb    %al, -24(%ebp,%ecx)  // dest[i] = %al
                                     // Note: %al is the low byte of %eax
    
        incl    %ecx                 // i++
        cmpl    $9, %ecx             // compare(9, i)
        jle     L43                  // if <= goto L43
    I'm not sure if that settles any arguments, but it taught me a bit more about reading assembly.

  9. #9
    Salem
    and the hat of int overfl
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    For me, the main point is (and has been for a long while) that it's a complete waste of time trying to second-guess what the optimiser is going to do at any given -O level for any given platform (x86, ARM, whatever).

    Mostly, it's that if you write clear code which is easy to understand, then there is a good chance that the optimiser will see that as well and make a good job of optimising it.

    The reverse is some obfuscated mess based on 10-year-old hearsay "advice", which results in unreadable code and poor optimisation.

    Besides, there are usually way bigger performance fish to be caught than worrying about whether array vs. pointer access is better.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  10. #10
    Registered User
    Join Date
    Oct 2004
    Posts
    36
    Jeez, just a big bag full of happy fluffies, aren't you, Salem

    Thanks for all the help and debate, guys. I compile with the -O2 flag in gcc for ARM. I will compile the code and get back with some performance figures. I just thought it would be good to get some knowledgeable feedback in case anyone was familiar with the number of cycles taken by a branch instruction etc.

    BTW - Obviously my code was very cut down to give you an idea, and my actual code uses an incrementing pointer to step through memory. Don't take the 'example' code too literally; I didn't want to paste 100 lines where a simple example would suffice.
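    For illustration, the shape I mean for Method 1 is roughly this (a sketch only, not the real routine, and the names are made up):
    Code:
    void blit_if_ptr(unsigned char *dest, const unsigned char *source,
                     int all_pixels, unsigned char trans_colour)
    {
        const unsigned char *end = source + all_pixels;

        while (source != end)
        {
            unsigned char colour = *source++;
            if (colour != trans_colour)
                *dest = colour;
            ++dest;   /* advance whether or not we wrote the pixel */
        }
    }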

  11. #11
    .
    Join Date
    Nov 2003
    Posts
    307
    On HP-UX 11.0, HP's cc will actually 'lose' optimization levels on obfuscated code, as Salem indicates. "When in doubt, leave it out" is this compiler's motto.

    Clear simple code usually results in the compiler being more 'fearless' about optimization.
    The HP CC manual lists about 15 optimization pragmas, and discusses all of the strategies the compiler will employ if it thinks it needs to, in order to reach fully optimized code. The first paragraph in this section emphasizes the use of simple, clear code as the best starting point for the compiler to make these decisions.

  12. #12
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Quote Originally Posted by Salem
    For me, the main point is (and has been for a long while) that it's a complete waste of time trying to second-guess what the optimiser is going to do at any given -O level for any given platform (x86, ARM, whatever).

    Mostly, it's that if you write clear code which is easy to understand, then there is a good chance that the optimiser will see that as well and make a good job of optimising it.

    The reverse is some obfuscated mess based on 10-year-old hearsay "advice", which results in unreadable code and poor optimisation.

    Besides, there are usually way bigger performance fish to be caught than worrying about whether array vs. pointer access is better.
    Well then I guess you should have argued that point originally instead of trying to say that the arrays would degrade to the same thing, huh?
    If you understand what you're doing, you're not learning anything.
