# How many CPU cycles (ARM) for add vs if..then

• 01-26-2005
Pea
How many CPU cycles (ARM) for add vs if..then
Hi all,

Once again the ARM guy is back with another random question :)

I am implementing a blitting routine that supports transparency (ON/OFF only, not full levels). I am wondering if anybody knows which of the following techniques is faster in terms of actual cycles, because I have been told to stay away from conditional statements like the plague when it comes to optimisation, but I'm wondering how far to take it (how many cycles does a conditional take anyway?).

Method 1:
Before blitting a pixel, simply check whether the colour is the transparent colour and, if it is, don't blit it:
Code:

```
for (i = 0; i < all_pixels; i++) {
    colour = source[i];
    if (colour != trans_colour) {
        dest[i] = colour;
    }
}
```
Method 2:
I have a mask the same size as the image. Each 'pixel' in the mask is either fully off (00000000) or fully on (11111111). I combine the source and destination using the mask and the inverse mask (one is always zero) to get the following code:
Code:

```
for (i = 0; i < all_pixels; i++) {
    maskbyte = mask[i];
    dest[i] = (source[i] & maskbyte) + (dest[i] & ~maskbyte); // ~ is bitwise NOT
}
```
So that replaces a conditional with two ANDs, an addition and a bitwise NOT. Which is faster (i.e. fewer cycles)?

• 01-26-2005
Salem
Why don't you time it and find out?

Just put the code in a loop which runs say 500M times, and the one which takes less time is the winner.

My guess is door number 1
• 01-26-2005
Jez
That's my guess too. Since the ARM isn't super-scalar (i.e. has only one execution unit) branches shouldn't cause the pipeline to stall much. The ARM deals with branches efficiently IIRC.
• 01-26-2005
Salem
Or simply do "gcc -S prog.c" on each block of code and see how big it looks.

Code:

```
for (i = 0; i < all_pixels; i++) {
    colour = source[i];
    if (colour != trans_colour) {
        dest[i] = colour;
    }
}
```
Which is array_load + compare + array_store (perhaps)
OK, let's say 1.5 array accesses per loop.

Code:

```
for (i = 0; i < all_pixels; i++) {
    maskbyte = mask[i];
    dest[i] = (source[i] & maskbyte) + (dest[i] & ~maskbyte); // ~ is bitwise NOT
}
```
4 array accesses, 3 bit-wise ops, 2 assignments and an addition in a pear tree.
• 01-26-2005
itsme86
Of course, you can eliminate the array accesses by walking through with a couple of pointers. And I don't know about ARM, but in x86 bitwise ops are generally the fastest ones. For instance, x ^= x is faster than x = 0.
• 01-26-2005
Salem
> you can eliminate the array accesses by walking through with a couple of pointers.
What do you think walking pointers achieve that arrays don't?

Examine the code produced by loops containing
for ( i = 0 ; i < 10 ; i++ ) a[i] = i;
for ( i = 0 ; i < 10 ; i++ ) *(a+i) = i;
for ( p = a, i = 0 ; i < 10 ; i++, p++ ) *p = i;
• 01-26-2005
itsme86
Quote:

What do you think walking pointers achieve that arrays don't?
But incrementation is faster than addition plus assignment (at least as far as I've been told). And using a pointer saves you from multiple additions.

Code:

`dest[i] = (source[i] & maskbyte)+(dest[i] & ~maskbyte);`
The dest references break down to:
Code:

`*(dest+i) = (source[i] & maskbyte)+(*(dest+i) & ~maskbyte);`
With a pointer you would replace both additions with a single incrementation.
Code:

`*destptr = (source[i] & maskbyte)+(*destptr & ~maskbyte);`
Clear now?

EDIT: Come to think of it, you'd save some time on source also. With an array you have to increment i, then add i to source, then dereference; with a pointer you just have the increment and dereference, without the addition. So by using pointers you're saving 3 additions per loop. And, Salem, I'd hoped that by now you'd realized that I know how array decay works. :)
• 01-26-2005
anonytmouse
I did some fiddling with GCC.
Code:

```
// for ( i = 0 ; i < 10 ; i++ ) a[i] = 0;
    cmpl    $9, -12(%ebp)        // compare(9, i)
    jle     L18                  // if <= goto L18
    jmp     L16                  // goto L16
L18:
    movl    -12(%ebp), %eax      // %eax = i
    movl    $0, -72(%ebp,%eax,4) // a[i] = 0
    leal    -12(%ebp), %eax      // %eax = &i
    incl    (%eax)               // i++
    jmp     L15                  // goto L15
```
Code:

```
// for ( p = a ; p != end ; p++ ) *p = 0;
    movl    -76(%ebp), %eax      // %eax = p
    cmpl    -80(%ebp), %eax      // compare(end, p)
    jne     L22                  // if != goto L22
    jmp     L14                  // goto L14
L22:
    movl    -76(%ebp), %eax      // %eax = p
    movl    $0, (%eax)           // *p = 0
    leal    -76(%ebp), %eax      // %eax = &p
    addl    $4, (%eax)           // p++ (pointer arithmetic)
    jmp     L19                  // goto L19
```
Optimisation (-O2) makes a big difference.
Code:

```
// for ( i = 0 ; i < 10 ; i++ ) a[i] = 0;
L28:
    movl    $0, -56(%ebp,%eax,4) // *(a + i) = 0
    incl    %eax                 // i++
    cmpl    $9, %eax             // compare(9, i)
    jle     L28                  // if <= goto L28
```
Code:

```
// for ( p = a ; p != end ; p++ ) *p = 0;
L33:
    movl    $0, (%eax)           // *p = 0
    addl    $4, %eax             // p++
    cmpl    %edx, %eax           // compare(end, p)
    jne     L33                  // if != goto L33
```
There is definitely an additional add in there for the index subscripting, but whether there is a difference in clock cycles, someone else will have to tell.

And finally the original example:
Code:

```
void test3(void)
{
    int i;
    unsigned char dest[10], source[10];
    unsigned char maskbyte = 5;

    for (i = 0; i < 10; i++) {
        dest[i] = (source[i] & maskbyte) + (dest[i] & ~maskbyte);
    }
}
```
Here is the optimised assembly:
Code:

```
    movb    -24(%ebp,%ecx), %dl  // %dl = dest[i]
    movb    -40(%ebp,%ecx), %al  // %al = source[i]
    andl    $5, %eax             // %eax = %eax & maskbyte
                                 // %eax = source[i] & maskbyte
                                 // Note: %al is the low byte of %eax
    andl    $-6, %edx            // %edx = %edx & ~maskbyte
                                 // %edx = dest[i] & ~maskbyte
                                 // Note: %dl is the low byte of %edx
    addl    %edx, %eax           // %eax = %eax + %edx
                                 // (source[i] & maskbyte)+(dest[i] & ~maskbyte)
    movb    %al, -24(%ebp,%ecx)  // dest[i] = %al
                                 // Note: %al is the low byte of %eax
    incl    %ecx                 // i++
    cmpl    $9, %ecx             // compare(9, i)
    jle     L43                  // if <= goto L43
```
I'm not sure if that settles any arguments, but it taught me a bit more about reading assembly.
• 01-26-2005
Salem
For me, the main point is (and has been for a long while) that it's a complete waste of time trying to second-guess what the optimiser is going to do at any given -Olevel for any given platform (x86, ARM, whatever).

Mostly, it's that if you write clear code which is easy to understand, then there is a good chance that the optimiser will see that as well and make a good job of optimising.

The reverse is some obfuscated mess based on 10-year-old hearsay "advice" which results in an unreadable mess and poor optimisation results.

Besides, there are usually way bigger performance fish to be caught than worrying about whether array vs. pointer access is better.
• 01-26-2005
Pea
Jeez, just a big bag full of happy fluffies, aren't you Salem :)

Thanks for all the help and debate, guys. I compile with the -O2 flag in gcc for ARM. I will compile the code and get back with some performance figures. I just thought it would be good to get some knowledgeable feedback in case anyone was familiar with the number of cycles taken by a branch instruction etc.

BTW - Obviously my code was very cut down to give you an idea, and my actual code uses an incremental pointer to step through memory. Don't take the 'example' code too literally. I didn't want to paste 100 lines where a simple example would suffice :)
• 01-26-2005
jim mcnamara
On HPUX 11.0 HP's cc will actually 'lose' optimization levels on obfuscated code, as Salem indicates. When in doubt, leave it out - this compiler's motto.

Clear simple code usually results in the compiler being more 'fearless' about optimization.
The HP CC manual lists about 15 optimization pragmas, and discusses all of the strategies the compiler will employ if it thinks it needs to, in order to reach fully optimized code. The first paragraph in this section emphasizes the use of simple, clear code as the best starting point for the compiler to make these decisions.
• 01-26-2005
itsme86
Quote:

Originally Posted by Salem
For me, the main point is (and has been for a long while) that it's a complete waste of time trying to second-guess what the optimiser is going to do at any given -Olevel for any given platform (x86, ARM, whatever).

Mostly, it's that if you write clear code which is easy to understand, then there is a good chance that the optimiser will see that as well and make a good job of optimising.

The reverse is some obfuscated mess based on 10-year-old hearsay "advice" which results in an unreadable mess and poor optimisation results.

Besides, there are usually way bigger performance fish to be caught than worrying about whether array vs. pointer access is better.

Well then I guess you should have argued that point originally instead of trying to say that the arrays would degrade to the same thing, huh? ;)