1. How many CPU cycles (ARM) for add vs if..then

Hi all,

Once again the ARM guy is back with another random question

I am implementing a blitting routine that supports transparency (ON/OFF only, not full alpha levels). I am wondering if anybody knows which of the following techniques is faster in terms of actual cycles, because I have been told to stay away from conditional statements like the plague when it comes to optimisation, but I'm wondering how far to take it (how many cycles does a conditional take anyway??).

Method 1:
Before blitting a pixel, simply check whether the colour is the transparent colour, and if it is, don't blit it:
Code:
```for (i = 0; i < all_pixels; i++){
    colour = source[i];
    if (colour != trans_colour){ dest[i] = colour; }
}```
Method 2:
I have a mask the same size as the image. Each 'pixel' in the mask is either fully off (00000000) or fully on (11111111). I combine the source and destination using the mask and the inverse mask (one is always zero) to get the following code:
Code:
```for (i = 0; i < all_pixels; i++){
    dest[i] = (source[i] & maskbyte) + (dest[i] & ~maskbyte);
}```
So this replaces a conditional with two ANDs, an addition and a bitwise inverse. Which is faster (i.e. fewer cycles)?

2. Why don't you time it and find out?

Just put the code in a loop which runs say 500M times, and the one which takes less time is the winner.

My guess is door number 1

3. That's my guess too. Since the ARM isn't super-scalar (i.e. has only one execution unit) branches shouldn't cause the pipeline to stall much. The ARM deals with branches efficiently IIRC.
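One ARM detail worth adding: almost every classic ARM instruction can be conditionally executed, so the compiler can often eliminate the Method 1 branch altogether. A sketch of the inner step (the assembly in the comment is illustrative, not generated output, and the register choice is hypothetical):

```c
#include <assert.h>

/* On classic ARM a compiler can emit this whole function body as
 *     CMP    r0, r2      ; compare colour with trans_colour
 *     STRNEB r0, [r1]    ; conditional byte store -- no branch at all
 * (pre-UAL syntax; UAL would write STRBNE)
 */
void put_pixel(unsigned char *dst, unsigned char colour,
               unsigned char trans_colour)
{
    if (colour != trans_colour)
        *dst = colour;
}
```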

4. Or simply do "gcc -S prog.c" on each block of code and see how big it looks.

Code:
```for (i = 0; i < all_pixels; i++){
    colour = source[i];
    if (colour != trans_colour){ dest[i] = colour; }
}```
Which is array_load + compare + array_store (perhaps)
OK, let's say 1.5 array accesses per loop.

Code:
```for (i = 0; i < all_pixels; i++){
    dest[i] = (source[i] & maskbyte) + (dest[i] & ~maskbyte);
}```
4 array accesses, 3 bit-wise ops, 2 assignments and an addition in a pear tree.

5. Of course, you can eliminate the array accesses by walking through with a couple of pointers. And I don't know about ARM, but in x86 bitwise ops are generally the fastest ones. For instance, x ^= x is faster than x = 0.

6. > you can eliminate the array accesses by walking through with a couple of pointers.
What do you think walking pointers achieve that arrays don't?

Examine the code produced by loops containing
Code:
```for ( i = 0 ; i < 10 ; i++ ) a[i] = i;
for ( i = 0 ; i < 10 ; i++ ) *(a+i) = i;
for ( p = a, i = 0 ; i < 10 ; i++, p++ ) *p = i;```

7. > What do you think walking pointers achieve that arrays don't?
But incrementation is faster than addition plus assignment (at least as far as I've been told). And using a pointer saves you from multiple additions.

Code:
`dest[i] = (source[i] & maskbyte)+(dest[i] & ~maskbyte);`
The dest references break down to:
Code:
`*(dest+i) = (source[i] & maskbyte)+(*(dest+i) & ~maskbyte);`
With a pointer you would replace both additions with a single incrementation.
Code:
`*destptr = (source[i] & maskbyte)+(*destptr & ~maskbyte);`
Clear now?

EDIT: Come to think of it, you'd save some time on source also. With an array you have to increment i and then you also have to add i to source and then dereference, but with a pointer you just have the incrementation and dereference without the addition. So by using pointers you're saving 3 additions per loop. And, Salem, I'd hoped that by now you'd realized that I know how array degradation works.

8. I did some fiddling with GCC.
Code:
```// for ( i = 0 ; i < 10 ; i++ ) a[i] = 0;

cmpl    $9, -12(%ebp)    // compare(9, i)
jle     L18             // if <= goto L18
jmp     L16             // goto L16
L18:
movl    -12(%ebp), %eax      // %eax = i
movl    $0, -72(%ebp,%eax,4) // a[i] = 0
leal    -12(%ebp), %eax      // %eax = &i
incl    (%eax)               // i++
jmp     L15                  // goto L15```
Code:
```// for ( p = a ; p != end ; p++ ) *p = 0;

movl    -76(%ebp), %eax // %eax = p
cmpl    -80(%ebp), %eax // compare(end, p)
jne     L22             // if != goto L22
jmp     L14             // goto L14
L22:
movl    -76(%ebp), %eax // %eax = p
movl    $0, (%eax)      // *p = 0
leal    -76(%ebp), %eax // %eax = &p
addl    $4, (%eax)      // p++ (pointer arithmetic)
jmp     L19             // goto L19```
Optimisation (-O2) makes a big difference.
Code:
```// for ( i = 0 ; i < 10 ; i++ ) a[i] = 0;

L28:
movl    $0, -56(%ebp,%eax,4) // *(a + i) = 0
incl    %eax                 // i++
cmpl    $9, %eax             // compare(9, i)
jle     L28                  // if <= goto L28```
Code:
```// for ( p = a ; p != end ; p++ ) *p = 0;

L33:
movl    $0, (%eax) // *p=0
cmpl    %edx, %eax // compare(end, p)
jne     L33        // if != goto L33```
There is definitely an additional add in there for the index subscripting, but whether there is a difference in clock cycles, someone else will have to tell.

And finally the original example:
Code:
```void test3(void)
{
int i;
unsigned char dest[10], source[10];
const unsigned char maskbyte = 5;  /* the constant the assembly below uses */

for ( i = 0; i < 10 ; i++ )
{
dest[i] = (source[i] & maskbyte) + (dest[i] & ~maskbyte);
}
}```
Here is the optimised assembly:
Code:
```    movb    -24(%ebp,%ecx), %dl  // %dl = dest[i]
movb    -40(%ebp,%ecx), %al  // %al = source[i]

andl    $5, %eax             // %eax = %eax & maskbyte
// %eax = source[i] & maskbyte
// Note: %al is the low byte of %eax

andl    $-6, %edx            // %edx = %edx & ~maskbyte
// %edx = dest[i] & ~maskbyte
// Note: %dl is the low byte of %edx

addl    %edx, %eax           // %eax = %eax + %edx

movb    %al, -24(%ebp,%ecx)  // dest[i] = %al
// Note: %al is the low byte of %eax

incl    %ecx                 // i++
cmpl    $9, %ecx             // compare(9, i)
jle     L43                  // if <= goto L43```
I'm not sure if that settles any arguments, but it taught me a bit more about reading assembly.

9. For me, the main point is (and has been for a long while) that it's a complete waste of time trying to second-guess what the optimiser is going to do at any given -Olevel for any given platform (x86, ARM, whatever).

Mostly, it's that if you write clear code which is easy to understand, then there is a good chance that the optimiser will see that as well and make a good job of optimising.

The reverse is some obfuscated mess based on 10-year-old hearsay "advice" which results in an unreadable mess and poor optimisation results.

Besides, there are usually way bigger performance fish to be caught than worrying about whether array vs. pointer access is better.

10. Jeez, just a big bag full of happy fluffies aren't you, Salem

Thanks for all the help and debate guys. I compile with the -O2 flag in gcc for ARM. I will compile the code and get back with some performance figures. Just thought it would be good to get some knowledgeable feedback in case anyone was familiar with the number of cycles taken by a branching instruction etc.

BTW - Obviously my code was very cut down to give you an idea, and my actual code uses an incremental pointer to step through memory. Don't take the 'example' code too literally. I didn't want to paste 100 lines where a simple example would suffice

11. On HPUX 11.0 HP's cc will actually 'lose' optimization levels on obfuscated code, as Salem indicates. When in doubt, leave it out - this compiler's motto.

Clear simple code usually results in the compiler being more 'fearless' about optimization.
The HP CC manual lists about 15 optimization pragmas, and discusses all of the strategies the compiler will employ if it thinks it needs to, in order to reach fully optimized code. The first paragraph in this section emphasizes the use of simple, clear code as the best starting point for the compiler to make these decisions.

12. Originally Posted by Salem
For me, the main point is (and has been for a long while) that it's a complete waste of time trying to second-guess what the optimiser is going to do at any given -Olevel for any given platform (x86, ARM, whatever).

Mostly, it's that if you write clear code which is easy to understand, then there is a good chance that the optimiser will see that as well and make a good job of optimising.

The reverse is some obfuscated mess based on 10-year-old hearsay "advice" which results in an unreadable mess and poor optimisation results.

Besides, there are usually way bigger performance fish to be caught than worrying about whether array vs. pointer access is better.
Well then I guess you should have argued that point originally instead of trying to say that the arrays would degrade to the same thing, huh?