1. ## Different ways to add

I made a function to add two numbers together using bit manipulation. I started off writing the function add_a, then made a modified version add_b.

add_a seems to be 10-15&#37; faster than add_b. The thing is is that in add_b I removed a variable, and a reset and a copy from in the loop. So why does it appear slower?
Code:
```#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <time.h>
#define Uint8 unsigned char

Uint8 add_a(Uint8 a, Uint8 b)
{
static Uint8 i, num = 0;
static Uint8 result = 0;
static Uint8 carry = 0;
static Uint8 bit = 1;

for(i=0; i<CHAR_BIT; i++)
{
num = 0;
if(a & bit) num++;
if(b & bit) num++;
if(carry)   num++;
if(num & 1) result |= bit;
carry = (num & 2);
bit <<= 1;
}

return result;
}

Uint8 add_b(Uint8 a, Uint8 b)
{
static Uint8 i, num = 0;
static Uint8 result = 0;
static Uint8 bit = 1;

for(i=0; i<CHAR_BIT; i++)
{
if(a & bit)num++;
if(b & bit)num++;
if(num & 1) result |= bit;
num = (num & 2) ? 1: 0;
bit <<= 1;
}
return result;
}

Uint8 add_c(Uint8 a, Uint8 b)
{
static Uint8 i, num = 0;
static Uint8 result = 0;
static Uint8 bit = 1;

for(i=0; i<CHAR_BIT; i++)
{
if(a & bit)
if(b & bit)
num += 2;
else
num += 1;
else
if(b & bit)
num += 1;

if(num & 1) result |= bit;
num = (num & 2) ? 1: 0;
bit <<= 1;
}
return result;
}

void test(int x, int y)
{
unsigned start = clock(), i;
for(i=0; i<50000000; i++) add_a(x, y);
printf("A: <Time %u> <Result %u>\n", clock()-start, add_a(x, y));
start = clock();
for(i=0; i<50000000; i++) add_b(x, y);
printf("B: <Time %u> <Result %u>\n", clock()-start, add_b(x, y));
start = clock();
for(i=0; i<50000000; i++) add_c(x, y);
printf("C: <Time %u> <Result %u>\n", clock()-start, add_c(x, y));
}

int main()
{
Uint8 a = 102;
Uint8 b = 57;
test(a, b);
getchar();
}```

2. probably branch prediction on if(carry) was better when on (num & 2) ? 1: 0;

to make more exsact decision you need to use tool like VTune of Intel

3. Try removing those "static" and see if that makes any difference. It should help the compiler optimize those loops a bit further because there is no need to STORE the old value. [If you want carry to carry through to the next level, then you need that static].

And try this
Code:
`         result |= (num >> i) & 1;`
This may be better too:
Code:
```// instead of:  num = (num & 2) ? 1: 0;
num = (num >> 1) & 1;```

--
Mats

4. Just out of curiosity, what's wrong with plain old: a + b?

5. vart: Yeah I guess my testing method dosent say a whole lot. I had a look at VTune. I got an AMD tho.

matsp: Removing the static declaration dosent seem to make much difference. In fact I only added the 'static' bit at the end as it made the function much faster when repeatedly called. I don't know if thats a good thing tho. I don't want to store the value; it just saves time allocating space for the variables (or thats the impression I get). Also, shifting then &ing dident seem to make much difference either, but thanks for the suggestion I hadent thought of that.

cpjust: I was planning on using it (or something like it) for adding big numbers which would be made out of a string of bytes.

6. On my AMD processor, the B version was the fastest by about 10% - but it probably depends on which compiler and which exact model of processor you have. This was using Visual Studio .Net.

I noticed when I was trying it out, that the static seems to help performance. I don't really understand why tho'. But I didn't spend much time studying the generated assembly code, just trying some random changes out. [I managed to get it to run in "zero" time, but that was just a case of the optimizer completely removing the function].

I would just go with one of your versions for now, if you are planning to make it a bigger thing - you may find that the best optimization is to do something "further up the chain" to make bigger benefits, rather than micro-optimizing this function "to death".

--
Mats

7. Yeah, I guess its hardware/compiler specific. I'm using GCC. Also I had a go at using declaring variables as 'register' instead of 'static'. (Something I found in the bowels of the GNU C language extensions). Bizzarely it makes it a whole lot slower than normal variables even when I am only requesting one register.

I think you are probably right in that theres little point worrying about small speed differences here. I think i may go with version a as returning the carry would be handy, although if its static its not like it really matters too much anyway.

Cheers.

8. Also be aware that using static may prevent the function from inlining, so if that's a goal later on, using the "slower" version with non-static variables may turn out to be faster in the long term.

Using register is likely to "confuse" the compilers optimizer. It was a good thing in the past when the compilers weren't very good at figuring out which register can do what. Nowadays, the optimizer is most likely better than us to figure out which register to use for what and when [also, you assigning a variable register may actually prevent ANOTHER variable or temporary result that is more frequently used from using a register, which of course is not what you wanted].

By the way, which verison of gcc is this?

For an alternative to VTune, try AMD's CodeAnalyst, or oprofile if you are on Linux. It's a while since I last used it, but CA is able to tell you exactly how long each instruction is taking.

--
Mats

9. Yeah, it seems better to forget about the register thing. Tbh I'm not sure exactly which version gcc i'm using. I think its an old one, like 3.4.2 or something. IE: the version that came with Dev C++. Never bothered updating it.

Just tried downloading CodeAnalyst from AMD's site, but it gave me some XML parsing error thing. Still might see if I can get hold of it elsewhere. Might be fun to have a play around with.

10. I can't download it either - I'm reporting to the support page on the CA page.

gcc 3.4.2 may not have the optimization to replace branches with conditional execution code, which could slow things down a bit in this type of code.

--
Mats

11. Hmm. Maybe I should update it. I have gcc 4.2.2 on my linux computer, I'll test it on that later see if it makes a difference.

Cheers.

12. I really wouldn't worry about it right now - I'm not sure there is a mingw version that works well from gcc 4.x anyways. But if it's easy to check, why not...

--
Mats

Popular pages Recent additions