Speeding up addition in an array

**tomservo** · 08-20-2010

Not sure that subject does this justice, but here goes:

I've got an array of bytes, and I want to subtract the same number from all of the values in this array.

So, what I want to do to speed this process up is to do multiple bytes at a time by loading them into a unsigned int, for example, so I can (in this case) work on 4 bytes at a time.

I'm attempting to do this via bitwise addition (on ~n1+1) manipulation ie: (n1^n2)^((n1&n2)<<1) -- yes, I know this isn't the exact value...my routine works correctly...just trying to state what I am doing.

When I subtract, I'm using (value*256*256*256)+(value*256*256)+(value*256)+va lue as the number to subtract from the four byte block.

Anyway...the problem (of course) is that there are bits carried from one byte to the next.

Is there any easy way to get around this...or should I just go byte by byte and subtract the value.

Thanks!

**C_ntua** · 08-20-2010

Let me give my comments before actually answering the question:
Frankly, what is fast and slow depends on the CPU. The CPU will have some Arithmetic Logic Units that can do subtractions, multiplications etc etc.

Now, it depends on the minimum unit the ALU supports. If it supports a 32 bit unit only then if you pass it a byte it will simply fill up the rest of the bytes, for example.

So, assuming that the ALU supports 32bit (an int lets say) you want to take advantage of this when you subtract and pass 4 bytes. Fine until now. The tricky part is this: is the an additional cost to do so? If there is then you are gaining some time, but losing some other. Typically you want gain anything in the end. Might be slower. Will make the compiler optimization process more difficult and any other kind of optimization as well.

What I mean is this.

Code:

char value[4];
....
int optValue = (int)value[3] << 12 | (int)value[2] << 8 | (int)value[1] << 4 | (int)value[0];

this is nice, but the bit shifts, the dereference, the casting all this might take more time than the actual subtraction.

Of course you wouldn't do the above, you would do

Code:

char value[4];
int* optValue = (int*)value;

But what about when the number you subtract from this?
If you do as before

Code:

char value;
....
int optValue = (int)value << 12 | (int)value << 8 | (int)value << 4 | (int)value;

Then you will again lose cycles and it might really not be worth it.
The best way would be to hardcode all values from 0 to 255. So

Code:

#define B_255 0xFFFFFFFF
#define B_254 0xFEFEFEFE
....

then subtract normally.
To use them you should do

Code:

char* value = (char*)optValue;
// value[2] would be the second byte

which I guess you already do.

Note that endianess might get into play and ruin your attempt. This of course can be solved.

Now the actual question:
There is partial way around this. Use 2 bytes per integer. Then you will have
#junk#value1#junk#value2
So you have 8 bits of junk that will have the value of the carriers, which you don't care about.

**nonoob** · 08-20-2010

If you're willing to switch into assembler you might be able to use the MMX instruction set. The PADDB instruction will do 64 bits (8 bytes) or 128 bits (16 bytes) at a time and the bytes don't spill over to their neighbors.

**Adak** · 08-20-2010

Before you look into these kind of optimizations, look VERY carefully at your overall algorithm. THAT's where the BIG improvements might be waiting to be discovered.

Then get a good test set of data, that fairly represents your typical data, and get a base time for that run, just using standard C addition and subtraction. A run through a profiler would be great to have at this time.

Now, start testing your optimizations, one at time, on that same test set. Take the functions you believe you can optimize, in descending order of the time they use, in the profiler.

There's no sense in spending hours to save, at most, a few hundredths of a second, at most.

If you're doing all the above already, great.

**iMalc** · 08-20-2010

Indeed. This problem is extremely well suited to MMX. It's practically exactly what it was made for.
I've done several routines like this in MMX and SSE2. What's the code you have now and what compiler is this for?

**C_ntua** · 08-20-2010

Originally Posted by nonoob

If you're willing to switch into assembler you might be able to use the MMX instruction set. The PADDB instruction will do 64 bits (8 bytes) or 128 bits (16 bytes) at a time and the bytes don't spill over to their neighbors.

Assuming your CPU supports them.

This might or might not give improved performance. You are generally competing with the compiler when you do so. Compilers have to try and use all the features of a CPU, try to send the commands in a specific order to utilize things like pipelines or any parallel process logic. That doesn't mean of course that using assembly won't give you a boost performance, since you don't have a command like PADDB in C. If the compiler will use this command for you is unknown.

You can try it out though. Use assembly code in your program or take the assembly code that is created from your compiler and change the section that adds the bytes.

I would bet that you would actually gain some speedup since this command suits perfectly what you want to do.

**Subsonics** · 08-20-2010

After optimizing the algorithm, don't forget the optimize flags of your compiler. You might find that what comes out that is hard to beat. Other than that you could try the -S flag (assuming GCC) to inspect the assembly generated by the compiler, perhaps you can spot things you can optimize further.

**tomservo** · 08-20-2010

I'm going to look at using some inline assembly for this (thanks nonoob). A learning process for me as I've never done something like this before, but that's a good thing because even at 44, it's always a good thing to learn something new.

It's been a long time since I've worked in assembly, and never worked with MMX (which I have available), so I didn't know something like PADDB existed.

PADDB looks to be the trick and being able to do 8 or even 16 bytes at a time would speed things up greatly, I *think*. I'll keep you posted on my progress.

This is just a pet project for me, not something for work or anything, so spending some extra time on it isn't a problem.

**tomservo** · 08-20-2010

I've got it coded...took a little while doing research on google for the correct syntax for inline asm, and for the mmx commands...

Now I just need to get some input so I can test it for speed. Will report back.

**tomservo** · 08-20-2010

Oh yes...*MUCH* faster...I'm just doing 64-bit adds instead of 128-bit...it's almost twice as fast. Very awesome...thanks again nonoob!! I wouldn't have ever thought of that.

Glad I joined here...hopefully I can get other good ideas and lend some help along the way as well.

**nonoob** · 08-24-2010

Thanks, tomservo, for keeping us up to date. I'd like to know what it's for... but that's outside the scope of this forum. Send me a message if you want. Anyway, I'm glad it worked out.

Thread: Speeding up addition in an array

Thread Tools

Search Thread

Display

Speeding up addition in an array

Similar Threads

from 2D array to 1D array

Help with Searching Array for Longest Sequence of an Element

Unknown Memory Leak in Init() Function

Quick question about SIGSEGV

Array Program