Thread: Speeding up addition in an array

  1. #1
    Registered User
    Join Date
    Aug 2010
    Posts
    4

    Speeding up addition in an array

    Not sure that subject does this justice, but here goes:

    I've got an array of bytes, and I want to subtract the same number from all of the values in this array.

    So, what I want to do to speed this process up is to do multiple bytes at a time by loading them into a unsigned int, for example, so I can (in this case) work on 4 bytes at a time.

    I'm attempting to do this via bitwise addition (on ~n1+1) manipulation ie: (n1^n2)^((n1&n2)<<1) -- yes, I know this isn't the exact value...my routine works correctly...just trying to state what I am doing.

    When I subtract, I'm using (value*256*256*256)+(value*256*256)+(value*256)+va lue as the number to subtract from the four byte block.

    Anyway...the problem (of course) is that there are bits carried from one byte to the next.

    Is there any easy way to get around this...or should I just go byte by byte and subtract the value.

    Thanks!

  2. #2
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Let me give my comments before actually answering the question:
    Frankly, what is fast and slow depends on the CPU. The CPU will have some Arithmetic Logic Units that can do subtractions, multiplications etc etc.

    Now, it depends on the minimum unit the ALU supports. If it supports a 32 bit unit only then if you pass it a byte it will simply fill up the rest of the bytes, for example.

    So, assuming that the ALU supports 32bit (an int lets say) you want to take advantage of this when you subtract and pass 4 bytes. Fine until now. The tricky part is this: is the an additional cost to do so? If there is then you are gaining some time, but losing some other. Typically you want gain anything in the end. Might be slower. Will make the compiler optimization process more difficult and any other kind of optimization as well.

    What I mean is this.
    Code:
    char value[4];
    ....
    int optValue = (int)value[3] << 12 | (int)value[2] << 8 | (int)value[1] << 4 | (int)value[0];
    this is nice, but the bit shifts, the dereference, the casting all this might take more time than the actual subtraction.

    Of course you wouldn't do the above, you would do
    Code:
    char value[4];
    int* optValue = (int*)value;
    But what about when the number you subtract from this?
    If you do as before
    Code:
    char value;
    ....
    int optValue = (int)value << 12 | (int)value << 8 | (int)value << 4 | (int)value;
    Then you will again lose cycles and it might really not be worth it.
    The best way would be to hardcode all values from 0 to 255. So
    Code:
    #define B_255 0xFFFFFFFF
    #define B_254 0xFEFEFEFE
    ....
    then subtract normally.
    To use them you should do
    Code:
    char* value = (char*)optValue;
    // value[2] would be the second byte
    which I guess you already do.

    Note that endianess might get into play and ruin your attempt. This of course can be solved.

    Now the actual question:
    There is partial way around this. Use 2 bytes per integer. Then you will have
    #junk#value1#junk#value2
    So you have 8 bits of junk that will have the value of the carriers, which you don't care about.

  3. #3
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    If you're willing to switch into assembler you might be able to use the MMX instruction set. The PADDB instruction will do 64 bits (8 bytes) or 128 bits (16 bytes) at a time and the bytes don't spill over to their neighbors.

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Before you look into these kind of optimizations, look VERY carefully at your overall algorithm. THAT's where the BIG improvements might be waiting to be discovered.

    Then get a good test set of data, that fairly represents your typical data, and get a base time for that run, just using standard C addition and subtraction. A run through a profiler would be great to have at this time.

    Now, start testing your optimizations, one at time, on that same test set. Take the functions you believe you can optimize, in descending order of the time they use, in the profiler.

    There's no sense in spending hours to save, at most, a few hundredths of a second, at most.

    If you're doing all the above already, great.

  5. #5
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Indeed. This problem is extremely well suited to MMX. It's practically exactly what it was made for.
    I've done several routines like this in MMX and SSE2. What's the code you have now and what compiler is this for?
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  6. #6
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Quote Originally Posted by nonoob View Post
    If you're willing to switch into assembler you might be able to use the MMX instruction set. The PADDB instruction will do 64 bits (8 bytes) or 128 bits (16 bytes) at a time and the bytes don't spill over to their neighbors.
    Assuming your CPU supports them.

    This might or might not give improved performance. You are generally competing with the compiler when you do so. Compilers have to try and use all the features of a CPU, try to send the commands in a specific order to utilize things like pipelines or any parallel process logic. That doesn't mean of course that using assembly won't give you a boost performance, since you don't have a command like PADDB in C. If the compiler will use this command for you is unknown.

    You can try it out though. Use assembly code in your program or take the assembly code that is created from your compiler and change the section that adds the bytes.

    I would bet that you would actually gain some speedup since this command suits perfectly what you want to do.

  7. #7
    Registered User
    Join Date
    Jan 2009
    Posts
    1,485
    After optimizing the algorithm, don't forget the optimize flags of your compiler. You might find that what comes out that is hard to beat. Other than that you could try the -S flag (assuming GCC) to inspect the assembly generated by the compiler, perhaps you can spot things you can optimize further.
    Last edited by Subsonics; 08-20-2010 at 06:01 PM.

  8. #8
    Registered User
    Join Date
    Aug 2010
    Posts
    4
    I'm going to look at using some inline assembly for this (thanks nonoob). A learning process for me as I've never done something like this before, but that's a good thing because even at 44, it's always a good thing to learn something new. It's been a long time since I've worked in assembly, and never worked with MMX (which I have available), so I didn't know something like PADDB existed.

    PADDB looks to be the trick and being able to do 8 or even 16 bytes at a time would speed things up greatly, I *think*. I'll keep you posted on my progress.

    This is just a pet project for me, not something for work or anything, so spending some extra time on it isn't a problem.

  9. #9
    Registered User
    Join Date
    Aug 2010
    Posts
    4
    I've got it coded...took a little while doing research on google for the correct syntax for inline asm, and for the mmx commands...

    Now I just need to get some input so I can test it for speed. Will report back.

  10. #10
    Registered User
    Join Date
    Aug 2010
    Posts
    4
    Oh yes...*MUCH* faster...I'm just doing 64-bit adds instead of 128-bit...it's almost twice as fast. Very awesome...thanks again nonoob!! I wouldn't have ever thought of that.

    Glad I joined here...hopefully I can get other good ideas and lend some help along the way as well.

  11. #11
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Thanks, tomservo, for keeping us up to date. I'd like to know what it's for... but that's outside the scope of this forum. Send me a message if you want. Anyway, I'm glad it worked out.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. from 2D array to 1D array
    By cfdprogrammer in forum C Programming
    Replies: 17
    Last Post: 03-24-2009, 10:33 AM
  2. Replies: 7
    Last Post: 11-25-2008, 01:50 AM
  3. Unknown Memory Leak in Init() Function
    By CodeHacker in forum Windows Programming
    Replies: 3
    Last Post: 07-09-2004, 09:54 AM
  4. Quick question about SIGSEGV
    By Cikotic in forum C Programming
    Replies: 30
    Last Post: 07-01-2004, 07:48 PM
  5. Array Program
    By emmx in forum C Programming
    Replies: 3
    Last Post: 08-31-2003, 12:44 AM