Thread: Fast(est) way to manipulate memory

  1. #1
    Registered User
    Join Date
    Oct 2007
    Posts
    166

    Fast(est) way to manipulate memory

    Hi, I'm trying to get updating the pixels of an image as fast as possible. I'm wondering if there is some C++ magic I can apply to make it faster. Or some other kind of magic...

    Right now I'm basically doing like below. The actual loop I use has a more calculations in it.
    Code:
    for (int y=0;y<height;y++)
    {
        int yval = width * y;
        for (int x=0;x<width;x++)
        {
            COLOR* color1 = &colors1[yval + x];
            COLOR* color2 = &colors2[yval + x];
            color1->r = color2->r;
            color1->g = color2->g;
            color1->b = color2->b;
        }
    }
    I'm looking for something different to make this blazing fast, perhaps something lower level. Do you have any suggestions?

    I tested switching to using pointer arithmetic for the color array access but that didn't make much difference.

    Right now it takes about 15 milliseconds to update a 2048x2048 image including a few calculations in my loop. It is not a whole lot but if I need to do it several times in one code cycle which I often do then one quickly approaches 1/10th of a second where lag can start to be noticable.
    Last edited by DrSnuggles; 03-13-2011 at 04:44 PM.

  2. #2
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    That depends on what COLOR is.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  3. #3
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    if COLOR is a simple struct with r, g, b members that are POD types, then the code you posted will likely produce performance that is virtually indistinguishable from other methods (inline asm, etc). if you enable optimizations in the compiler, it will take the code you wrote, which is pretty straightforward, and already optimized pretty well, and generate really fast machine code

  4. #4
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by Elkvis View Post
    if COLOR is a simple struct with r, g, b members that are POD types, then the code you posted will likely produce performance that is virtually indistinguishable from other methods (inline asm, etc).
    If COLOR is a POD type, and it's only members are r,g,b (which must also be POD types, by definition) then the inner loop may be replaced by a single call to memcpy() - which will almost certainly not be outperformed by the loop (as a number of modern compilers inline memcpy() calls) and the code would be simpler.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  5. #5
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Unless the above code takes the same amount of time to execute as your real code, then you need help optimising the whole thing. However unfortunately you've possibly left out the bits where we can make the most gains.
    You need to post the whole relevant section as it appears in your actual code, and we need everything to compile it ourselves, e.g. the COLOR definition etc.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  6. #6
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    How did you declare colors1 and colors2?

    If they're actual 2D arrays that are in scope, then simplify your code to
    Code:
    for (int y=0;y<height;y++)
    {
        for (int x=0;x<width;x++)
        {
            colors1[y][x].r = colors2[y][x].r;
            colors1[y][x].g = colors2[y][x].g;
            colors1[y][x].b = colors2[y][x].b;
        }
    }
    The base address of an array in scope arrives as part of the instruction fetch. The compiler also knows it is a const address, so it has more scope for sub-expression optimisation.
    Using a pointer means a) one more memory cycle to dereference the pointer, and b) less scope for optimisation of const sub-expressions.
    Setting a pointer to the start of an array is one of those 1980's pseudo-hacks that somehow refuses to die off (much like the xor swap nonsence).

    If all you have is a pointer, then you may as well make use of the fact that the whole block of memory is contiguous.
    Code:
    COLOR* color1 = colors1;
    COLOR* color2 = colors2;
    for (int y=0;y<height;y++)
    {
        for (int x=0;x<width;x++,color1++,color2++)
        {
            color1->r = color2->r;
            color1->g = color2->g;
            color1->b = color2->b;
        }
    }
    If there is nothing else (apart from your yval calculation), you can flatten this into a single loop running from 0 to width*height.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  7. #7
    Registered User
    Join Date
    Oct 2007
    Posts
    166
    Thanks for the replies. Yes I should have included what the COLOR class is, it is a very basic 4 byte structure and looks like this:
    Code:
    class COLOR
    {
    public:
    	BYTE b;
    	BYTE g;
    	BYTE r;
    	BYTE a;
    	COLOR() {r = g = b = a = 0;}
    	COLOR(int rr, int gg, int bb, int aa)
    	{
    		r = rr;
    		g = gg;
    		b = bb;
    		a = aa;
    	}
    };
    I create the color buffers in one contiguous block of memory like this:
    Code:
    colors1 = (COLOR *)malloc(4 * (width*height));
    The loops cannot be collapsed into one because one of the buffers (Direct3D texture) is flipped in the y axis and I usually only update part of the whole texture.

    I'll provide a more complete example of what I'm doing, I'm blending different layers into a Direct3D texture. I have set it to use pointer arithmetic now:
    Code:
    for (int i=0;i<numLayers;i++)
    {
            //Setup info about layer rect and properties here
    
            for (int y=startY;y<endY;y++)
           {
    	        int yval = layerWidth * (y - rectTop);
    	        int yvalD3D = width * (heightMinusOne - y);
    	        COLOR* col = layerColors + (yval + (startX - rectLeft));
    	        COLOR* colD3D = colors + (yvalD3D + startX);
    
    	        if (i) //Not background layer
    	        {
    		         int blend;
    		         COLOR24 blendedCol;
    		         BYTE* source;
    		         BYTE* dest;
    		         for (int x=startX;x<endX;x++, col++, colD3D++)
    		         {
    			        if (!col->a) continue;
    			        blend = col->a * opacity;
    
    			        if (blendMode) blendedCol = BlendColors(colD3D,col,blendMode);
    			        else blendedCol = COLOR24(col->r,col->g,col->b);
    
    			        if (blend == 255)
    			        {
    				        colD3D->r = blendedCol.r;
    				        colD3D->g = blendedCol.g;
    				        colD3D->b = blendedCol.b;
    			        }
    			        else
    			        {
    				        source = blendingTable[blend].row;
    				        dest = blendingTable[255 - blend].row;
    				        colD3D->r = source[blendedCol.r] + dest[colD3D->r];
    				        colD3D->g = source[blendedCol.g] + dest[colD3D->g];
    				        colD3D->b = source[blendedCol.b] + dest[colD3D->b];
    			        }
    		        }
    	        }
    	        else //Background layer
    	        {
    		         for (int x=startX;x<endX;x++, col++, colD3D++)
    		        {
    			        *colD3D = *col;
    		        }
    	        }
            }
    }
    The test time I posted is simply for updating the background layer though so as you can see there are not many other calculations in the loops then.

    I'm not sure if things can be optimized, would using assembly for some parts be benefitial? I don't know much about it but I'm willing to learn if there could be significant gains.
    Last edited by DrSnuggles; 03-14-2011 at 03:06 AM.

  8. #8
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    I doubt assembly would be beneficial in this case.

    The fact you have a number of if() statements nested in your loops - and those if() statements are testing things computed in outer loops - says there is plenty of scope for restructuring your code.

    For example, you have an outer loop iterating over i, and - within an inner loop - you are doing different things if i is zero versus non-zero. Restructuring your code to avoid any such if() tests will be beneficial.

    A rule of thumb: get your algorithm as efficient as possible before looking towards assembler. In practice, simplifying the flow structure of your code (eg reducing number of conditional branches) often correlates with greater efficiency - and your flow structure is unnecessarily complex.

    Incidentally, since you have a C++ class, you would be better off using operator new rather than malloc().
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  9. #9
    Registered User
    Join Date
    Oct 2007
    Posts
    166
    grumpy, I have tested to remove all the conditional statements (only one for the background layer) and the time is pretty much the same so it is obvious that it is the accessing and assigning of memory that is taking the meat part of the time. That is what I'd like to speed up somehow.

    I'll be happy to accept assembly example code if you can provide.

    It might be that this code section is not a good candidate for that though.
    Last edited by DrSnuggles; 03-14-2011 at 04:21 AM.

  10. #10
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Why are you using malloc in the first place?
    Also, do not simply remove the if statements and claim it has the same performance. Use a profiler, and run several runs to be sure.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  11. #11
    Registered User
    Join Date
    Oct 2007
    Posts
    166
    Quote Originally Posted by Elysia View Post
    Why are you using malloc in the first place?
    Also, do not simply remove the if statements and claim it has the same performance. Use a profiler, and run several runs to be sure.
    When I tested I put the if statement outside the loops and did the background layer seperately. The timing was pretty much the same. One if statement did not change much. I'm not asking you to optimize the whole of my code, I'm wondering about this specific issue of handling memory.

    I'm using malloc because it does not call the constructor of the class like "new" does.
    Last edited by DrSnuggles; 03-14-2011 at 04:29 AM.

  12. #12
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    As I said, I don't believe hand-crafted assember will help in your case.

    My point is that the control structure you have makes it harder to squeeze out performance. True, simplifying the code structure will not necessarily optimise your code. But it will make it easier to identify the parts that actually need to be optimised.

    Your code has some operations that only get executed on one iteration of an outer loop, mixed with other operations that get executed on every iteration of that outer loop. That makes it harder to identify which operations are the performance bottleneck.

    Basically, I consider that you are performing premature optimisation. If you run a profiler on your code, to identify the hot spots (i.e. those that consume CPU cycles) you will be better able to identify what needs to be optimised. But one hint: the output of profilers is less useful when control flow is unnecessarily complex.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  13. #13
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by DrSnuggles View Post
    I'm using malloc because it does not call the constructor of the class like "new" does.
    Then do not use a class! You are invoking undefined behavior here, which never is a good thing if you can avoid it.
    Also, this is possible premature optimization.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  14. #14
    Registered User
    Join Date
    Oct 2007
    Posts
    166
    Quote Originally Posted by grumpy View Post
    As I said, I don't believe hand-crafted assember will help in your case.

    My point is that the control structure you have makes it harder to squeeze out performance. True, simplifying the code structure will not necessarily optimise your code. But it will make it easier to identify the parts that actually need to be optimised.

    Your code has some operations that only get executed on one iteration of an outer loop, mixed with other operations that get executed on every iteration of that outer loop. That makes it harder to identify which operations are the performance bottleneck.

    Basically, I consider that you are performing premature optimisation. If you run a profiler on your code, to identify the hot spots (i.e. those that consume CPU cycles) you will be better able to identify what needs to be optimised. But one hint: the output of profilers is less useful when control flow is unnecessarily complex.
    I have already tested and timed pretty much all different scenarios and not found significant improvements. Like I said I have tested to run the code with just plain loops so the if statements are not the problem. I posted the surrounding code just to give you some context. It is the section where the background layer is handled that is relevant.

    Quote Originally Posted by Elysia View Post
    Then do not use a class! You are invoking undefined behavior here, which never is a good thing if you can avoid it.
    Also, this is possible premature optimization.
    No I'm not invoking undefined behavior if I make sure I fill in the buffers...

    Answer the question of the thread please and don't rant about "premature optimization".
    Last edited by DrSnuggles; 03-14-2011 at 04:44 AM.

  15. #15
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by DrSnuggles View Post
    No I'm not invking undefined behavior if I make sure I fill in the buffers...

    Answer the question of the thread please and don't rant about "premature optimization".
    Yes, you are. A class + malloc = undefined behavior.
    And I have already answered your question: use a profiler.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Memory Fragmentation with Dynamic FIFO Queue
    By fguy817817 in forum Linux Programming
    Replies: 17
    Last Post: 10-31-2009, 04:17 AM
  2. Replies: 4
    Last Post: 01-13-2008, 02:14 AM
  3. Question regarding Memory Leak
    By clegs in forum C++ Programming
    Replies: 29
    Last Post: 12-07-2007, 01:57 AM
  4. Memory problem with Borland C 3.1
    By AZ1699 in forum C Programming
    Replies: 16
    Last Post: 11-16-2007, 11:22 AM
  5. Shared Memory - shmget questions
    By hendler in forum C Programming
    Replies: 1
    Last Post: 11-29-2005, 02:15 AM