Fast(est) way to manipulate memory

**DrSnuggles** · 03-13-2011

Hi, I'm trying to get updating the pixels of an image as fast as possible. I'm wondering if there is some C++ magic I can apply to make it faster. Or some other kind of magic...

Right now I'm basically doing like below. The actual loop I use has a more calculations in it.

Code:

for (int y=0;y<height;y++)
{
    int yval = width * y;
    for (int x=0;x<width;x++)
    {
        COLOR* color1 = &colors1[yval + x];
        COLOR* color2 = &colors2[yval + x];
        color1->r = color2->r;
        color1->g = color2->g;
        color1->b = color2->b;
    }
}

I'm looking for something different to make this blazing fast, perhaps something lower level. Do you have any suggestions?

I tested switching to using pointer arithmetic for the color array access but that didn't make much difference.

Right now it takes about 15 milliseconds to update a 2048x2048 image including a few calculations in my loop. It is not a whole lot but if I need to do it several times in one code cycle which I often do then one quickly approaches 1/10th of a second where lag can start to be noticable.

**grumpy** · 03-13-2011

That depends on what COLOR is.

**Elkvis** · 03-13-2011

if COLOR is a simple struct with r, g, b members that are POD types, then the code you posted will likely produce performance that is virtually indistinguishable from other methods (inline asm, etc). if you enable optimizations in the compiler, it will take the code you wrote, which is pretty straightforward, and already optimized pretty well, and generate really fast machine code

**grumpy** · 03-13-2011

Originally Posted by Elkvis

if COLOR is a simple struct with r, g, b members that are POD types, then the code you posted will likely produce performance that is virtually indistinguishable from other methods (inline asm, etc).

If COLOR is a POD type, and it's only members are r,g,b (which must also be POD types, by definition) then the inner loop may be replaced by a single call to memcpy() - which will almost certainly not be outperformed by the loop (as a number of modern compilers inline memcpy() calls) and the code would be simpler.

**iMalc** · 03-13-2011

Unless the above code takes the same amount of time to execute as your real code, then you need help optimising the whole thing. However unfortunately you've possibly left out the bits where we can make the most gains.
You need to post the whole relevant section as it appears in your actual code, and we need everything to compile it ourselves, e.g. the COLOR definition etc.

**Salem** · 03-14-2011

How did you declare colors1 and colors2?

If they're actual 2D arrays that are in scope, then simplify your code to

Code:

for (int y=0;y<height;y++)
{
    for (int x=0;x<width;x++)
    {
        colors1[y][x].r = colors2[y][x].r;
        colors1[y][x].g = colors2[y][x].g;
        colors1[y][x].b = colors2[y][x].b;
    }
}

The base address of an array in scope arrives as part of the instruction fetch. The compiler also knows it is a const address, so it has more scope for sub-expression optimisation.
Using a pointer means a) one more memory cycle to dereference the pointer, and b) less scope for optimisation of const sub-expressions.
Setting a pointer to the start of an array is one of those 1980's pseudo-hacks that somehow refuses to die off (much like the xor swap nonsence).

If all you have is a pointer, then you may as well make use of the fact that the whole block of memory is contiguous.

Code:

COLOR* color1 = colors1;
COLOR* color2 = colors2;
for (int y=0;y<height;y++)
{
    for (int x=0;x<width;x++,color1++,color2++)
    {
        color1->r = color2->r;
        color1->g = color2->g;
        color1->b = color2->b;
    }
}

If there is nothing else (apart from your yval calculation), you can flatten this into a single loop running from 0 to width*height.

**DrSnuggles** · 03-14-2011

Thanks for the replies. Yes I should have included what the COLOR class is, it is a very basic 4 byte structure and looks like this:

Code:

class COLOR
{
public:
	BYTE b;
	BYTE g;
	BYTE r;
	BYTE a;
	COLOR() {r = g = b = a = 0;}
	COLOR(int rr, int gg, int bb, int aa)
	{
		r = rr;
		g = gg;
		b = bb;
		a = aa;
	}
};

I create the color buffers in one contiguous block of memory like this:

Code:

colors1 = (COLOR *)malloc(4 * (width*height));

The loops cannot be collapsed into one because one of the buffers (Direct3D texture) is flipped in the y axis and I usually only update part of the whole texture.

I'll provide a more complete example of what I'm doing, I'm blending different layers into a Direct3D texture. I have set it to use pointer arithmetic now:

Code:

for (int i=0;i<numLayers;i++)
{
        //Setup info about layer rect and properties here

        for (int y=startY;y<endY;y++)
       {
	        int yval = layerWidth * (y - rectTop);
	        int yvalD3D = width * (heightMinusOne - y);
	        COLOR* col = layerColors + (yval + (startX - rectLeft));
	        COLOR* colD3D = colors + (yvalD3D + startX);

	        if (i) //Not background layer
	        {
		         int blend;
		         COLOR24 blendedCol;
		         BYTE* source;
		         BYTE* dest;
		         for (int x=startX;x<endX;x++, col++, colD3D++)
		         {
			        if (!col->a) continue;
			        blend = col->a * opacity;

			        if (blendMode) blendedCol = BlendColors(colD3D,col,blendMode);
			        else blendedCol = COLOR24(col->r,col->g,col->b);

			        if (blend == 255)
			        {
				        colD3D->r = blendedCol.r;
				        colD3D->g = blendedCol.g;
				        colD3D->b = blendedCol.b;
			        }
			        else
			        {
				        source = blendingTable[blend].row;
				        dest = blendingTable[255 - blend].row;
				        colD3D->r = source[blendedCol.r] + dest[colD3D->r];
				        colD3D->g = source[blendedCol.g] + dest[colD3D->g];
				        colD3D->b = source[blendedCol.b] + dest[colD3D->b];
			        }
		        }
	        }
	        else //Background layer
	        {
		         for (int x=startX;x<endX;x++, col++, colD3D++)
		        {
			        *colD3D = *col;
		        }
	        }
        }
}

The test time I posted is simply for updating the background layer though so as you can see there are not many other calculations in the loops then.

I'm not sure if things can be optimized, would using assembly for some parts be benefitial? I don't know much about it but I'm willing to learn if there could be significant gains.

**grumpy** · 03-14-2011

I doubt assembly would be beneficial in this case.

The fact you have a number of if() statements nested in your loops - and those if() statements are testing things computed in outer loops - says there is plenty of scope for restructuring your code.

For example, you have an outer loop iterating over i, and - within an inner loop - you are doing different things if i is zero versus non-zero. Restructuring your code to avoid any such if() tests will be beneficial.

A rule of thumb: get your algorithm as efficient as possible before looking towards assembler. In practice, simplifying the flow structure of your code (eg reducing number of conditional branches) often correlates with greater efficiency - and your flow structure is unnecessarily complex.

Incidentally, since you have a C++ class, you would be better off using operator new rather than malloc().

**DrSnuggles** · 03-14-2011

grumpy, I have tested to remove all the conditional statements (only one for the background layer) and the time is pretty much the same so it is obvious that it is the accessing and assigning of memory that is taking the meat part of the time. That is what I'd like to speed up somehow.

I'll be happy to accept assembly example code if you can provide.

It might be that this code section is not a good candidate for that though.

**Elysia** · 03-14-2011

Why are you using malloc in the first place?
Also, do not simply remove the if statements and claim it has the same performance. Use a profiler, and run several runs to be sure.

**DrSnuggles** · 03-14-2011

Originally Posted by Elysia

Why are you using malloc in the first place?
Also, do not simply remove the if statements and claim it has the same performance. Use a profiler, and run several runs to be sure.

When I tested I put the if statement outside the loops and did the background layer seperately. The timing was pretty much the same. One if statement did not change much. I'm not asking you to optimize the whole of my code, I'm wondering about this specific issue of handling memory.

I'm using malloc because it does not call the constructor of the class like "new" does.

**grumpy** · 03-14-2011

As I said, I don't believe hand-crafted assember will help in your case.

My point is that the control structure you have makes it harder to squeeze out performance. True, simplifying the code structure will not necessarily optimise your code. But it will make it easier to identify the parts that actually need to be optimised.

Your code has some operations that only get executed on one iteration of an outer loop, mixed with other operations that get executed on every iteration of that outer loop. That makes it harder to identify which operations are the performance bottleneck.

Basically, I consider that you are performing premature optimisation. If you run a profiler on your code, to identify the hot spots (i.e. those that consume CPU cycles) you will be better able to identify what needs to be optimised. But one hint: the output of profilers is less useful when control flow is unnecessarily complex.

**Elysia** · 03-14-2011

Originally Posted by DrSnuggles

I'm using malloc because it does not call the constructor of the class like "new" does.

Then do not use a class! You are invoking undefined behavior here, which never is a good thing if you can avoid it.
Also, this is possible premature optimization.

**DrSnuggles** · 03-14-2011

Originally Posted by grumpy

As I said, I don't believe hand-crafted assember will help in your case.

My point is that the control structure you have makes it harder to squeeze out performance. True, simplifying the code structure will not necessarily optimise your code. But it will make it easier to identify the parts that actually need to be optimised.

Your code has some operations that only get executed on one iteration of an outer loop, mixed with other operations that get executed on every iteration of that outer loop. That makes it harder to identify which operations are the performance bottleneck.

Basically, I consider that you are performing premature optimisation. If you run a profiler on your code, to identify the hot spots (i.e. those that consume CPU cycles) you will be better able to identify what needs to be optimised. But one hint: the output of profilers is less useful when control flow is unnecessarily complex.

I have already tested and timed pretty much all different scenarios and not found significant improvements. Like I said I have tested to run the code with just plain loops so the if statements are not the problem. I posted the surrounding code just to give you some context. It is the section where the background layer is handled that is relevant.

Originally Posted by Elysia

Then do not use a class! You are invoking undefined behavior here, which never is a good thing if you can avoid it.
Also, this is possible premature optimization.

No I'm not invoking undefined behavior if I make sure I fill in the buffers...

Answer the question of the thread please and don't rant about "premature optimization".

**Elysia** · 03-14-2011

Originally Posted by DrSnuggles

No I'm not invking undefined behavior if I make sure I fill in the buffers...

Answer the question of the thread please and don't rant about "premature optimization".

Yes, you are. A class + malloc = undefined behavior.
And I have already answered your question: use a profiler.

Thread: Fast(est) way to manipulate memory

Thread Tools

Search Thread

Display

Fast(est) way to manipulate memory

Similar Threads

Memory Fragmentation with Dynamic FIFO Queue

Still confused why working set larger than virtual memory

Question regarding Memory Leak

Memory problem with Borland C 3.1

Shared Memory - shmget questions