It doesn't matter if you time it. All you know is that it's slower or faster. But you don't know WHAT is faster or slower.
You need to find the hotspots in your code before optimizing. As we have said.
How do we do that? We use a profiler.
After we make a change, THEN we time.
You are asking me to profile my whole program? That is not what I'm asking for. How is timing how long something takes not profiling? I have profiled and come to the conclusion that it is the part in the initial post that would benefit the most from being faster in this function.
It is sufficient. If the whole loop takes, for example, 16 milliseconds, and it takes 15 milliseconds when I run only the memory-assignment part, then I'm smart enough to figure out what is taking the most time.
The other thing is that my program is a plugin and I do not have a debug version of the main program, so I only do release builds. I'm thinking profilers need debug builds.
Last edited by DrSnuggles; 03-14-2011 at 05:33 AM.
Yes, in my example I'm only doing one thing: assigning memory. Obviously the loop cases where I'm blending colors will take longer than just assigning the colors, but that is a separate issue. In this particular thread I'm wondering if something can be done to improve the speed of memory assignment/access.
> Right now it takes about 15 milliseconds to update a 2048x2048 image including a few calculations in my loop.
Which equates to 800MB/Sec, just for assigning RGB values.
How close is this to the sustained memory throughput of your system? Unless it's like less than 10%, then writing it in asm isn't going to get past this underlying physical reality.
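For reference, the arithmetic behind that figure can be sketched as below. It appears to assume 3 bytes per pixel (RGB only) and binary megabytes; the function name is made up for illustration. With the full 4-byte struct the same timing works out to roughly 1067 MiB/s.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes touched per second, in MiB/s
 * (1 MiB = 1024 * 1024 bytes), for one full pass over an image. */
double throughput_mib_per_s(size_t width, size_t height,
                            size_t bytes_per_pixel, double milliseconds)
{
    assert(milliseconds > 0.0);  /* avoid dividing by zero */

    double bytes   = (double)width * (double)height * (double)bytes_per_pixel;
    double seconds = milliseconds / 1000.0;

    return bytes / (1024.0 * 1024.0) / seconds;
}
```

Calling `throughput_mib_per_s(2048, 2048, 3, 15.0)` gives about 800, which is the number to compare against whatever sustained write bandwidth your machine actually manages.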
You mentioned later that you only update parts of the image. The best optimisation you can make is to not do something at all.
Are your sub-rectangles as small as possible?
Do they all change at the same rate? Is there any possibility to cache partial blends of the less frequently changing stuff?
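A minimal sketch of the "touch only what changed" idea, assuming the image is a flat row-major array of 32-bit pixels. The `Rect` type and `fill_rect` are hypothetical names, not anything from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dirty rectangle, in pixel coordinates. */
typedef struct {
    size_t x, y, w, h;
} Rect;

/* Write `color` only into the pixels covered by `r`, clipped to the image.
 * Returns the number of pixels actually written. */
size_t fill_rect(uint32_t *pixels, size_t image_width, size_t image_height,
                 Rect r, uint32_t color)
{
    assert(pixels != NULL);

    /* Rectangle entirely outside the image: nothing to do. */
    if (r.x >= image_width || r.y >= image_height)
        return 0;

    /* Clip width and height so we never write past the image edges. */
    if (r.w > image_width - r.x)
        r.w = image_width - r.x;
    if (r.h > image_height - r.y)
        r.h = image_height - r.y;

    for (size_t row = 0; row < r.h; ++row) {
        uint32_t *dst = pixels + (r.y + row) * image_width + r.x;
        for (size_t col = 0; col < r.w; ++col)
            dst[col] = color;
    }
    return r.w * r.h;
}
```

A 256x256 rectangle inside a 2048x2048 image is 1/64 of the pixels, so the memory traffic drops by the same factor no matter how fast each individual store is.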
It's like trying to optimise bubble sort by focussing on strcpy. Unless you understand the full context of the code you're trying to improve, focussing on the bit in the middle won't do you a lot of good.
Since your struct is basically 4 bytes, consider using bitwise operators to unpack and pack RGB values into a 32-bit number. It's more code fiddle, but the relatively slow memory accesses (compared to register access times) would be a single long-word read/write rather than 6 bytes.
Whether this has any effect depends on how well your L1/L2 cache manages byte accesses.
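Something along these lines is what the packing suggestion amounts to. The channel order (R in the low byte, A in the high byte) and the function names are assumptions for illustration; match them to whatever layout the struct really has.

```c
#include <stdint.h>

/* Assumed layout: R in bits 0-7, G in 8-15, B in 16-23, A in 24-31. */
static inline uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint32_t)r
         | ((uint32_t)g << 8)
         | ((uint32_t)b << 16)
         | ((uint32_t)a << 24);
}

/* Extract one channel from a packed pixel. */
static inline uint8_t unpack_r(uint32_t p) { return (uint8_t)(p & 0xFFu); }
static inline uint8_t unpack_g(uint32_t p) { return (uint8_t)((p >> 8) & 0xFFu); }
static inline uint8_t unpack_b(uint32_t p) { return (uint8_t)((p >> 16) & 0xFFu); }
static inline uint8_t unpack_a(uint32_t p) { return (uint8_t)((p >> 24) & 0xFFu); }

/* Replace only the green channel, leaving the other three untouched. */
static inline uint32_t with_g(uint32_t p, uint8_t g)
{
    return (p & ~(0xFFu << 8)) | ((uint32_t)g << 8);
}
```

The idea is to read the pixel once into a local `uint32_t`, do all the channel work on that value, and write it back with a single store. A decent compiler may already do this for you, so measure before and after.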
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
If at first you don't succeed, try writing your phone number on the exam paper.
That could very well be the case. Not sure about what memory bandwidth I have. It would be interesting to try asm though but perhaps it is not trivial. Perhaps I can go bug some people in an assembly forum.
> You mentioned later that you only update parts of the image. The best optimisation you can make is to not do something at all.
> Are your sub-rectangles as small as possible?
> Do they all change at the same rate? Is there any possibility to cache partial blends of the less frequently changing stuff?
Yes, for the most part it is fairly optimized. I have to keep the memory footprint as low as possible, so storing values per pixel quickly eats up memory, but you gave me an idea about possibly precalculating some blending values per layer.
> It's like trying to optimise bubble sort by focussing on strcpy. Unless you understand the full context of the code you're trying to improve, focussing on the bit in the middle won't do you a lot of good.
It is all my code, which I have profiled a lot, so I know where the performance bottlenecks are. This particular function is one of the most costly, and the assigning of memory for large images is costly.
> Since your struct is basically 4 bytes, consider using bitwise operators to unpack and pack RGB values into a 32-bit number. It's more code fiddle, but the relatively slow memory accesses (compared to register access times) would be a single long-word read/write rather than 6 bytes.
> Whether this has any effect depends on how well your L1/L2 cache manages byte accesses.
What do you mean by a 6-byte read/write? I'm only using 4.
edit: I see what you mean, but in the cases where I don't need to assign RGBA values separately I just transfer the 4 bytes in one go, as you see in the second example. In the cases where I do set RGBA values separately, it might be better to fill in an intermediate variable and assign to memory once, but I doubt it. I'll try that.
Last edited by DrSnuggles; 03-14-2011 at 07:20 AM.