Fast(est) way to manipulate memory

**DrSnuggles** · 03-14-2011

Originally Posted by Elysia

Yes, you are. A class + malloc = undefined behavior.
And I have already answered your question: use a profiler.

I have said I have timed different versions of the code; you are just going to have to take my word for it. Go with the example in the initial post if you find the if statements in the other example too horrific.

**Elysia** · 03-14-2011

It doesn't matter if you time it. All you know is that it's slower or faster. But you don't know WHAT is faster or slower.
You need to find the hotspots in your code before optimizing. As we have said.
How do we do that? We use a profiler.

After we make a change, THEN we time.

**DrSnuggles** · 03-14-2011

Originally Posted by Elysia

It doesn't matter if you time it. All you know is that it's slower or faster. But you don't know WHAT is faster or slower.
You need to find the hotspots in your code before optimizing. As we have said.
How do we do that? We use a profiler.

After we make a change, THEN we time.

Well, if I'm only doing one thing in the loop, then that is what is slow. Like I said, I have run the basic loop separately and found it takes too much time.

**Elysia** · 03-14-2011

If you're not going to listen, then good luck on your own.

**DrSnuggles** · 03-14-2011

Originally Posted by Elysia

If you're not going to listen, then good luck on your own.

You are asking me to profile my whole program? That is not what I'm asking for. How is timing how long something takes not profiling? I have profiled and come to the conclusion that it is the part in the initial post that would benefit the most from being faster in this function.

**Elysia** · 03-14-2011

Why do you think your manual timing is better than a profiler?

**DrSnuggles** · 03-14-2011

Originally Posted by Elysia

Why do you think your manual timing is better than a profiler?

It is sufficient. If the whole loop takes, for example, 16 milliseconds, and running only the memory-assignment part takes 15 milliseconds, then I'm smart enough to figure out what is taking the most time.
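For reference, a manual measurement of this kind might look roughly like the sketch below. It is not the code from the thread: `Pixel`, `fill_image`, and `time_fill_ms` are hypothetical names, and the 4-byte pixel layout is an assumption. It times only the assignment loop with `std::chrono::steady_clock`, which shows how long the loop takes but not why.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical 4-byte pixel; the struct actually used in the thread may differ.
struct Pixel {
    std::uint8_t r, g, b, a;
};

// Assign the same colour to every pixel (the "assigning memory" part only).
void fill_image(std::vector<Pixel>& image, Pixel colour) {
    for (Pixel& p : image) {
        p = colour;
    }
}

// Time a single call to fill_image, in milliseconds, using a monotonic clock.
double time_fill_ms(std::vector<Pixel>& image, Pixel colour) {
    assert(!image.empty());
    const auto start = std::chrono::steady_clock::now();
    fill_image(image, colour);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_fill_ms` several times on a `std::vector<Pixel>(2048 * 2048)` and keeping the smallest value tends to give a steadier number than a single run. In a release build, the filled image should be read afterwards so the optimizer cannot discard the loop.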

The thing is also that my program is a plugin, and I do not have a debug version of the main program, so I only do release builds. I'm thinking profilers need debug builds.

**Elysia** · 03-14-2011

Profilers do not need debug builds, though it certainly makes it easier.
Also, do you know what part of the loop takes the most time?

**DrSnuggles** · 03-14-2011

Originally Posted by Elysia

Profilers do not need debug builds, though it certainly makes it easier.
Also, do you know what part of the loop takes the most time?

Yes, in my example I'm only doing one thing: assigning memory. Obviously the loop cases where I'm blending colors will take longer than just assigning the colors, but that is a separate issue. In this particular thread I'm wondering if something can be done to improve the speed of memory assignment/access.

**Salem** · 03-14-2011

> Right now it takes about 15 milliseconds to update a 2048x2048 image including a few calculations in my loop.
Which equates to 800MB/Sec, just for assigning RGB values.
How close is this to the sustained memory throughput of your system? Unless it's like less than 10%, then writing it in asm isn't going to get past this underlying physical reality.
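The arithmetic behind that figure can be checked with a small sketch. The function names are hypothetical, and reading "RGB" as 3 bytes per pixel is an assumption; with that reading, 2048 x 2048 x 3 bytes in 15 ms comes out at about 800 MiB/s, while 4 bytes per pixel would be roughly 1067 MiB/s.

```cpp
#include <cassert>
#include <cstddef>

// Bytes written per second when width * height pixels of bytes_per_pixel
// bytes each are written once in the given number of seconds.
double write_throughput_bytes_per_sec(std::size_t width, std::size_t height,
                                      std::size_t bytes_per_pixel,
                                      double seconds) {
    assert(seconds > 0.0);
    const double bytes =
        static_cast<double>(width) * height * bytes_per_pixel;
    return bytes / seconds;
}

// Convert bytes per second to MiB per second (1 MiB = 1024 * 1024 bytes).
double to_mib_per_sec(double bytes_per_sec) {
    return bytes_per_sec / (1024.0 * 1024.0);
}
```

Comparing the result against the machine's measured sustained write bandwidth (from a memory benchmark, for instance) gives the percentage the post is asking about.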

You mentioned later that you only update parts of the image. The best optimisation you can make is to not do something at all.
Are your sub-rectangles as small as possible?
Do they all change at the same rate? Is there any possibility to cache partial blends of the less frequently changing stuff?
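One way to picture "not doing something at all" is to touch only the rectangle that changed. The sketch below is an illustration under assumptions, not the poster's code: `Image`, `Rect`, and `fill_rect` are hypothetical, and the image is assumed to be row-major with one 32-bit value per pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical image type: row-major, one 32-bit value per pixel.
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint32_t> pixels; // size == width * height
};

// Hypothetical sub-rectangle, half-open: [x0, x1) x [y0, y1).
struct Rect {
    std::size_t x0, y0, x1, y1;
};

// Write `colour` only inside `dirty`; the rest of the image is not touched.
void fill_rect(Image& img, const Rect& dirty, std::uint32_t colour) {
    assert(dirty.x0 <= dirty.x1 && dirty.x1 <= img.width);
    assert(dirty.y0 <= dirty.y1 && dirty.y1 <= img.height);
    for (std::size_t y = dirty.y0; y < dirty.y1; ++y) {
        std::uint32_t* row = img.pixels.data() + y * img.width;
        std::fill(row + dirty.x0, row + dirty.x1, colour);
    }
}
```

The memory traffic then scales with the area of the rectangle rather than with the full 2048x2048 image, which is why shrinking the rectangles can matter more than speeding up each store.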

It's like trying to optimise bubble sort by focussing on strcpy. Unless you understand the full context of the code you're trying to improve, focussing on the bit in the middle won't do you a lot of good.

Since your struct is basically 4 bytes, consider using bitwise operators to unpack and pack RGB values into a 32-bit number. It's more code fiddle, but the relatively slow memory accesses (compared to register access times) would be a single long-word read/write rather than 6 bytes.
Whether this has any effect depends on how well your L1/L2 cache manages byte accesses.
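A minimal sketch of that packing idea follows. The helper names are hypothetical, and the channel order (R in the low byte, A in the high byte) is an assumption rather than the layout of the struct in the thread. The pixel is built in a register with shifts and ORs, then written with a single 32-bit store.

```cpp
#include <cassert>
#include <cstdint>

// Pack four 8-bit channels into one 32-bit value.
// Assumed order: R in bits 0-7, G in 8-15, B in 16-23, A in 24-31.
constexpr std::uint32_t pack_rgba(std::uint8_t r, std::uint8_t g,
                                  std::uint8_t b, std::uint8_t a) {
    return static_cast<std::uint32_t>(r)
         | (static_cast<std::uint32_t>(g) << 8)
         | (static_cast<std::uint32_t>(b) << 16)
         | (static_cast<std::uint32_t>(a) << 24);
}

// Unpack individual channels from a packed pixel.
constexpr std::uint8_t red(std::uint32_t p)   { return static_cast<std::uint8_t>(p & 0xFFu); }
constexpr std::uint8_t green(std::uint32_t p) { return static_cast<std::uint8_t>((p >> 8) & 0xFFu); }
constexpr std::uint8_t blue(std::uint32_t p)  { return static_cast<std::uint8_t>((p >> 16) & 0xFFu); }
constexpr std::uint8_t alpha(std::uint32_t p) { return static_cast<std::uint8_t>((p >> 24) & 0xFFu); }

// Build the pixel in a register and write it with one 32-bit store.
inline void store_rgba(std::uint32_t* dst, std::uint8_t r, std::uint8_t g,
                       std::uint8_t b, std::uint8_t a) {
    assert(dst != nullptr);
    *dst = pack_rgba(r, g, b, a);
}
```

An optimizing compiler may already merge adjacent byte stores into one word store, so whether this helps is something to measure rather than assume.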

**DrSnuggles** · 03-14-2011

Originally Posted by Salem

> Right now it takes about 15 milliseconds to update a 2048x2048 image including a few calculations in my loop.
Which equates to 800MB/Sec, just for assigning RGB values.
How close is this to the sustained memory throughput of your system? Unless it's like less than 10%, then writing it in asm isn't going to get past this underlying physical reality.

That could very well be the case. I'm not sure what memory bandwidth I have. It would be interesting to try asm, though perhaps it is not trivial. Perhaps I can go bug some people in an assembly forum.

You mentioned later that you only update parts of the image. The best optimisation you can make is to not do something at all.
Are your sub-rectangles as small as possible?
Do they all change at the same rate? Is there any possibility to cache partial blends of the less frequently changing stuff?

Yes, for the most part it is fairly optimized. I have to keep the memory footprint as low as possible, so storing values per pixel quickly eats up memory, but you gave me an idea about possibly precalculating some blending values per layer.

It's like trying to optimise bubble sort by focussing on strcpy. Unless you understand the full context of the code you're trying to improve, focussing on the bit in the middle won't do you a lot of good.

It is all my code, which I have profiled a lot, so I know where the performance bottlenecks are. This particular function is one of the most costly, and the assigning of memory for large images is costly.

Since your struct is basically 4 bytes, consider using bitwise operators to unpack and pack RGB values into a 32-bit number. It's more code fiddle, but the relatively slow memory accesses (compared to register access times) would be a single long-word read/write rather than 6 bytes.
Whether this has any effect depends on how well your L1/L2 cache manages byte accesses.

What do you mean 6 byte read/write? I'm only using 4.

edit: I see what you mean, but in the cases where I don't need to assign rgba values separately I just transfer the 4 bytes in one go, as you see in the second example. In the cases where I do set rgba values separately, it might be better to fill in an in-between variable and assign to memory once, but I doubt it. I'll try that.
