optimizing repetitive code

**R.Stiltskin** · 08-23-2008

Suppose a very simple snippet of code like this

Code:

for( counter = 0; counter < 3; ++counter ) {
    array[counter] = *(sourcePointer+counter);
}

is part of a loop that runs many thousands of times during a function's execution.
int counter;
int array[3];
int* sourcePointer;
are declared elsewhere, and sourcePointer is a different address on each iteration.

Would this run faster if written as

Code:

array[0] = *sourcePointer;
array[1] = *(sourcePointer+1);
array[2] = *(sourcePointer+2);

eliminating incrementing and testing the counter variable?
Or do compilers do this anyway?

(For real-time image analysis on low-powered equipment, so I'm curious about even very tiny savings.)

**matsp** · 08-23-2008

That would very much depend on the compiler. gcc 3.4 doesn't figure out that it's possible to replace the loop with three separate statements. Microsoft visual studio .Net does.

Edit: Why do you copy it into an array in the first place? Is there any reason you can't just use sourcePointer[0], sourcePointer[1] and sourcePointer[2] directly in your code?
--
Mats

**cyberfish** · 08-23-2008

I thought loop unrolling has been in gcc for a while? (-O3)

**R.Stiltskin** · 08-23-2008

Originally Posted by matsp

Why do you copy it into an array in the first place? Is there any reason you can't just use sourcePointer[0], sourcePointer[1] and sourcePointer[2] directly in your code?
--
Mats

A couple of reasons (which I hope are valid). Firstly, sourcePointer is one of the function's arguments, pointing to a camera image, so each use of sourcePointer[x] will involve a memory call. The array is local to this function and so (I hope) likely to be cached.

Also, the arraysize is actually 9, and all of its elements are called many times on each iteration so if I wrote it with 9 individual variables I'm afraid the code would be completely unreadable even to me.

(To alleviate the mystery, the code I'm working on is a convolution.)

**matsp** · 08-23-2008

Originally Posted by cyberfish

I thought loop unrolling has been in gcc for a while? (-O3)

Quite possibly - but -O3 or -funroll-loops doesn't make it do what I expected. <... time passes ...> -funroll-all-loops does, however.

With -O3 -funroll-all-loops -fomit-frame-pointer, gcc and MS VC++ generates identical code (aside from exactly which register is being used).

--
Mats

**matsp** · 08-23-2008

Originally Posted by R.Stiltskin

A couple of reasons (which I hope are valid). Firstly, sourcePointer is one of the function's arguments, pointing to a camera image, so each use of sourcePointer[x] will involve a memory call. The array is local to this function and so (I hope) likely to be cached.

Also, the arraysize is actually 9, and all of its elements are called many times on each iteration so if I wrote it with 9 individual variables I'm afraid the code would be completely unreadable even to me.

(To alleviate the mystery, the code I'm working on is a convolution.)

All processors I'm aware of [which are quite a few] that have any form of cache, doesn't care whether the data is local or global. So as long as your camera image is stored in cacheable memory in itself [modern OS's usually have a way that you can allocate memory with caching attributes - since sometimes non-cacheable data is better for performance, since flushing out the contents of cacheable memory to a device (or in a camera case, invalidating the CPU caches for that area of memory when a new image has been read in) can have a large impact on overall performance]. Obviously, if the camera driver allocates non-cacheable memory, then making a copy will definitely benefit your application, as a local variable will be on the stack, and only complete idiot system developers would allocate the stack in non-cacheable memory.

There are other factors of course - making a local array means that it can be held in local registers over function calls that doesn't hold the address to that data, for example.

Edit: Be aware of premature optimization. Get the overall code working first. Tricks to "be more clever than the compiler" sometimes backfires, and can depend very much on the compiler - and the last thing you want, I suspect, is code that works REALLY GREAT on one compiler, and then compile it with another and it gives you much worse code than the non-optimal case.

--
Mats

**R.Stiltskin** · 08-23-2008

Thanks, I'll have to chew on that for a while. I think I'll just have to experiment with a few approaches to see if any of them yields a time savings.

I assume that the "-O3 -funroll-all-loops -fomit-frame-pointer" mentioned above are compiler flags, correct? Do your comments apply to gcc-3.3 as well? What's the effect of -fomit-frame-pointer? When time permits I can experiment with those as well. I may not be able to use those flags, though -- I'm interfacing with a lot of code that's essentially a black box to me & setting these flags may cause problems.

**matsp** · 08-23-2008

I'm not sure what version of gcc has which flags, but I'm pretty sure -funroll-loops and -fomit-frame-pointer is available in 3.3.

-funroll[-all]-loops should definitely not cause any interface problems. -fomit-frame-pointer can cause difficulties to debug the generated code, but otherwise shouldn't cause any compatibility.

The "frame-pointer" is a register that holds "the stack pointer as we entered this function", which makes it easier for a debugger to find the local variables, and easier to track back to the previous function. Using "-fno-frame-pointer" gives two benefits:
1. the compiler doesn't store/restore the frame pointer in each function.
2. the compiler has one more register to use for variables.

Like most optimizatins, the compiler is free to "not do it" if it finds suitable reasons.

Note that -funrol-alll-loops is not always beneficial to whole programs - it will very agressively unroll all sorts of loops, even when there is no actual benefit [this is a case of "you asked for it, so that's what you get" optimization]. This is why this option is not on by default even in -O3.

--
Mats

**iMalc** · 08-23-2008

Here's a quick refresher on what the [] operator does:
a[b] is the same as *(a + b).
Therefore you could more tidily write your code snippet as this:

Code:

for( counter = 0; counter < 3; ++counter ) {
    array[counter] = sourcePointer[counter];
}

Or perhaps you could simply use memcpy:

Code:

memcpy(array, sourcePointer, 3 * sizeof(array[0]));

Of course if these were objects you were copying, you'd have to use std::copy instead. In fact it's not a bad idea anyway:

Code:

std::copy(array, &sourcePointer[0], &sourcePointer[3]);

As often happens, I wont yet be giving you any furthur optimisation advice here because you aren't yet giving us the big picture of what you're doing. You'd have to post more than your one little loop. The whole function it's in would be a good start.

**foxman** · 08-23-2008

I didn't read the whole thread, but before someone say so, you are not calling std::copy (ok, std::copy is a template, not a function, but I guess most people here understand what I mean) the correct way, iMalc. This should be

Code:

std::copy(&sourcePointer[0], &sourcePointer[3], array);

**grumpy** · 08-23-2008

Originally Posted by foxman

Code:

std::copy(&sourcePointer[0], &sourcePointer[3], array);

Since, in the original post, sourcePointer is a raw pointer to int, and array is a C-style array of int, this can be simplified to;

Code:

std::copy(sourcePointer, sourcePointer + 3, array);

**iMalc** · 08-24-2008

Yeah I'm prone to getting things front to back.

**abachler** · 08-24-2008

Most modern compilers automaticalyl unroll loops, btu in general this is limited to loops with 4 or fewer iterations and some other conditions such as no way to potentially modify the loop variable inside the loop. Manually unrolling the loop will not hurt performance however, so if it might help go ahead. Unrolling extremely long loops however can hurt performance if the unrolled code does not fit entirely into the cache. It is common in these cases to use partial loop unrolling.

e.g.

Code:

for( counter = 0; counter < 4; ++counter ){
    array[counter] = *(sourcePointer+counter);
    }

becomes

Code:

for( counter = 0; counter < 4; counter+=2 ) {
    array[counter] = *(sourcePointer+counter);
    array[counter+1] = *(sourcePointer+counter+1);
    }

of course you wouldnt do a partial unroll for such a short loop, I just used that as an example. Generally this is useful where the number of iterations is evenly divisible by some whole number and the fully unrolled loop would exceed about 1KB or has >64 iterations.

Thread: optimizing repetitive code

Thread Tools

Search Thread

Display

optimizing repetitive code

Similar Threads

Proposal: Code colouring

Values changing without reason?

Optimizing my code

Obfuscated Code Contest

Interface Question