# Thread: Optimized clipping of floats/doubles ?

1. ## Optimized clipping of floats/doubles ?

Is there a way to do this more efficiently?
Code:
```cpp
inline void Clip( double &x, double a, double b ) // makes sure x is between a and b
{
    if (x < a) x = a;
    else if (x > b) x = b;
}
```
Note that this involves lotsa fnstsw instructions and conditional jumps, which are slow.

I personally (MSVC 8 on a Core 2 Duo) see almost a 50% speed increase if I do this:
Code:
```cpp
inline void Clip( double &x, double a, double b ) // makes sure x is between a and b
{
    x = b - x;
    x = (x + fabs(x)) * 0.5; // this essentially does "if (x<0) x=0;"
    x = b - x - a;
    x = (x + fabs(x)) * 0.5;
    x += a;
}
```
No status word crap, and no checks anymore.

Any other tricks to make this faster?
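For anyone who wants to sanity-check the trick before benchmarking it, here is a small standalone comparison (the function names are my own) showing that the fabs() version matches the branchy version for values below, inside, and above the range, assuming a <= b:

```cpp
#include <cmath>

// Branchy reference version from the first post.
inline void ClipRef(double &x, double a, double b)
{
    if (x < a) x = a;
    else if (x > b) x = b;
}

// Branchless fabs() version, annotated step by step.
inline void ClipFabs(double &x, double a, double b)
{
    x = b - x;
    x = (x + std::fabs(x)) * 0.5;   // x = max(b - x_orig, 0)
    x = b - x - a;                  // x = min(x_orig, b) - a
    x = (x + std::fabs(x)) * 0.5;   // x = max(min(x_orig, b) - a, 0)
    x += a;                         // x = max(min(x_orig, b), a) = clamp
}
```

Walking through the algebra: after the first two lines x holds max(b - x, 0), so subtracting it from b yields min(x, b); repeating the same move against a gives the lower clamp.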

2. Well that's about the smartest thing I've ever seen posted here.

3. You are kidding, right?

The second version takes about 6 times longer than the first one.

4. How do you measure, and what compiler/CPU?
I ran a loop a few million times, calling this function with several different cases in a row (and adding the results as dummy output, otherwise the compiler optimizes everything away).

Note that the fabs() function translates directly to a single FPU instruction, whereas the compare involves storing the FPU status word, plus you have to do a conditional jump.

5. Originally Posted by Phuzzillogic
How do you measure, and what compiler/CPU?
I ran a loop a few million times, calling this function with several different cases in a row (and adding the results as dummy output, otherwise the compiler optimizes everything away).
Indeed. After making sure that was not the case, I timed it myself on 10 million iterations and got 0.51 seconds for the if-statement version and 0.32 seconds for the fabs() version.

I used g++ 4.1.2 with -O3.

6. O_o

I'm thinking the second version would be best for general use anyway, since its cost is predictable: you can make a reasonable guess (or, for a given processor, know exactly) how many clocks any invocation will take.

That said, both versions were within a few clocks of each other on my system--with the first version always winning.

"The second version takes about 6 times longer than the first one."

Did you just count the number of instructions you could see? (Maybe on an old Pentium 3?)

Soma

Edit: I ran a simple test with `std::clock()`, passing the variables off as references to a "do nothing" function. (The "do nothing" function was compiled separately, preventing G++ from removing the test entirely.)

7. I'm guessing there isn't anything like CMOV for doubles.

8. Originally Posted by iMalc
I'm guessing there isn't anything like CMOV for doubles.
FCMOV - Wikipedia, the free encyclopedia

--
Mats

9. Here's a special case to clip between 0 and 1, using IEEE magic:
Code:
```cpp
inline void Clip01( const float *src, float *dst ) // *dst = *src clipped between 0 and 1
{
    static const int ieeeOne = 0x3f800000;
    int x = *(int*)(src);
    x = ieeeOne - (x & ~(x >> (sizeof(x)*8 - 1)));
    *(int*)dst = ieeeOne - (x & ~(x >> (sizeof(x)*8 - 1)));
}
```

10. Casting floating point to integer and back again is sometimes not faster than doing the relevant code in floating point. This is because it disrupts the flow within the processor (forward floating point data to the integer unit and the other way around). This sort of thing is fine if you do this in a loop that ONLY does clipping on a large array, but if you mix it with other math on the input variables, then it can be quite bad for performance.

It would most likely be more efficient to generate SSE code and use its min/max instructions.

--
Mats
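A sketch of that suggestion with SSE2 intrinsics (scalar form shown here; a real hot loop would use the packed `_mm_min_pd`/`_mm_max_pd` forms over whole arrays, and this assumes an x86 target):

```cpp
#include <emmintrin.h>   // SSE2 intrinsics

inline double ClipSSE(double x, double a, double b)   // clamp x to [a, b]
{
    __m128d v = _mm_set_sd(x);
    v = _mm_max_sd(v, _mm_set_sd(a));   // maxsd: v = max(x, a)
    v = _mm_min_sd(v, _mm_set_sd(b));   // minsd: v = min(v, b)
    return _mm_cvtsd_f64(v);
}
```

This compiles to a pair of branchless `maxsd`/`minsd` instructions with no FPU status word traffic at all, which is exactly the property the fabs() trick was approximating.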

11. To those inquiring about my timing results:

Yes, I looped through 5 million calls to each version and accounted for all the obvious things (I've been at this 28 years, so I'm familiar with the optimization pitfalls of such tests).

The first version of the inline consistently took 1/6th the time of the second version.

Visual Studio 2008, unmanaged build; results tested on an AMD X2 @ 3 GHz and an Intel Q6600 at stock speed, on XP 64 with both 32- and 64-bit targets.

Also, the assembler output, viewed in the disassembly view of the release build, shows considerably more complicated code for version 2. Optimization was set to favor speed.

I can't see how the first version can be slower based on what I see; how are you getting the results you are?

Are you expecting the multiplications to evaporate in some optimizations?

I'm on a cheap laptop right now and can't open the test project until tonight... I'm curious how it's even possible for the second version to be faster.

12. Originally Posted by someone
Core uses two floating point calculation units, one dedicated to addition and the other to multiplication and division. Theoretical calculation capacity is 2 x87 instructions per cycle and 2 SSE 128-bit floating point instructions per cycle (that is 8 operations on 32-bit single precision floats, or 4 operations on 64-bit double precision floats). Core is, in theory, two times faster for this type of instruction than Mobile, Netburst and K8. Let's see how it behaves with several SSE2 instructions.
Intel Core 2 Duo - Test - BeHardware

Your 'fast' code would seem to be benefiting from the alternating 'add' and 'mul' operations you've got going on in it.
Plus, it's entirely bound to the FPU (unlike your first example).

It's also highly dependent on having the right hardware to run on. As has already been noted, others will see much worse performance.