## Optimizing code

1. I need some advice on speeding up my code:

Question:

Is x = y / foo(bar) significantly slower than having foo return the reciprocal of its usual return value and changing the line to:

x = y * foo(bar)?

Also, is there a significant speed difference between the operation 1 / x as opposed to y / x where y != 1?

Thanks

2. The answer depends on which CPU you are using. At any rate, even if you got a speed increase, it would be very small, and in my opinion it would not be worth the loss of code readability.

Also keep in mind that a change like that could affect the precision of the result (I'm assuming you are doing floating-point arithmetic here).

3. Yeah, I did some timing and the speed difference is negligible, on the order of 0.1%. Dang..

4. If the hardware can compute a reciprocal faster than a full divide - and some hardware does have special microcode or dedicated units for exactly that - then yes.

But normally, the reason for calculating 1/y and then x * (1/y) is that y is (relatively speaking) a constant: if we only need to calculate 1/y once per loop, each x * (1/y) becomes a multiply instead of a divide, which is faster than x / y. If y changes every iteration, there is little benefit in calculating 1/y and multiplying x * (1/y).

As an example:
AMD Family 10 (latest generation processors):
SSE2:
DIVPS = 20 clocks.
MULPS = 4 clocks.

X87:
FDIV = 16-24 clocks depending on precision.
FMUL = 11 clocks.

So in the above cases, you only need about two x * (1/y) operations reusing the same precalculated (1/y) for the multiply version to come out faster.

--
Mats

5. On x86, floating point division is much slower than multiplication. But it usually does not matter because the CPU is able to pipeline other operations in parallel with the division operation.

While the FP divider unit is processing, other instructions that do not require the FPU can execute simultaneously. This can sometimes hide the latency of the division to the point of being negligible.

Having said that, the multiply-by-reciprocal-instead-of-divide trick is commonly used, and can definitely result in a speedup in some situations, especially if it's easy to compute the reciprocal without actually dividing.

If you are computing the reciprocal by doing a 1.0 / x, then you're still dividing and not really gaining anything.

EDIT: There is a fast square-root algorithm using Newton's method that actually computes 1.0 / sqrt(x) instead of sqrt(x); you must invert the result at the end. But the key to the trick is that the Newton loop contains no divisions, unlike the direct Newton computation.

6. That square root algorithm could help me a lot - do you know where I can find it, or is it a library function?

Actually, a more specific question:

In my loop, I compute the square root of a double and cast it to an int, like

x = (int)(sqrt(y) + 0.5). Since I cast it like this, I only care about the first decimal place of the square root. Do you have any suggestions for a quick way to get the square root of a double to one decimal place?

7. Understanding Quake’s Fast Inverse Square Root | BetterExplained
OK, it's 1/sqrt(x), but it might get you close enough, and quicker than what you have now.

8. Originally Posted by KBriggs
That square root algorithm could help me a lot - do you know where I can find it, or it a library function?

Actually, a more specific question:

In my loop, I compute the square root of a double and cast it to an int, like

x = (int)(sqrt(y) + 0.5). Since I cast it like this, I only care about the first decimal place of the square root. Do you have any suggestions for a quick way to get the square root of a double to one decimal place?
What's the range of your double? If your y's are reasonably close together you could try a Taylor series expansion, or use the fast 1/ as Salem mentioned.

9. The double could be anything, but reasonably it won't be much more than say, 500.

How about something like this as a quick hack at getting one-decimal accuracy using Newton's method? Right now I haven't done the I/O stuff; I am just altering numbers as needed.

Code:
```#include <stdio.h>

int main(void)
{
    double k, prevroot, nextroot, precision;

    k = 29.5546743;   /* value to take the square root of */
    prevroot = 15;    /* initial guess */
    nextroot = 0;
    precision = 1.0;

    while (precision > 0.05)
    {
        /* Newton's method for sqrt(k): x <- (x*x + k) / (2*x) */
        nextroot = (prevroot * prevroot + k) / (2 * prevroot);
        precision = nextroot - prevroot;
        if (precision < 0)
        {
            precision *= -1;
        }
        prevroot = nextroot;
    }
    printf("%f\n", nextroot);
    return 0;
}```
My question is: since Newton's method depends on the initial guess, what is a good initial guess for sqrt(k)? And would this be any faster than traditional methods?

Actually, scratch that, this seems to work (I just took it from the link):

Code:
```#include <stdio.h>

int main(void)
{
    float x = 29.456f;
    float xhalf = 0.5f * x;
    int i;

    i = *(int *)&x;                  /* store floating-point bits in an integer */
    i = 0x5f3759d5 - (i >> 1);       /* magic initial guess for Newton's method */
    x = *(float *)&i;                /* convert the new bits back into a float */
    x = x * (1.5f - xhalf * x * x);  /* one round of Newton's method */
    printf("%f\n", 1.0 / x);         /* x is ~1/sqrt(x), so invert at the end */
    return 0;
}```
I don't even begin to understand what's going on there, but thanks so much

And one last edit: That fastsqrt actually makes my code run 20% slower than the library sqrt function haha. Any other ideas?

10. > The double could be anything, but reasonably it won't be much more than say, 500.
That's small enough to consider a lookup table IMO.

Or at least a lookup table which spans, say, 95%+ of the expected values (very quick), then fall back to a calculation for the odd-ball exceptions which remain outside the table range.

11. Eh, the program already uses quite a bit of memory, I don't know if a lookup table would help anything.

12. Originally Posted by KBriggs
And one last edit: That fastsqrt actually makes my code run 20% slower than the library sqrt function haha. Any other ideas?
The library sqrt() probably calls a native instruction. It used to be (at the time Quake was written) that this was slower than computing it using the trick listed above. Apparently, these days the performance is improved.

13. I would try the Taylor expansion method, but I don't know how.

14. > Eh, the program already uses quite a bit of memory, I don't know if a lookup table would help anything.
The square root of 500 fits easily into a byte.
So it's a 500 byte lookup table.

Are you seriously that short of memory? Are you programming for some embedded device with only a meg (or less) of memory, or a standard desktop with lots of memory?

Plus, if this is the only use of sqrt() in your code, you save some by not linking with the library code.

15. I guess I am not really that short haha. I'll look into it. I've never used one before, so I am not really sure how to implement it.