# issues in float division a/b

• 04-22-2008
George2
issues in float division a/b
Hello everyone,

If a and b are both float (or double, I think it does not matter?) and I want to calculate a/b, I am wondering whether there are any issues that could make the result inaccurate compared with the result calculated by hand (on paper). :-)

For example, what if a is too big, or b is too small? If such issues exist, what are the best practices for calculating a/b?

George
• 04-23-2008
Magos
There's always a risk that a floating point operation loses accuracy due to the finite precision of computers. However, I very much doubt you'll get problems with this unless you work for NASA or something, where precision is crucial :).

In general you should always use double instead of float, since internally on a (modern) processor they're treated the same. The only reason to choose float would be for storage reasons (like saving to a file), since it uses 4 bytes instead of 8.
• 04-23-2008
George2
Thanks Magos,

Quote:

Originally Posted by Magos
There's always a risk that a floating point operation loses accuracy due to the finite precision of computers. However, I very much doubt you'll get problems with this unless you work for NASA or something, where precision is crucial :).

In general you should always use double instead of float, since internally on a (modern) processor they're treated the same. The only reason to choose float would be for storage reasons (like saving to a file), since it uses 4 bytes instead of 8.

I do not understand what "they're treated the same" means, since the storage size is not the same ("it uses 4 instead of 8 bytes"). What do you mean by treated the same?

regards,
George
• 04-23-2008
Magos
When you use a float in a calculation, the processor actually uses a double and then "cuts off" the bits that will not fit in the float result.

A simple example:

You want to calculate: 0.12f + 3.45f (floats)
The processor actually calculates: 0.1200 + 3.4500 (doubles)
The result is: 3.5700 (double)
The result is cut off and presented to you: 3.57f (float)
• 04-23-2008
matsp
Also note that when dividing floating point numbers, the hardware divides the dividend's mantissa by the divisor's mantissa, then adjusts the exponent by subtracting the divisor's exponent from the dividend's. And of course, the sign is adjusted to match the combination of the signs of the dividend and divisor (an XOR of the sign bits does that correctly).

So whether the numbers are small or large makes no difference to the actual precision of the result. Obviously, double, having more bits, can produce a more precise result.

Multiply is essentially the same, except of course it multiplies the two mantissas and adds the exponents.

Much more sensitive to its inputs is add and subtract, since the two numbers have to be aligned (the mantissas shifted so that the "decimal point inside the mantissa" is in the same place), so adding or subtracting numbers that have a huge difference in magnitude will lose precision. Say we have 5 digits to work with (unrealistic, but simple):
Code:

float a = 123.45;   // -> 1.2345E2
float b = 0.102;    // -> 1.0200E-1
// align b to a -> b = 0.0010E2: the final "2" falls off the edge.
a -= b;
// a = 123.35, not 123.348

The calculation is imprecise because of this alignment of the numbers.

This becomes most noticeable when you add a small number to a large number and then subtract the original large number, e.g.:
Code:

float a = 2.0f;
float b = 0.00001f;
a += b;
a -= 2.0f;

If the compiler rounds to float at each step [which it may not do], you may find that a is zero rather than a small number (given enough zeros at the beginning of b; the value above may not be quite small enough).

Also note that Magos's example, where the intermediate calculation is done in double precision, only applies to SOME floating point processors. x87, for example, always calculates at full precision and then rounds to the precision requested, but other processors/compilers may perform the intermediate calculation in float precision.

--
Mats
• 04-23-2008
George2
Thanks Mats,

Two questions:

1. So add/subtract is where precision is most likely to be lost?

2. Any advice or best practices for dealing with this issue in add/subtract? :-)

Quote:

Originally Posted by matsp
Also note that when dividing floating point numbers, the hardware divides the dividend's mantissa by the divisor's mantissa, then adjusts the exponent by subtracting the divisor's exponent from the dividend's. And of course, the sign is adjusted to match the combination of the signs of the dividend and divisor (an XOR of the sign bits does that correctly).

So whether the numbers are small or large makes no difference to the actual precision of the result. Obviously, double, having more bits, can produce a more precise result.

Multiply is essentially the same, except of course it multiplies the two mantissas and adds the exponents.

Much more sensitive to its inputs is add and subtract, since the two numbers have to be aligned (the mantissas shifted so that the "decimal point inside the mantissa" is in the same place), so adding or subtracting numbers that have a huge difference in magnitude will lose precision. Say we have 5 digits to work with (unrealistic, but simple):
Code:

float a = 123.45;   // -> 1.2345E2
float b = 0.102;    // -> 1.0200E-1
// align b to a -> b = 0.0010E2: the final "2" falls off the edge.
a -= b;
// a = 123.35, not 123.348

The calculation is imprecise because of this alignment of the numbers.

This becomes most noticeable when you add a small number to a large number and then subtract the original large number, e.g.:
Code:

float a = 2.0f;
float b = 0.00001f;
a += b;
a -= 2.0f;

If the compiler rounds to float at each step [which it may not do], you may find that a is zero rather than a small number (given enough zeros at the beginning of b; the value above may not be quite small enough).

Also note that Magos's example, where the intermediate calculation is done in double precision, only applies to SOME floating point processors. x87, for example, always calculates at full precision and then rounds to the precision requested, but other processors/compilers may perform the intermediate calculation in float precision.

--
Mats

regards,
George
• 04-23-2008
matsp
1. Yes, precision problems with float are more likely with add/subtract, because the two numbers need to "line up", which means shifting the smaller number (adding zeros on the left-hand side of its mantissa), and that shifting loses precision.

2. That's a tricky one. If you know you have two large numbers whose difference is a small number, and you then add/subtract another small number, make sure it's done in the right order: cancel the two large numbers first. Don't add/subtract the small number to a large number "first" and then subtract the other large number.

--
Mats
• 04-23-2008
George2
Thanks Mats,

1.

--------------------
Also note that Magos's example showing that the intermediate calculation is done in double precision only applies to SOME floating point processors - x87 for example always calculates at full precision, then rounds it to the precision requested. But other processors/compilers may choose to perform the intermediate calculation in float precision.
--------------------

What does x87 mean?

2.

big number 1 - big number 2 - small number 3

rather than

big number 1 - big number 3 - small number 2

Quote:

Originally Posted by matsp
1. Yes, precision problems with float are more likely with add/subtract, because the two numbers need to "line up", which means shifting the smaller number (adding zeros on the left-hand side of its mantissa), and that shifting loses precision.

2. That's a tricky one. If you know you have two large numbers whose difference is a small number, and you then add/subtract another small number, make sure it's done in the right order: cancel the two large numbers first. Don't add/subtract the small number to a large number "first" and then subtract the other large number.

--
Mats

regards,
George
• 04-23-2008
matsp
1. x87 is the name of the FPU part of an x86 processor. It inherits its name from the days when the 8087, 80287 and 80387 were separate chips that plugged into a socket next to the processor itself. Nowadays (since the 486), the x87 part is integrated into the same chip.

2. No, I mean instead of:
big1 - small - big2, do big1 - big2 - small. I may have confused you (and/or myself) in the explanation, but in symbols it should be clear.

--
Mats
• 04-24-2008
George2
Thanks Mats,

You are so knowledgeable! Cool!

One more question about intermediate calculation steps. For example, in a - b - c, the intermediate result is a - b, and c is then subtracted from that intermediate result.

My question is: what precision or type is used to store the intermediate result? Float, double, decimal, unlimited length? Or the same as the operand or result type? :-)

Quote:

Originally Posted by matsp
1. x87 is the name of the FPU part of an x86 processor. It inherits its name from the days when the 8087, 80287 and 80387 were separate chips that plugged into a socket next to the processor itself. Nowadays (since the 486), the x87 part is integrated into the same chip.

2. No, I mean instead of:
big1 - small - big2, do big1 - big2 - small. I may have confused you (and/or myself) in the explanation, but in symbols it should be clear.

--
Mats

regards,
George
• 04-24-2008
matsp
Quote:

Originally Posted by George2
Thanks Mats,

You are so knowledgeable! Cool!

One more question about intermediate calculation steps. For example, in a - b - c, the intermediate result is a - b, and c is then subtracted from that intermediate result.

My question is: what precision or type is used to store the intermediate result? Float, double, decimal, unlimited length? Or the same as the operand or result type? :-)

regards,
George

The intermediate result will be "at least the precision of the greater type of a, b and c". If a, b and c are float, the intermediate result will be float or better; if a, b and c are double, the intermediate will be at least double. The compiler is certainly not allowed to "lose" precision. Where it becomes an issue is that some compilers/processors allow GREATER precision in the intermediate results, and others don't. This may even change between different builds (optimized vs. non-optimized, for example [1]) or options of the same compiler (e.g. -ffast-math on gcc will probably change this, along with the documented behaviour of that switch [which is that some exceptions may not be detected, and some obscure corner cases are handled the way the hardware does it rather than the way IEEE-754 says they should be]).

[1] This particularly would happen when you do:
Code:

float func1(float a, float b, float c)
{
    float x;
    float y;

    x = a - b;
    y = x - c;
    return y;
}

where the non-optimized version may well store x in memory after subtracting b from a, and load it back in again when computing y. An optimized version would just leave x in an FPU register and continue calculating y from that value.

--
Mats
• 04-24-2008
George2
Thanks Mats,

1.

The optimization sample is cool, but I am wondering what the relationship is between this sample and my question: how does the compiler deal with the intermediate result's precision?

Quote:

Originally Posted by matsp

[1] This particularly would happen when you do:
Code:

float func1(float a, float b, float c)
{
    float x;
    float y;

    x = a - b;
    y = x - c;
    return y;
}

where the non-optimized version may well store x in memory after subtracting b from a, and load it back in again when computing y. An optimized version would just leave x in an FPU register and continue calculating y from that value.

--
Mats

2. A further question: what is the difference between NaN and Infinity?

regards,
George
• 04-24-2008
matsp
1. Intermediate results will (assuming the expression is not so complex that the compiler runs out of registers) be held in floating point registers in optimized code, so intermediates may well have higher precision than the variables themselves. But relying on this across compilers/platforms puts you in "undefined behaviour" territory.

My optimization example was more to show that you can get different results from the same code depending on whether the compiler optimizes or not: in the non-optimized variant the compiler crops the intermediate to float before continuing the calculation, whilst the optimized code doesn't crop the intermediate result, so the result will be (slightly) different.

2. NaN is "Not a Number", which is the result of certain invalid math operations (such as sqrt(-1)).

Infinity is the result when a value's magnitude is bigger than the largest value that can be represented in the bits available.

If you get volume 5 of the X86-64 architecture documents, it describes all the x87 instructions, and part of that describes which result you get from which operation.

Available here:
http://www.amd.com/us-en/Processors/...9_7044,00.html

--
Mats
• 04-24-2008
George2
Cool, Mats!

IMHO, a register is either 32-bit or 64-bit. I think you mean that if the intermediate value is held in a register, it will be more precise than float. My question is: considering the length of a register, how can it hold higher precision than float (in my view, the longer the storage, the higher the precision and range)?

Quote:

Originally Posted by matsp
1. Intermediate results will (assuming the expression is not so complex that the compiler runs out of registers) be held in floating point registers in optimized code, so intermediates may well have higher precision than the variables themselves. But relying on this across compilers/platforms puts you in "undefined behaviour" territory.

My optimization example was more to show that you can get different results from the same code depending on whether the compiler optimizes or not: in the non-optimized variant the compiler crops the intermediate to float before continuing the calculation, whilst the optimized code doesn't crop the intermediate result, so the result will be (slightly) different.

2. NaN is "Not a Number", which is the result of certain invalid math operations (such as sqrt(-1)).

Infinity is the result when a value's magnitude is bigger than the largest value that can be represented in the bits available.

If you get volume 5 of the X86-64 architecture documents, it describes all the x87 instructions, and part of that describes which result you get from which operation.

Available here:
http://www.amd.com/us-en/Processors/...9_7044,00.html

--
Mats

regards,
George
• 04-24-2008
matsp
Actually, x87 registers are 80-bit, so even intermediate results of 64-bit (double) calculations can potentially carry extra precision[1]. SSE, by contrast, operates on 64-bit or 32-bit floating point values (and in 64-bit Windows you'd be using SSE, not x87, for math).

[1] Although I think Windows defaults the x87 control word to "round intermediate results to 64 bits".
--
Mats