Thread: issues in float division a/b

  1. #1
    Registered User
    Join Date
    May 2006
    Posts
    1,579

    issues in float division a/b

    Hello everyone,


    If a and b are both float (or double, I think it does not matter?), and I want to calculate a/b, I am wondering whether there are any issues that would make the calculated result inaccurate compared with the result calculated by hand (on paper). :-)

    For example, if a is too big, or if b is too small? If such issues exist, what are the best practices for calculating a/b?


    thanks in advance,
    George

  2. #2
    Confused Magos's Avatar
    Join Date
    Sep 2001
    Location
    Sweden
    Posts
    3,145
    There's always a risk that a floating point operation results in a loss of accuracy due to the finite precision of computers. However I very much doubt you'll get problems with this, unless you work for NASA or something, where precision is crucial.

    In general you should always use double instead of float, since internally on a (modern) processor they're treated the same. The only reason to choose float would be for storage reasons (like saving to a file), since it uses 4 instead of 8 bytes.
    MagosX.com

    Give a man a fish and you feed him for a day.
    Teach a man to fish and you feed him for a lifetime.

  3. #3
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks Magos,


    Quote Originally Posted by Magos View Post
    There's always a risk that a floating point operation results in a loss of accuracy due to the finite precision of computers. However I very much doubt you'll get problems with this, unless you work for NASA or something, where precision is crucial.

    In general you should always use double instead of float, since internally on a (modern) processor they're treated the same. The only reason to choose float would be for storage reasons (like saving to a file), since it uses 4 instead of 8 bytes.
    I do not understand what "they're treated the same" means, since the storage size is not the same -- "it uses 4 instead of 8 bytes". What do you mean by treated the same?


    regards,
    George

  4. #4
    Confused Magos's Avatar
    Join Date
    Sep 2001
    Location
    Sweden
    Posts
    3,145
    When you use a float in a calculation, the processor actually uses a double and then "cuts off" the bits that will not be used in the result.

    A simple example:

    You want to calculate: 0.12f + 3.45f (floats)
    The processor actually calculates: 0.1200 + 3.4500 (doubles)
    The result is: 3.5700 (double)
    The result is cut off and presented to you: 3.57f (float)
    MagosX.com

    Give a man a fish and you feed him for a day.
    Teach a man to fish and you feed him for a lifetime.

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Also note that when dividing floating point, we divide the dividend's mantissa by the divisor's mantissa, then adjust the exponents by subtracting the divisor's exponent from the dividend's. And of course, the sign is adjusted to match the combination of the signs of the dividend and divisor (I believe XOR will perform that correctly).

    So whether the numbers are small or large makes no difference to the actual precision of the result. Obviously, double having more bits means a more precise result can be produced.

    Multiply is essentially the same, except of course it multiplies the two mantissas and adds the exponents.

    Much more sensitive to inputs is add and subtract, since the two numbers have to be aligned (the mantissas must be shifted so that the "decimal point inside the mantissa" is in the same place), so adding or subtracting numbers that have a huge difference in magnitude will lose precision. Say we have 5 digits to work with (unrealistic but simple):
    Code:
    float a = 123.45f;  /* -> 1.2345E2 */
    float b = 0.102f;   /* -> 1.02E-1 */
    /* align b to a -> b = 0.0010E2: the trailing 2 falls off the edge */
    a -= b;
    /* a = 123.35, not 123.348 */
    The calculation is imprecise because of the alignment of the numbers.

    This becomes most noticeable when you add a small number to a large number, then subtract the original large number, e.g.:
    Code:
    float a = 2.0f;
    float b = 0.00001f;
    a += b;
    a -= 2.0f;
    If the compiler rounds on each step [which it may not do], you may find that a is zero rather than a small number (if there are enough zeros at the beginning of b - as written above it may not be quite small enough).

    Also note that Magos's example - showing the intermediate calculation done in double precision - only applies to SOME floating point processors: x87, for example, always calculates at full precision and then rounds to the precision requested. Other processors/compilers may choose to perform the intermediate calculation in float precision.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #6
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks Mats,


    1.

    Your reply is great! So, I think the accuracy issue is with add/subtract, not multiply/divide, right?

    2.

    Any advice or best practices to deal with the add/subtract issue? :-)

    Quote Originally Posted by matsp View Post
    Also note that when dividing floating point, we divide the dividend's mantissa by the divisor's mantissa, then adjust the exponents by subtracting the divisor's exponent from the dividend's. And of course, the sign is adjusted to match the combination of the signs of the dividend and divisor (I believe XOR will perform that correctly).

    So whether the numbers are small or large makes no difference to the actual precision of the result. Obviously, double having more bits means a more precise result can be produced.

    Multiply is essentially the same, except of course it multiplies the two mantissas and adds the exponents.

    Much more sensitive to inputs is add and subtract, since the two numbers have to be aligned (the mantissas must be shifted so that the "decimal point inside the mantissa" is in the same place), so adding or subtracting numbers that have a huge difference in magnitude will lose precision. Say we have 5 digits to work with (unrealistic but simple):
    Code:
    float a = 123.45f;  /* -> 1.2345E2 */
    float b = 0.102f;   /* -> 1.02E-1 */
    /* align b to a -> b = 0.0010E2: the trailing 2 falls off the edge */
    a -= b;
    /* a = 123.35, not 123.348 */
    The calculation is imprecise because of the alignment of the numbers.

    This becomes most noticeable when you add a small number to a large number, then subtract the original large number, e.g.:
    Code:
    float a = 2.0f;
    float b = 0.00001f;
    a += b;
    a -= 2.0f;
    If the compiler rounds on each step [which it may not do], you may find that a is zero rather than a small number (if there are enough zeros at the beginning of b - as written above it may not be quite small enough).

    Also note that Magos's example - showing the intermediate calculation done in double precision - only applies to SOME floating point processors: x87, for example, always calculates at full precision and then rounds to the precision requested. Other processors/compilers may choose to perform the intermediate calculation in float precision.

    --
    Mats

    regards,
    George

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    1. Yes, precision problems with float are more likely with add/subtract because the two numbers need to "line up", which means shifting the smaller number by adding zeros on the left-hand side, which leads to loss of precision.

    2. That's a tricky one. If you have two large numbers whose difference forms a small number, and you then need to add or subtract another small number, make sure it's done in the right order: cancel the two large numbers first; don't add/subtract the small number to a large number "first" and then subtract the other large number.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #8
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks Mats,


    1.

    --------------------
    Also note that Magos's example showing that the intermediate calculation is done in double precision only applies to SOME floating point processors - x87 for example always calculates at full precision, then rounds it to the precision requested. But other processors/compilers may choose to perform the intermediate calculation in float precision.
    --------------------

    What does x87 mean?

    2.

    Your proposed solution below is,

    big number 1 - big number 2 - small number 3

    rather than

    big number 1 - big number 3 - small number 2

    Quote Originally Posted by matsp View Post
    1. Yes, precision problems with float are more likely with add/subtract because the two numbers need to "line up", which means shifting the smaller number by adding zeros on the left-hand side, which leads to loss of precision.

    2. That's a tricky one. If you have two large numbers whose difference forms a small number, and you then need to add or subtract another small number, make sure it's done in the right order: cancel the two large numbers first; don't add/subtract the small number to a large number "first" and then subtract the other large number.

    --
    Mats

    regards,
    George

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    1. x87 is the name of the FPU part of an x86 processor. It inherits its name from the days when the 8087, 80287 and 80387 were separate chips that plugged into a socket next to the processor itself. Nowadays (since the 486 onwards), the x87 part is integrated into the same chip as the processor.

    2. No, I mean instead of:
    big1 - small - big2, do big1 - big2 - small. I may have confused you (and/or myself) in the explanation, but in symbols it should be clear.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #10
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks Mats,


    You are so knowledgable! Cool!

    One more question about intermediate results: for example, in a - b - c, the intermediate result is a - b, and then c is subtracted from that intermediate result.

    My question is: what precision or type is used to store the intermediate result? Float, double, decimal, unlimited length? Or the same as the operand or result type? :-)

    Quote Originally Posted by matsp View Post
    1. x87 is the name of the FPU part of an x86 processor. It inherits its name from the days when the 8087, 80287 and 80387 were separate chips that plugged into a socket next to the processor itself. Nowadays (since the 486 onwards), the x87 part is integrated into the same chip as the processor.

    2. No, I mean instead of:
    big1 - small - big2, do big1 - big2 - small. I may have confused you (and/or myself) in the explanation, but in symbols it should be clear.

    --
    Mats

    regards,
    George

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by George2 View Post
    Thanks Mats,


    You are so knowledgable! Cool!

    One more question about intermediate results: for example, in a - b - c, the intermediate result is a - b, and then c is subtracted from that intermediate result.

    My question is: what precision or type is used to store the intermediate result? Float, double, decimal, unlimited length? Or the same as the operand or result type? :-)




    regards,
    George
    The intermediate result will be "at least the precision of the greater type of a, b and c". If a, b and c are float, then the intermediate result will be float or better; if a, b and c are double, the intermediate will be at least double. The compiler is certainly not allowed to "lose" precision. But where it becomes an issue is that some compilers/processors allow for GREATER precision in the intermediate results, and others don't. This may even change between different builds (optimized vs. non-optimized, for example [1]) or options of the same compiler (e.g. -ffast-math on gcc will probably change this, along with the documented behaviour of that switch [which is that some exceptions may not be detected, and some obscure corner cases are dealt with the way the hardware does it, rather than how IEEE-754 says they should be]).

    [1] This particularly would happen when you do:
    Code:
    float func1(float a, float b, float c)
    {
       float x;
       float y;

       x = a - b;
       y = x - c;
       return y;
    }
    where the non-optimized version may well store x to memory (as a float) after subtracting b from a, and load it back again when computing the "y" line. An optimized version would just leave x in an FPU register and continue calculating y from that value.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks Mats,


    1.

    The optimization sample is cool. But I am wondering what the relationship is between this sample and my question -- how does the compiler deal with the precision of intermediate results?

    Quote Originally Posted by matsp View Post

    [1] This particularly would happen when you do:
    Code:
    float func1(float a, float b, float c)
    {
       float x;
       float y;

       x = a - b;
       y = x - c;
       return y;
    }
    where the non-optimized version may well store x to memory (as a float) after subtracting b from a, and load it back again when computing the "y" line. An optimized version would just leave x in an FPU register and continue calculating y from that value.

    --
    Mats
    2. A further question: what is the difference between NaN and Infinity?


    regards,
    George

  13. #13
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    1. Intermediate results will (assuming they are not so complex that the compiler runs out of registers) be held in floating point registers in optimized code, so intermediates are likely to be of higher precision than the variables if the register format is wider than the variable itself. But relying on this across compilers/platforms puts you in "undefined behaviour" territory.

    My optimization example was more to show that you can get different results from the same code depending on whether the compiler optimizes or not - in the non-optimized variant, the compiler crops the number to float before continuing the calculation, whilst in the optimized code it doesn't crop the intermediate result, so the result will be (slightly) different.

    2. NaN is "Not a number", which is the result of certain illegal math operations (such as sqrt(-1)).

    Infinity is produced when the result is bigger than the largest value that can be represented in the bits available.

    If you get volume 5 of the X86-64 architecture documents, it describes all the x87 instructions, and part of that describes which result you get from which operation.

    Available here:
    http://www.amd.com/us-en/Processors/...9_7044,00.html

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  14. #14
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Cool, Mats!


    IMHO, a register is either 32-bit or 64-bit. I think you mean that if the intermediate value is held in a register, it will be more precise than float. My question is: considering the length of a register, will it be able to hold higher precision than float? (In my view, the longer the storage, the higher the precision and range.)

    Any comments?

    Quote Originally Posted by matsp View Post
    1. Intermediate results will (assuming they are not so complex that the compiler runs out of registers) be held in floating point registers in optimized code, so intermediates are likely to be of higher precision than the variables if the register format is wider than the variable itself. But relying on this across compilers/platforms puts you in "undefined behaviour" territory.

    My optimization example was more to show that you can get different results from the same code depending on whether the compiler optimizes or not - in the non-optimized variant, the compiler crops the number to float before continuing the calculation, whilst in the optimized code it doesn't crop the intermediate result, so the result will be (slightly) different.

    2. NaN is "Not a number", which is the result of certain illegal math operations (such as sqrt(-1)).

    Infinity is produced when the result is bigger than the largest value that can be represented in the bits available.

    If you get volume 5 of the X86-64 architecture documents, it describes all the x87 instructions, and part of that describes which result you get from which operation.

    Available here:
    http://www.amd.com/us-en/Processors/...9_7044,00.html

    --
    Mats

    regards,
    George

  15. #15
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Actually, x87 registers are 80-bit, so even intermediate results in 64-bit (double) calculations can potentially carry extra precision [1]. SSE arithmetic, on the other hand, is performed at 64- or 32-bit precision, with no extended intermediates. (In Windows-64, you'd be using SSE, not x87, for math.)

    [1] Although I think Windows defaults the x87 precision control to "round intermediate results to 64-bit double precision".
    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.
