Thread: Problems with floating point (I hope you like it!)

  1. #1
    Registered User
    Join Date
    Feb 2019
    Posts
    498

    Problems with floating point (I hope you like it!)

    Sometimes, when we think about big numbers or precision we tend to think about floating point (float, double, long double). I like to think about floating point values as more precise than integers (having a fraction part), but "inexact" (in a sense they are approximations!). And, to make things worse, taking double type as an example, they have less significant bits available than its integer counterpart (long long versus double, f.i, both with 64 bits in size).

    The float type is the same as double, but have 24 bits of precision, 8 bits of expoent. The long double has 64 bits of precision and 15 bits of expoent (80 bits in size)... There is also a __float128 type on C99+ standards (113 bits of precision, 15 bits of expoent), but this type is not supported by hardware (co-processor or SSE).

    This text talks about IEEE 754 standard, used by almost all compilers... The term "precision" has two meanings when dealing with floating point. First, for printf(), is the count of decimal places, like in "%.2f" (2 decimal places after the dot). In IEEE 754 is the number of significant bits (see below: fractional bits plus the implicit 1.0).

    I made a claim that floating point isn't exact. Here's the demonstration:

    Code:
    // fptest1.c
    // Compile with: cc -o fptest1 fptest1.c
    #include <stdio.h>
    
    #define SHOW_RESULT(c) \
      printf("[%s]\n", ((c))?"yes":"no")
    
    void testfp1 ( void )
    {
      double x = 1.2;
    
      printf ( "x = 1.2 - 0.4 - 0.4 - 0.4; x = 0.0? " );
    
      x -= 0.4;
      x -= 0.4;
      x -= 0.4;
    
      /* x should be 0.0, right!? Wrong! */
      SHOW_RESULT ( x == 0.0 );
    }
    
    void testfp2 ( void )
    {
      double x;
      double y;
    
      printf ( "x = (0.1 + 0.2) + 0.3; y = 0.1 + (0.2 + 0.3); x == y ? " );
    
      x = ( 0.1 + 0.2 ) + 0.3;
      y = 0.1 + ( 0.2 + 0.3 );
    
      /* x == y, right? Wrong! */
      SHOW_RESULT ( x == y );
    }
    
    void testfp3 ( void )
    {
      double x;
      double y;
    
      printf ( "x = (0.1 * 0.2) * 0.3; y = 0.1 * (0.2 * 0.3); x == y? " );
    
      x = ( 0.1 * 0.2 ) * 0.3;
      y = 0.1 * ( 0.2 * 0.3 );
    
      /* x == y, right? Wrong! */
      SHOW_RESULT ( x == y );
    }
    
    void testfp4 ( void )
    {
      double x;
      double y;
    
      printf ( "x = (0.1 + 0.2) * 0.3; y = (0.1 * 0.3) + (0.2 * 0.3); x == y? " );
    
      x = ( 0.1 + 0.2 ) * 0.3;
      y = ( 0.1 * 0.3 ) + ( 0.2 * 0.3 );
    
      /* x == y, right? Wrong! */
      SHOW_RESULT ( x == y );
    }
    
    int main ( void )
    {
      testfp1();
      testfp2();
      testfp3();
      testfp4();
    
      return 0;
    }
    The result is "[no]" for all tests!

    Code:
    $ cc -o fptest1 fptest1.c
    $ ./fptest1
    x = 1.2 - 0.4 - 0.4 - 0.4; x = 0.0? [no]
    x = (0.1 + 0.2) + 0.3; y = 0.1 + (0.2 + 0.3); x == y ? [no]
    x = (0.1 * 0.2) * 0.3; y = 0.1 * (0.2 * 0.3); x == y? [no]
    x = (0.1 + 0.2) * 0.3; y = (0.1 * 0.3) + (0.2 * 0.3); x == y? [no]
    But why? The only values with fractional parts that can be represented exactly ends with 5, as in 0.5 or 3.1415. Integer values greater than 2⁵-1 are inexact as well because double deals only with 53 significant bits (not 64!). In the examples above, 1.2, 0.4, 0.1, 0.2 and 0.3 cannot be represented exactly, and this is easy to see:

    Code:
    // fptest2.c
    #include <stdio.h>
    
    void main( void )
    {
      double values[] = { 0.1, 0.2, 0.3, 0.4, 1.2, 0.0 };
      double *p = values;
    
      while ( *p )
        printf( "%.60f\n", *p++ );
    }
    This will give us:

    Code:
    $ cc -o fptest2 fptest2.c
    $ ./fptest2
    0.100000000000000005551115123125782702118158340454101562500000
    0.200000000000000011102230246251565404236316680908203125000000
    0.299999999999999988897769753748434595763683319091796875000000
    0.400000000000000022204460492503130808472633361816406250000000
    1.199999999999999955591079014993738383054733276367187500000000
    I'll consider only "normalized" representations of floating point values here to show you how values are stored in a floating point structure. It follows the following formula:

    Problems with floating point (I hope you like it!)-png-latex-png

    Where s, f and e are integer values. s is a single bit (1 means negative values), f is an integer of 52 bits and e is an 11 bits integer. It is clear, in a mathematical point of view, that floating point types don't deal with ℝ domain, but ℚ (rational) and, since the fraction has a power of 2 as denominator, only fractions ending with 5 can be obtained (I let the prof of that to you!). The expoent on the scale factor serves the purpose to make the binary "point" floats (hence the name)...

    Let's use 0.1 as an example. The code below shows this 3 parts of a double type:

    Code:
    // fptest3.c
    #include <stdio.h>
    #include <stdint.h>
    
    // double structure
    struct ds {
      uint64_t fraction:52;
      uint64_t expoent:11;
      uint64_t signal:1;
    } __attribute__((packed));
    
    void main( void )
    {
      double d = 0.1;
      struct ds *ds = (struct ds *)&d;
    
      printf( "s=%1$d, f=%2$lu (%2$#lx), e=%3$u\n",
        ds->signal, (uint64_t)ds->fraction, (uint32_t)ds->expoent );
    }

    Compiling and running:

    Code:
    $ ./fptest3
    s=0, f=2702159776422298 (0x999999999999a), e=1019
    $ bc <<< 'scale=60; (1 + 2702159776422298/(2^52))*2^(1019-1023)'
    .100000000000000005551115123125782702118158340454101562500000
    
    The compiler (and the processor) choose to round the numerator UP (the last hexadecimal digit is 'a', not '9') to minimize the error, getting the nearest representation as possible (rounding to nearest is the default). I used the 'bc' utility because multi precision library is used, not the tradicional IEEE 754 representation. Notice the value obtained is exactly the same we printed before...

    Since the numerator is 52 bits long, with that "implicit" 1.0 added to the fraction, the greater exact integer value, garanteed to be exact, can have no more than 53 bits:

    Code:
    // fptest4.c
    #include <stdio.h>
    #include <stdint.h>
    
    void main( void )
    {
      int64_t i = (1LL << 53)+1; // 54 bits integer.
      double d = i;
    
      printf( "%ld == %.1f ?\n", i, d );
    }
    This will print "9007199254740993 == 9007199254740992.0 ?". Obviously not!

    There are other side effects from using floating point. My point is "floating point are not a perfect solution to using bit values" and it must be used with care. A good material for reading is David Goldberg paper "What every computer scientist should know about floating point" (get the PDF here).

    []s
    Fred

  2. #2
    Registered User
    Join Date
    Feb 2019
    Posts
    498
    Another problem: If you use long double, in x86 architecture, the compiler is forced to use fp87 instructions instead of SSE. And the stack is always used for argument passing. And the structure of extended precision IEEE 754 (used by long double) isn't the same as float or double (supported by SSE):

    Code:
    long double f( long double a, long double b, long double c )
    {
      return a*b+c;
    }
    
    double g( double a, double b, double c )
    {
      return a*b+c;
    }
    Will create code like this, on x86-64:

    Code:
    ; NASM syntax...
    f:
      fld tword [rsp+8]     ; 1 cycle
      fld tword [rsp+24]    ; 1 cycle
      fmulp st1, st         ; 5 cycles
      fld tword [rsp+40]    ; 1 cycle
      faddp st1, st         ; 5 cycles
      ret
    
    g:
      mulsd xmm0, xmm1    ; 5 cycles
      addsd xmm0, xmm2    ; 3 cycles
      ret
    Not considering instruction pairing, g() is almost twice faster then f().

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    37,410
    And he had to go and spoil it by using void main
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. floating point problems
    By Bill83 in forum Windows Programming
    Replies: 4
    Last Post: 12-02-2005, 03:57 PM
  2. floating point value
    By twans in forum C++ Programming
    Replies: 9
    Last Post: 04-07-2005, 08:55 AM
  3. Floating-Point (BCD)
    By zx-1 in forum C Programming
    Replies: 1
    Last Post: 10-15-2004, 01:11 AM
  4. fixed point / floating point
    By confuted in forum Game Programming
    Replies: 4
    Last Post: 08-13-2002, 01:25 PM
  5. Floating point faster than fixed-point
    By VirtualAce in forum A Brief History of Cprogramming.com
    Replies: 5
    Last Post: 11-08-2001, 11:34 PM

Tags for this Thread