Thread: Problems with floating point (I hope you like it!)

  1. #1
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078

    Problems with floating point (I hope you like it!)

    Sometimes, when we think about big numbers or precision, we tend to reach for floating point (float, double, long double). It is tempting to think of floating point values as more precise than integers (they have a fractional part), but they are "inexact" in the sense that they are approximations. And, to make things worse, taking the double type as an example, they have fewer significant bits available than their integer counterpart (long long versus double, for instance, both 64 bits in size).

    The float type has the same layout as double, but with 24 bits of precision and 8 bits of exponent. The long double type has 64 bits of precision and 15 bits of exponent (80 bits in size on x86)... There is also a __float128 type (a GCC extension, not part of the C standard) with 113 bits of precision and 15 bits of exponent, but this type is not supported by hardware (neither the x87 coprocessor nor SSE).

    This text talks about the IEEE 754 standard, used by almost all compilers... The term "precision" has two meanings when dealing with floating point. First, for printf(), it is the count of decimal places, as in "%.2f" (2 decimal places after the dot). In IEEE 754 it is the number of significant bits (see below: the fractional bits plus the implicit 1.0).

    I made a claim that floating point isn't exact. Here's the demonstration:

    Code:
    // fptest1.c
    // Compile with: cc -o fptest1 fptest1.c
    #include <stdio.h>
    
    #define SHOW_RESULT(c) \
      printf("[%s]\n", ((c))?"yes":"no")
    
    void testfp1 ( void )
    {
      double x = 1.2;
    
      printf ( "x = 1.2 - 0.4 - 0.4 - 0.4; x = 0.0? " );
    
      x -= 0.4;
      x -= 0.4;
      x -= 0.4;
    
      /* x should be 0.0, right!? Wrong! */
      SHOW_RESULT ( x == 0.0 );
    }
    
    void testfp2 ( void )
    {
      double x;
      double y;
    
      printf ( "x = (0.1 + 0.2) + 0.3; y = 0.1 + (0.2 + 0.3); x == y ? " );
    
      x = ( 0.1 + 0.2 ) + 0.3;
      y = 0.1 + ( 0.2 + 0.3 );
    
      /* x == y, right? Wrong! */
      SHOW_RESULT ( x == y );
    }
    
    void testfp3 ( void )
    {
      double x;
      double y;
    
      printf ( "x = (0.1 * 0.2) * 0.3; y = 0.1 * (0.2 * 0.3); x == y? " );
    
      x = ( 0.1 * 0.2 ) * 0.3;
      y = 0.1 * ( 0.2 * 0.3 );
    
      /* x == y, right? Wrong! */
      SHOW_RESULT ( x == y );
    }
    
    void testfp4 ( void )
    {
      double x;
      double y;
    
      printf ( "x = (0.1 + 0.2) * 0.3; y = (0.1 * 0.3) + (0.2 * 0.3); x == y? " );
    
      x = ( 0.1 + 0.2 ) * 0.3;
      y = ( 0.1 * 0.3 ) + ( 0.2 * 0.3 );
    
      /* x == y, right? Wrong! */
      SHOW_RESULT ( x == y );
    }
    
    int main ( void )
    {
      testfp1();
      testfp2();
      testfp3();
      testfp4();
    
      return 0;
    }
    The result is "[no]" for all tests!

    Code:
    $ cc -o fptest1 fptest1.c
    $ ./fptest1
    x = 1.2 - 0.4 - 0.4 - 0.4; x = 0.0? [no]
    x = (0.1 + 0.2) + 0.3; y = 0.1 + (0.2 + 0.3); x == y ? [no]
    x = (0.1 * 0.2) * 0.3; y = 0.1 * (0.2 * 0.3); x == y? [no]
    x = (0.1 + 0.2) * 0.3; y = (0.1 * 0.3) + (0.2 * 0.3); x == y? [no]
    But why? A value with a fractional part can be represented exactly only if that fraction is a sum of negative powers of 2; in decimal, every such (non-zero, terminating) fraction ends with the digit 5, as in 0.5 or 0.375 (but not every decimal ending with 5 qualifies: 3.1415 is not exactly representable). Integer values greater than 2⁵³-1 can be inexact as well, because double keeps only 53 significant bits (not 64!). In the examples above, 1.2, 0.4, 0.1, 0.2 and 0.3 cannot be represented exactly, and this is easy to see:

    Code:
    // fptest2.c
    #include <stdio.h>
    
    void main( void )
    {
      double values[] = { 0.1, 0.2, 0.3, 0.4, 1.2, 0.0 };
      double *p = values;
    
      while ( *p )
        printf( "%.60f\n", *p++ );
    }
    This will give us:

    Code:
    $ cc -o fptest2 fptest2.c
    $ ./fptest2
    0.100000000000000005551115123125782702118158340454101562500000
    0.200000000000000011102230246251565404236316680908203125000000
    0.299999999999999988897769753748434595763683319091796875000000
    0.400000000000000022204460492503130808472633361816406250000000
    1.199999999999999955591079014993738383054733276367187500000000
    I'll consider only "normalized" representations of floating point values here to show you how values are stored in a floating point structure. They follow this formula:

    v = (-1)^s * (1 + f/2^52) * 2^(e - 1023)

    Where s, f and e are integer values: s is a single bit (1 means negative), f is a 52-bit integer and e is an 11-bit integer. It is clear, from a mathematical point of view, that floating point types don't deal with the ℝ domain, but with ℚ (rationals) and, since the fraction has a power of 2 as its denominator, only decimal fractions ending with the digit 5 can be obtained (I leave the proof of that to you!). The exponent in the scale factor serves to make the binary "point" float (hence the name)...

    Let's use 0.1 as an example. The code below shows these 3 parts of a double type:

    Code:
    // fptest3.c
    #include <stdio.h>
    #include <stdint.h>
    
    // double structure (bit fields, least significant first)
    struct ds {
      uint64_t fraction:52;
      uint64_t exponent:11;
      uint64_t sign:1;
    } __attribute__((packed));
    
    void main( void )
    {
      double d = 0.1;
      struct ds *ds = (struct ds *)&d;
    
      printf( "s=%1$d, f=%2$lu (%2$#lx), e=%3$u\n",
        (int)ds->sign, (uint64_t)ds->fraction, (uint32_t)ds->exponent );
    }

    Compiling and running:

    Code:
    $ ./fptest3
    s=0, f=2702159776422298 (0x999999999999a), e=1019
    $ bc <<< 'scale=60; (1 + 2702159776422298/(2^52))*2^(1019-1023)'
    .100000000000000005551115123125782702118158340454101562500000
    
    The compiler (and the processor) chose to round the numerator UP (the last hexadecimal digit is 'a', not '9') to minimize the error, getting the nearest representation possible (rounding to nearest is the default). I used the 'bc' utility because it uses a multiple-precision library, not the traditional IEEE 754 representation. Notice the value obtained is exactly the same one we printed before...

    Since the numerator is 52 bits long, with that "implicit" 1.0 added to the fraction, the largest integer value guaranteed to be exact can have no more than 53 bits:

    Code:
    // fptest4.c
    #include <stdio.h>
    #include <stdint.h>
    
    void main( void )
    {
      int64_t i = (1LL << 53)+1; // 2^53 + 1: needs 54 bits.
      double d = i;
    
      printf( "%ld == %.1f ?\n", i, d );
    }
    This will print "9007199254740993 == 9007199254740992.0 ?". Obviously not!

    There are other side effects of using floating point. My point is: floating point is not a perfect solution for storing big or precise values, and it must be used with care. Good material for reading is David Goldberg's paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

    []s
    Fred

  2. #2
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078
    Another problem: if you use long double on the x86 architecture, the compiler is forced to use x87 instructions instead of SSE, and the stack is always used for argument passing. Also, the structure of the extended precision IEEE 754 format (used by long double) isn't the same as float or double (which are supported by SSE):

    Code:
    long double f( long double a, long double b, long double c )
    {
      return a*b+c;
    }
    
    double g( double a, double b, double c )
    {
      return a*b+c;
    }
    Will create code like this, on x86-64:

    Code:
    ; NASM syntax...
    f:
      fld tword [rsp+8]     ; 1 cycle
      fld tword [rsp+24]    ; 1 cycle
      fmulp st1, st0        ; 5 cycles
      fld tword [rsp+40]    ; 1 cycle
      faddp st1, st0        ; 5 cycles
      ret
    
    g:
      mulsd xmm0, xmm1    ; 5 cycles
      addsd xmm0, xmm2    ; 3 cycles
      ret
    Not considering instruction pairing, g() is almost twice as fast as f().

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    And he had to go and spoil it by using void main
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
