Sometimes, when we think about big numbers or precision, we tend to think about floating point (float, double, long double). It's tempting to think of floating point values as more precise than integers (they have a fractional part), but they are "inexact" (in the sense that they are approximations!). And, to make things worse, taking the double type as an example, it has fewer significant bits available than its integer counterpart (long long versus double, for instance, both 64 bits in size).

The float type works the same way as double, but has 24 bits of precision and 8 bits of exponent. The long double has 64 bits of precision and 15 bits of exponent (80 bits in size)... There is also a __float128 type (113 bits of precision, 15 bits of exponent), offered by compilers like GCC as an extension (not part of C99), but this type is not supported by hardware (neither the x87 co-processor nor SSE).

This text talks about the IEEE 754 standard, used by almost all compilers... The term "precision" has two meanings when dealing with floating point. First, for printf(), it is the count of decimal places, like in "%.2f" (2 decimal places after the dot). In IEEE 754 it is the number of significant bits (see below: the fractional bits plus the implicit 1.0).

I made a claim that floating point isn't exact. Here's the demonstration:

```c
// fptest1.c
// Compile with: cc -o fptest1 fptest1.c
#include <stdio.h>

#define SHOW_RESULT(c) \
  printf("[%s]\n", (c) ? "yes" : "no")

void testfp1(void)
{
  double x = 1.2;

  printf("x = 1.2 - 0.4 - 0.4 - 0.4; x = 0.0? ");
  x -= 0.4; x -= 0.4; x -= 0.4;

  /* x should be 0.0, right!? Wrong! */
  SHOW_RESULT(x == 0.0);
}

void testfp2(void)
{
  double x, y;

  printf("x = (0.1 + 0.2) + 0.3; y = 0.1 + (0.2 + 0.3); x == y? ");
  x = (0.1 + 0.2) + 0.3;
  y = 0.1 + (0.2 + 0.3);

  /* x == y, right? Wrong! */
  SHOW_RESULT(x == y);
}

void testfp3(void)
{
  double x, y;

  printf("x = (0.1 * 0.2) * 0.3; y = 0.1 * (0.2 * 0.3); x == y? ");
  x = (0.1 * 0.2) * 0.3;
  y = 0.1 * (0.2 * 0.3);

  /* x == y, right? Wrong! */
  SHOW_RESULT(x == y);
}

void testfp4(void)
{
  double x, y;

  printf("x = (0.1 + 0.2) * 0.3; y = (0.1 * 0.3) + (0.2 * 0.3); x == y? ");
  x = (0.1 + 0.2) * 0.3;
  y = (0.1 * 0.3) + (0.2 * 0.3);

  /* x == y, right? Wrong! */
  SHOW_RESULT(x == y);
}

int main(void)
{
  testfp1();
  testfp2();
  testfp3();
  testfp4();
  return 0;
}
```

The result is "[no]" for all tests!

Compiling and running:

```
$ cc -o fptest1 fptest1.c
$ ./fptest1
x = 1.2 - 0.4 - 0.4 - 0.4; x = 0.0? [no]
x = (0.1 + 0.2) + 0.3; y = 0.1 + (0.2 + 0.3); x == y? [no]
x = (0.1 * 0.2) * 0.3; y = 0.1 * (0.2 * 0.3); x == y? [no]
x = (0.1 + 0.2) * 0.3; y = (0.1 * 0.3) + (0.2 * 0.3); x == y? [no]
```

But why? The only values with fractional parts that can be represented exactly are those whose decimal expansion ends with the digit 5, as in 0.5 or 0.8125 (and ending in 5 is necessary, not sufficient: 3.1415 cannot be represented exactly either). Integer values greater than 2⁵³-1 can be inexact as well, because double deals with only 53 significant bits (not 64!). In the examples above, 1.2, 0.4, 0.1, 0.2 and 0.3 cannot be represented exactly, and this is easy to see:

```c
// fptest2.c
#include <stdio.h>

int main(void)
{
  double values[] = { 0.1, 0.2, 0.3, 0.4, 1.2, 0.0 };
  double *p = values;

  while (*p)
    printf("%.60f\n", *p++);

  return 0;
}
```

This will give us:

```
$ cc -o fptest2 fptest2.c
$ ./fptest2
0.100000000000000005551115123125782702118158340454101562500000
0.200000000000000011102230246251565404236316680908203125000000
0.299999999999999988897769753748434595763683319091796875000000
0.400000000000000022204460492503130808472633361816406250000000
1.199999999999999955591079014993738383054733276367187500000000
```

I'll consider only "normalized" representations of floating point values here to show you how values are stored in a floating point structure. A normalized double follows this formula:

    v = (-1)^s * (1 + f/2^52) * 2^(e - 1023)

Where s, f and e are integer values: s is a single bit (1 means a negative value), f is a 52-bit integer and e is an 11-bit integer. It is clear, from a mathematical point of view, that floating point types don't deal with the ℝ domain, but with a subset of ℚ (the rationals) and, since the fraction has a power of 2 as denominator, only decimal fractions ending with the digit 5 can be obtained (I leave the proof of that to you!). The exponent in the scale factor serves to make the binary "point" float (hence the name)...

Let's use 0.1 as an example. The code below shows these 3 parts of a double type:

```c
// fptest3.c
#include <stdio.h>
#include <stdint.h>

// double structure (assumes the little endian bit layout
// used by GCC on x86).
struct ds {
  uint64_t fraction:52;
  uint64_t exponent:11;
  uint64_t sign:1;
} __attribute__((packed));

int main(void)
{
  double d = 0.1;
  struct ds *ds = (struct ds *)&d;

  printf("s=%1$d, f=%2$lu (%2$#lx), e=%3$u\n",
         (int)ds->sign, (uint64_t)ds->fraction, (uint32_t)ds->exponent);

  return 0;
}
```

Compiling and running:

```
$ ./fptest3
s=0, f=2702159776422298 (0x999999999999a), e=1019

$ bc <<< 'scale=60; (1 + 2702159776422298/(2^52))*2^(1019-1023)'
.100000000000000005551115123125782702118158340454101562500000
```

The compiler (and the processor) chose to round the numerator UP (the last hexadecimal digit is 'a', not '9') to minimize the error, getting the nearest representation possible (rounding to nearest is the default mode). I used the 'bc' utility because it works with a multiple-precision library, not the traditional IEEE 754 representation. Notice the value obtained is exactly the same one we printed before...

Since the fraction is 52 bits long, with that "implicit" 1.0 added to it, the greatest integer value guaranteed to be exact can have no more than 53 bits:

```c
// fptest4.c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
  int64_t i = (1LL << 53) + 1;  // a 54-bit integer.
  double d = i;                 // rounds: doesn't fit in 53 bits.

  printf("%ld == %.1f ?\n", i, d);

  return 0;
}
```

This will print "9007199254740993 == 9007199254740992.0 ?". Obviously not!

There are other side effects of using floating point. My point is that floating point is not a perfect solution for dealing with big values, and it must be used with care. A good reading on the subject is David Goldberg's paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

[]s

Fred