extended vs double precision

**Lind** · 09-19-2007

I'm wondering how much more accurate extended precision is compared to double precision?

**matsp** · 09-19-2007

Depends on the machine you are running on - in some machines, extended precision is 80 bits (x86 [x87], for example), in other machines it may be 128 bits, or 64 bits.

If we assume it's x86, then you get 80 bits, made up of:
64 [1] bits of mantissa, which gives approximately 19 digits[2].
15 bits of exponent.
1 bit of sign.

In double precision, the number consists of:
52 [1] bits of mantissa, which gives approximately 15-16 digits[2].
11 bits of exponent.
1 bit of sign.

For completeness, single precision float consists of:
23 [1] bits of mantissa, which gives approximately 7 digits[2].
8 bits of exponent.
1 bit of sign.

Note that the above are "best case" figures. If you, for example, subtract numbers that are close to each other, the result will be filled with zeros on the right-hand side. The number of meaningful digits left depends on the number of digits lost in such a subtraction, e.g. 123456789.123456 - 123456789.00000 will lose 9 digits in the calculation. The same applies when adding large and small numbers together: 123456789.0000 + 0.123456890123456789 will mean that only some of the latter number is used, because the two numbers are first "normalized". This means that the exponent is made equal on both numbers in the addition, which shifts the low digits of the smaller number out of the mantissa. The original input number of course still retains its precision, but some of it is lost temporarily during the calculation.

This "(temporary) loss of precision" means that sometimes, you need more digits during the middle of the calculation than you do at the end.

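[1] In single and double precision, the mantissa has an implicit leading 1, which means that the number is actually 1 bit larger than the stored value [except for zero and denormalized values]. The x87 80-bit format stores that leading bit explicitly, so its 64 bits are the whole mantissa.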

[2] The number of decimal digits that can be represented by a binary sequence is "number_of_bits / log2(10)". Since log2(10) is approximately 3.32, that works out to about 19 digits for 64 bits, just under 16 for 53 bits and about 7 for 24 bits.

--
Mats

**HowardL** · 09-19-2007

It's interesting to observe the differences between floats and doubles as they go through identical operations. I see that there is a difference depending on the compiler used as well!

Code:

#include <stdio.h>
#define fnum   1.95
#define loopct 100

int main(void)
{
           int i;
   char strnum[10] = "1.95";
         float f = fnum;
        double d = fnum;
  long double ld = fnum;

  printf("\n            float accuracy test starting value: %s \n\n", strnum);
  printf("       float f:  sizeof: %2d , value: %.20f \n", (int)sizeof(f), f);
  printf("      double d:  sizeof: %2d , value: %.20f \n", (int)sizeof(d), d);
  printf("long double ld:  sizeof: %2d , value: %.30Lf \n", (int)sizeof(ld), ld);

  printf("\n      looping x 2 ,,, x to quit: \n\n");
  for(i=0; i <= loopct; i++) {
    f  =  f * 2;
    d  =  d * 2;
    ld = ld * 2;
    printf("%3d  f= %30.22f d= %30.22f ", i, f, d);
    if (getchar() == 'x') break;   /* Enter continues, x quits */
  }

  printf("       float f: value: %.20f \n", f);
  printf("      double d: value: %.20f \n", d);
  printf("long double ld: value: %.30Lf \n", ld);

  return 0;
}

First I used the mingw gcc. Output for 1.95:

Code:

            float accuracy test starting value: 1.95

       float f:  sizeof:  4 , value: 1.95000004768371580000
      double d:  sizeof:  8 , value: 1.95000000000000000000
long double ld:  sizeof: 12 , value: -567251933470801750000000000000000000000000

      looping x 2 ,,, x to quit:

  0  f=       3.9000000953674316000000 d=       3.8999999999999999000000
  1  f=       7.8000001907348633000000 d=       7.7999999999999998000000
  2  f=      15.6000003814697270000000 d=      15.6000000000000000000000
  3  f=      31.2000007629394530000000 d=      31.1999999999999990000000
  4  f=      62.4000015258789060000000 d=      62.3999999999999990000000
  5  f=     124.8000030517578100000000 d=     124.8000000000000000000000
  6  f=     249.6000061035156200000000 d=     249.5999999999999900000000
  7  f=     499.2000122070312500000000 d=     499.1999999999999900000000
  8  f=     998.4000244140625000000000 d=     998.3999999999999800000000
  9  f=    1996.8000488281250000000000 d=    1996.8000000000000000000000
 10  f=    3993.6000976562500000000000 d=    3993.5999999999999000000000
 11  f=    7987.2001953125000000000000 d=    7987.1999999999998000000000
 12  f=   15974.4003906250000000000000 d=   15974.4000000000000000000000
 13  f=   31948.8007812500000000000000 d=   31948.7999999999990000000000
 14  f=   63897.6015625000000000000000 d=   63897.5999999999990000000000
 15  f=  127795.2031250000000000000000 d=  127795.2000000000000000000000
 16  f=  255590.4062500000000000000000 d=  255590.3999999999900000000000
 17  f=  511180.8125000000000000000000 d=  511180.7999999999900000000000
 18  f= 1022361.6250000000000000000000 d= 1022361.6000000000000000000000
 19  f= 2044723.2500000000000000000000 d= 2044723.2000000000000000000000

I note that %Lf is not working right here.

And this is using the Digital Mars dmc. Output for 1.95:

Code:

            float accuracy test starting value: 1.95

       float f:  sizeof:  4 , value: 1.95000004768371582030
      double d:  sizeof:  8 , value: 1.94999999999999995550
long double ld:  sizeof: 10 , value: 1.949999999999999955500000000000

      looping x 2 ,,, x to quit:

  0  f=       3.9000000953674316406000 d=       3.8999999999999999111000
  1  f=       7.8000001907348632812000 d=       7.7999999999999998223000
  2  f=      15.6000003814697265620000 d=      15.5999999999999996440000
  3  f=      31.2000007629394531250000 d=      31.1999999999999992890000
  4  f=      62.4000015258789062500000 d=      62.3999999999999985790000
  5  f=     124.8000030517578124900000 d=     124.7999999999999971500000
  6  f=     249.6000061035156249900000 d=     249.5999999999999943100000
  7  f=     499.2000122070312499900000 d=     499.1999999999999886200000
  8  f=     998.4000244140624999900000 d=     998.3999999999999772400000
  9  f=    1996.8000488281250000000000 d=    1996.7999999999999544000000
 10  f=    3993.6000976562500001000000 d=    3993.5999999999999090000000
 11  f=    7987.2001953125000002000000 d=    7987.1999999999998181000000
 12  f=   15974.4003906250000000000000 d=   15974.3999999999996360000000
 13  f=   31948.8007812500000010000000 d=   31948.7999999999992710000000
 14  f=   63897.6015625000000020000000 d=   63897.5999999999985430000000
 15  f=  127795.2031250000000000000000 d=  127795.1999999999970900000000
 16  f=  255590.4062500000000100000000 d=  255590.3999999999941800000000
 17  f=  511180.8125000000000300000000 d=  511180.7999999999883600000000
 18  f= 1022361.6249999999999000000000 d= 1022361.5999999999767000000000
 19  f= 2044723.2499999999999000000000 d= 2044723.1999999999534000000000
 20  f= 4089446.4999999999999000000000 d= 4089446.3999999999068000000000

My calculator gets 2044723.20 at iteration 19... I 'guess' that's correct...
