extended vs double precision

  1. #1
    Registered User
    Join Date
    Sep 2007
    Posts
    15

    extended vs double precision

    I'm wondering how much more accurate extended precision is compared to double precision?

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    It depends on the machine you are running on: on some machines extended precision is 80 bits (x86 [x87], for example), on others it may be 128 bits, or just 64 bits (the same as double).

    If we assume it's x86 (x87), then extended precision is 80 bits, made up of:
    64 bits of mantissa [1], which gives approximately 19 digits [2].
    15 bits of exponent.
    1 bit of sign.

    In double precision, the number consists of:
    52 bits of mantissa [1] (53 counting the implicit bit), which gives approximately 15-16 digits [2].
    11 bits of exponent.
    1 bit of sign.

    For completeness, single precision float consists of:
    23 bits of mantissa [1] (24 counting the implicit bit), which gives approximately 7 digits [2].
    8 bits of exponent.
    1 bit of sign.

    Note that the above are "best case" figures. If, for example, you subtract numbers that are close to each other, the leading digits cancel and the result is filled with zeros on the right-hand side. The number of meaningful digits left depends on the number of digits lost in such a subtraction, e.g. 123456789.123456 - 123456789.000000 will lose 9 digits in the calculation.

    The same applies when adding large and small numbers together. 123456789.0000 + 0.123456890123456789 will mean that only some of the latter number is used, because the two numbers are first aligned so that the exponent is equal on both numbers in the addition, and the low-order digits of the smaller number are shifted out. The original input numbers of course still retain their precision, but some of it is lost during the calculation.

    This "(temporary) loss of precision" means that sometimes, you need more digits during the middle of the calculation than you do at the end.


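    [1] In single and double precision the mantissa has an implicit leading 1, which means that the effective mantissa is 1 bit larger than the stored value [except for zero and denormal numbers]. The x87 80-bit format stores that leading bit explicitly, so its 64 bits already include it.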

    [2] The number of decimal digits that can be represented by a binary mantissa is "number_of_bits / log2(10)". Since log2(10) is approximately 3.32, that gives about 19 digits for 64 bits, about 16 for 53 bits and about 7 for 24 bits.


    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Registered User
    Join Date
    Aug 2007
    Location
    MD, USA
    Posts
    71
    It's interesting to observe the differences between floats and doubles as they go through identical operations. I see that there is a difference depending on the compiler used as well!
    Code:
    #include <stdio.h>
    #define fnum   1.95
    #define loopct 100

    int main(void)
    {
      int i, c;
      char strnum[10] = "1.95";
      float f = fnum;
      double d = fnum;
      /* fnum is a double constant, so ld starts out with only double precision */
      long double ld = fnum;

      printf("\n            float accuracy test starting value: %s \n\n", strnum);
      /* sizeof yields a size_t, so cast it to match %d */
      printf("       float f:  sizeof: %2d , value: %.20f \n", (int)sizeof(f), f);
      printf("      double d:  sizeof: %2d , value: %.20f \n", (int)sizeof(d), d);
      printf("long double ld:  sizeof: %2d , value: %.30Lf \n", (int)sizeof(ld), ld);

      printf("\n      looping x 2 ,,, x to quit: \n\n");
      for(i=0; i <= loopct; i++) {
        f  =  f * 2;
        d  =  d * 2;
        ld = ld * 2;
        printf("%3d  f= %30.22f d= %30.22f ", i, f, d);
        /* wait for Enter; stop early on 'x' or end of input */
        c = getchar();
        if (c == 'x' || c == EOF)
          break;
      }

      printf("       float f: value: %.20f \n", f);
      printf("      double d: value: %.20f \n", d);
      printf("long double ld: value: %.30Lf \n", ld);

      return 0;
    }
    First I used the mingw gcc. Output for 1.95:
    Code:
                float accuracy test starting value: 1.95
    
           float f:  sizeof:  4 , value: 1.95000004768371580000
          double d:  sizeof:  8 , value: 1.95000000000000000000
    long double ld:  sizeof: 12 , value: -567251933470801750000000000000000000000000
    
          looping x 2 ,,, x to quit:
    
      0  f=       3.9000000953674316000000 d=       3.8999999999999999000000
      1  f=       7.8000001907348633000000 d=       7.7999999999999998000000
      2  f=      15.6000003814697270000000 d=      15.6000000000000000000000
      3  f=      31.2000007629394530000000 d=      31.1999999999999990000000
      4  f=      62.4000015258789060000000 d=      62.3999999999999990000000
      5  f=     124.8000030517578100000000 d=     124.8000000000000000000000
      6  f=     249.6000061035156200000000 d=     249.5999999999999900000000
      7  f=     499.2000122070312500000000 d=     499.1999999999999900000000
      8  f=     998.4000244140625000000000 d=     998.3999999999999800000000
      9  f=    1996.8000488281250000000000 d=    1996.8000000000000000000000
     10  f=    3993.6000976562500000000000 d=    3993.5999999999999000000000
     11  f=    7987.2001953125000000000000 d=    7987.1999999999998000000000
     12  f=   15974.4003906250000000000000 d=   15974.4000000000000000000000
     13  f=   31948.8007812500000000000000 d=   31948.7999999999990000000000
     14  f=   63897.6015625000000000000000 d=   63897.5999999999990000000000
     15  f=  127795.2031250000000000000000 d=  127795.2000000000000000000000
     16  f=  255590.4062500000000000000000 d=  255590.3999999999900000000000
     17  f=  511180.8125000000000000000000 d=  511180.7999999999900000000000
     18  f= 1022361.6250000000000000000000 d= 1022361.6000000000000000000000
     19  f= 2044723.2500000000000000000000 d= 2044723.2000000000000000000000
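    I note that %Lf is not working right here. (MinGW's gcc passes an 80-bit long double, but the Microsoft C runtime printf it links against by default treats long double as a 64-bit double, which would explain the garbage value.)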

    And this is using the Digital Mars dmc. Output for 1.95:
    Code:
                float accuracy test starting value: 1.95
    
           float f:  sizeof:  4 , value: 1.95000004768371582030
          double d:  sizeof:  8 , value: 1.94999999999999995550
    long double ld:  sizeof: 10 , value: 1.949999999999999955500000000000
    
          looping x 2 ,,, x to quit:
    
      0  f=       3.9000000953674316406000 d=       3.8999999999999999111000
      1  f=       7.8000001907348632812000 d=       7.7999999999999998223000
      2  f=      15.6000003814697265620000 d=      15.5999999999999996440000
      3  f=      31.2000007629394531250000 d=      31.1999999999999992890000
      4  f=      62.4000015258789062500000 d=      62.3999999999999985790000
      5  f=     124.8000030517578124900000 d=     124.7999999999999971500000
      6  f=     249.6000061035156249900000 d=     249.5999999999999943100000
      7  f=     499.2000122070312499900000 d=     499.1999999999999886200000
      8  f=     998.4000244140624999900000 d=     998.3999999999999772400000
      9  f=    1996.8000488281250000000000 d=    1996.7999999999999544000000
     10  f=    3993.6000976562500001000000 d=    3993.5999999999999090000000
     11  f=    7987.2001953125000002000000 d=    7987.1999999999998181000000
     12  f=   15974.4003906250000000000000 d=   15974.3999999999996360000000
     13  f=   31948.8007812500000010000000 d=   31948.7999999999992710000000
     14  f=   63897.6015625000000020000000 d=   63897.5999999999985430000000
     15  f=  127795.2031250000000000000000 d=  127795.1999999999970900000000
     16  f=  255590.4062500000000100000000 d=  255590.3999999999941800000000
     17  f=  511180.8125000000000300000000 d=  511180.7999999999883600000000
     18  f= 1022361.6249999999999000000000 d= 1022361.5999999999767000000000
     19  f= 2044723.2499999999999000000000 d= 2044723.1999999999534000000000
     20  f= 4089446.4999999999999000000000 d= 4089446.3999999999068000000000
    My calculator gets 2044723.20 at iteration 19... I 'guess' that's correct...
    Last edited by HowardL; 09-19-2007 at 03:07 PM.
