Thread: floating point question

  1. #1
    Registered User
    Join Date: Sep 2004
    Posts: 3

    floating point question

    I have a floating point question. The following is the source code:

    double sum, a[3];
    int i;

    sum = 0.0;

    for (i=0; i < 3; i++)
    sum += a[i];
    printf("%e\n", sum);

    sum = 0.0;
    i = 3;
    while (i--)
    sum += a[i];
    printf("%e\n", sum);

    return 0;
    }

    As I understand, floating point has the truncation problem if there are a lot of decimal places. The sum will be different if it adds from the first element to the last element in the array and from the last one to the first one. I tried to find sample data that shows this, but unfortunately none of the data I have tried produces a difference. Any suggestions are greatly appreciated.

    Thanks.

    Eric

  2. #2
    chrismiceli
    Obsessed with C
    Join Date: Jan 2003
    Posts: 501
    You never initialize a, you have no main function, you aren't using code tags.
    Last edited by chrismiceli; 09-09-2004 at 09:51 PM.
    Help populate a c/c++ help irc channel
    server: irc://irc.efnet.net
    channel: #c

  3. #3
    Registered User
    Join Date: Sep 2004
    Posts: 3
    The question is what is the value for a[i] that can make the result different from the two loops.

  4. #4
    Gawking at stupidity
    Join Date: Jul 2004
    Location: Oregon, USA
    Posts: 3,218
    You don't need an array. Just put sum += 1.0/3; in a loop and cycle through a half a million times or so...the sum should be different than 500000/3.
    If you understand what you're doing, you're not learning anything.
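
    A minimal sketch of the loop suggested above, assuming IEEE 754 doubles. The function names are made up for illustration. Note that writing 500000/3 literally in C is integer division and gives 166666, so the comparison value is computed in double here. The difference is expected to be tiny, and on most systems nonzero.

```c
#include <assert.h>
#include <stdio.h>

/* Add 1.0/3 to a running double total n times and return the total. */
double accumulate_thirds(long n)
{
    double sum = 0.0;
    long i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        sum += 1.0 / 3;
    return sum;
}

/* Print the accumulated total next to the directly computed quotient. */
void show_accumulated_error(long n)
{
    double looped = accumulate_thirds(n);
    double direct = (double)n / 3;   /* an integer n / 3 would truncate */

    printf("looped = %.17g\n", looped);
    printf("direct = %.17g\n", direct);
    printf("diff   = %.3e\n", looped - direct);
}
```

    Calling show_accumulated_error(500000) prints both values with 17 significant digits, which is enough to see any difference in the last few digits.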

  5. #5
    Registered User
    Join Date: Sep 2004
    Posts: 3
    Quote Originally Posted by itsme86
    You don't need an array. Just put sum += 1.0/3; in a loop and cycle through a half a million times or so...the sum should be different than 500000/3.

    Yup, that proves floating point has a truncation issue. But I need to show that the array gives a different sum when it is added from the first element to the last than from the last to the first. I just want to find out what data the array should contain.
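
    One possible answer to the question, offered as a sketch rather than the only choice: pick one small value and two huge values that cancel. Assuming IEEE 754 doubles and no reordering by the compiler (for example, no -ffast-math), the 1.0 is absorbed when it is added to 1.0e100 first, but survives when the two huge values cancel first. The function names and the sample array are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Sum a[0] .. a[n-1] from the first element to the last. */
double sum_forward(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(a != NULL);
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Sum the same elements from the last one to the first. */
double sum_backward(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i = n;

    assert(a != NULL);
    while (i--)
        sum += a[i];
    return sum;
}

/* Hypothetical sample data: one small value, two huge values that cancel. */
static const double sample[3] = { 1.0, 1.0e100, -1.0e100 };

/* Print both sums in the same format as the original program. */
void show_order_dependence(void)
{
    printf("%e\n", sum_forward(sample, 3));
    printf("%e\n", sum_backward(sample, 3));
}
```

    With this data, show_order_dependence() prints 0.000000e+00 for the first-to-last sum and 1.000000e+00 for the last-to-first sum.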

  6. #6
    quzah
    ATH0
    Join Date: Oct 2001
    Posts: 14,826
    Well if you're trying to compare two values by going through the loop two different ways, shouldn't you be storing each of the two values in a different variable? That way you can compare them?

    The issue isn't really with "truncation" per se, anyway. It's with the fact that floating point numbers, due to the way they're stored/work, can't accurately store every numeric possibility in the number of bits they're given. Perhaps you consider that truncation, but I don't. Anyway...

    USE [code] TAGS WHEN YOU POST CODE!
    Code:
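    float s1 = 0.0, s2 = 0.0;        /* one accumulator per summation order */
    double array[SIZE] = { 0.0 };    /* SIZE is assumed to be defined elsewhere */
    int x;
    
    /* ...fill it... */
    
    for( x = 0; x < SIZE; x++ )
    {
        s1 += array[x];              /* first element to last */
        s2 += array[SIZE-(x+1)];     /* last element to first */
    }
    
    /* ...display it... */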
    Quzah.
    Hope is the first step on the road to disappointment.

  7. #7
    Registered User
    Join Date: Mar 2004
    Posts: 536
    Quote Originally Posted by Eric Cheong
    As I understand, floating point has the truncation problem if there are a lot of decimal places. The sum will be different if it adds from the first element to the last element in the array and from the last one to the first one.

    Eric
    Your understanding is incomplete.

    Floating point numbers are limited in the number of significant digits that they can use to represent a number (not the number of decimal places). The following discussion is based on 32-bit floats used in Borland C, Microsoft Visual C++ and Gnu gcc. If you don't get the same results, you can investigate further.

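    Now, a 32-bit float has a 24-bit significand, which is roughly 7 significant decimal digits, so a number that needs more significant digits than that will generally not be representable exactly as a float.

    For example, if you have, say,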

    Code:
    float x;
    x = 1000000000000.0; /* 1.0e12 */
    printf("x = %f\n", x);
    you will see something like

    x = 999999995904.000000
    Note that this is correct to 7 significant digits, as can be seen if you do the following:

    Code:
    printf("x (rounded to 7 significant digits) = %.6e\n", x);
    Then you see

    x (rounded to 7 significant digits) = 1.000000e+12
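
    To make the limit on significant digits concrete, here is a small sketch using the constants from <float.h>. It assumes IEEE 754 single precision and that float expressions are evaluated in float; the function names are invented for this example. For a normal float x, the gap to the next larger float is at most x * FLT_EPSILON, so near 1.0e12 an increment of 1.0 is far too small to register.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Return x + 1.0f rounded back to float. */
float add_one(float x)
{
    float r = x + 1.0f;
    return r;
}

/* Upper bound on the gap between a positive normal float x and the next one. */
float gap_bound(float x)
{
    assert(x > 0.0f);
    return x * FLT_EPSILON;
}

/* Print the float format parameters and the effect of adding 1 to 1.0e12f. */
void show_float_limits(void)
{
    float big = 1.0e12f;

    printf("FLT_MANT_DIG = %d bits\n", FLT_MANT_DIG);
    printf("FLT_EPSILON  = %e\n", FLT_EPSILON);
    printf("gap near %.0f is at most %.0f\n", big, gap_bound(big));
    printf("big + 1 == big ? %s\n", add_one(big) == big ? "yes" : "no");
}
```

    By contrast, add_one(1.0e7f) does change the value, because integers up to 2^24 = 16777216 are exactly representable in a 24-bit significand.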
    Now what does this have to do with correctness as a consequence of the order in which numbers are added?

    Well if all of the numbers and all of the sums are exactly representable in the machine, there is no difference.
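
    A quick way to check this claim, sketched under the assumption of IEEE 754 doubles evaluated in double precision. The helper name and both arrays are hypothetical. The values 0.5, 0.25 and 0.125 and all their partial sums are exact binary fractions, while 0.1, 0.2 and 0.3 are not.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 when summing a[0..n-1] forward and backward gives the same double. */
int same_in_both_orders(const double *a, int n)
{
    double fwd = 0.0, bwd = 0.0;
    int i;

    assert(a != NULL && n >= 0);
    for (i = 0; i < n; i++) {
        fwd += a[i];            /* first element to last */
        bwd += a[n - 1 - i];    /* last element to first */
    }
    return fwd == bwd;
}

/* Every value and every partial sum here is exactly representable. */
static const double exact_values[3] = { 0.5, 0.25, 0.125 };

/* None of these decimal fractions is exactly representable in binary. */
static const double inexact_values[3] = { 0.1, 0.2, 0.3 };
```

    same_in_both_orders(exact_values, 3) returns 1, while same_in_both_orders(inexact_values, 3) returns 0 because the two orders round differently in the last bit.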

    I am going to give an extreme example where we have a large number and a lot of small numbers to add.

    Now, we can't expect to get exact results if the total has more significant digits than are exactly representable. The question might be, "Well, if we can't get an exact result, what's the best we can do?"

    Try the following. Look at what's really happening. You will find that adding the small numbers first gives a result that is accurate to seven significant digits. Adding the small numbers last does not.
    (I have included the same procedure with a double, to show the exact answer.)


    The moral: If you have an array of floats, you may get better results if you sort the array and add the small numbers first.

    Happy truncating:

    Code:
    #include <stdio.h>
    int main(void)
    {
      float x;
      float y;
      double z;
    
      double RelErrorX;
      double RelErrorY;
    
      unsigned i;
    
    
      x = 1.0e12; /* note truncation error for 32-bit floats */
    
      printf("x (before adding the little ones)        = %f\n", x);
    
      for (i = 0; i < 10000000; i++) {
        x += 1.0;
      }
    
      printf("x (after  adding the little ones)        = %f\n", x);
      printf("x after rounding to 7 significant digits = %.6e\n\n", x);
    
      y = 0.0;
      for (i = 0; i < 10000000; i++) {
        y += 1.0;
      }
    
      printf("y (before adding the big one)            = %f\n", y);
    
      y += 1.0e12;
    
      printf("y (after  adding the big one)            = %f\n", y);
      printf("y after rounding to 7 significant digits = %.6e\n\n", y);
    
    
      z = 1.0e12; /* note that z is a double, this is represented exactly */
    
      printf("z (before adding the little ones)        = %f\n", z);
    
      for (i = 0; i < 10000000; i++) {
        z += 1.0;
      }
    
      printf("z (after  adding the little ones)        = %f\n\n", z);
    
    
      RelErrorX = (z - x)/z;
      printf("The relative error of x is %13.6e (%f percent)\n",
              RelErrorX, 100.0*RelErrorX);
    
      RelErrorY = (z - y)/z;
      printf("The relative error of y is %13.6e (%f percent)\n",
              RelErrorY, 100.0*RelErrorY);
      return 0;
    }
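
    The moral above (sort, then add the small numbers first) could be sketched like this. It is only one way to do it, assuming IEEE 754 single precision; the function names are made up, and sum_smallest_first() sorts the caller's array in place to keep the example short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort comparator: orders floats by increasing magnitude. */
static int by_magnitude(const void *pa, const void *pb)
{
    float a = *(const float *)pa;
    float b = *(const float *)pb;

    if (a < 0.0f) a = -a;
    if (b < 0.0f) b = -b;
    return (a > b) - (a < b);
}

/* Sum the array in the order given. */
float sum_in_order(const float *a, size_t n)
{
    float sum = 0.0f;
    size_t i;

    assert(a != NULL);
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Sort the array in place by magnitude, then sum it smallest first. */
float sum_smallest_first(float *a, size_t n)
{
    assert(a != NULL);
    qsort(a, n, sizeof a[0], by_magnitude);
    return sum_in_order(a, n);
}

/* Demonstration with one large value followed by four small ones. */
void show_sorted_sum(void)
{
    float data[5] = { 16777216.0f, 1.0f, 1.0f, 1.0f, 1.0f };

    printf("as given:       %.1f\n", sum_in_order(data, 5));
    printf("smallest first: %.1f\n", sum_smallest_first(data, 5));
}
```

    Here 16777216 is 2^24, where adjacent floats are 2 apart. Added one at a time after the big value, each 1.0 is rounded away and the total stays at 16777216.0; added together first, the four ones contribute their full 4 and the total is 16777220.0.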

    Regards,

    Dave
    Last edited by Dave Evans; 09-10-2004 at 09:06 AM.

  8. #8
    Registered User
    Join Date: Mar 2004
    Posts: 536
    Quote Originally Posted by quzah
    The issue isn't really with "truncation" per se, anyway. It's with the fact that floating point numbers, due to the way they're stored/work, can't accurately store every numeric possibility in the number of bits they're given. Perhaps you consider that truncation, but I don't. Anyway...



    Quzah.
    But that's precisely what truncation error is, when applied to representation of a number with a finite number of digits:

    truncation error: In the representation of a number, the error introduced when one or more digits are dropped.
    The term "truncation error" when applied to an infinite series is used to designate the error introduced by stopping the summation after a finite number of terms (thus truncating the series).


    In numerical analysis, they usually use "truncation error" to mean the error introduced by truncating an infinite series, and "roundoff error" to indicate errors introduced by the inexact representation of arithmetic values due to the limitations of a finite number of digits.
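
    The two kinds of error can be seen side by side in a short sketch, assuming IEEE 754 doubles. The names below are invented for the example, and the reference value of e is typed in by hand. Stopping the series 1/0! + 1/1! + 1/2! + ... after a few terms gives a truncation error in the numerical-analysis sense; the failed comparison of 0.1 + 0.2 with 0.3 is a roundoff error.

```c
#include <assert.h>
#include <stdio.h>

/* Reference value of e, entered by hand for this sketch. */
#define E_REFERENCE 2.718281828459045

/* Partial sum of the series for e, stopping after `terms` terms. */
double e_partial_sum(int terms)
{
    double sum = 0.0, term = 1.0;
    int k;

    assert(terms >= 0);
    for (k = 0; k < terms; k++) {
        sum += term;        /* add 1/k! */
        term /= (k + 1);    /* next term is 1/(k+1)! */
    }
    return sum;
}

/* Return 1 when 0.1 + 0.2 compares equal to 0.3 in double arithmetic. */
int tenth_plus_fifth_equals_three_tenths(void)
{
    double a = 0.1, b = 0.2, c = 0.3;

    return a + b == c;
}

/* Print the truncation error for a few series lengths. */
void show_truncation_error(void)
{
    int terms;

    for (terms = 5; terms <= 15; terms += 5)
        printf("%2d terms: error = %e\n",
               terms, E_REFERENCE - e_partial_sum(terms));
}
```

    With 5 terms the truncation error is a little under 0.01, and it shrinks quickly as more terms are added. The roundoff check returns 0 no matter how the series is handled, because it comes from the representation of 0.1, 0.2 and 0.3 themselves.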

    Dave
    Last edited by Dave Evans; 09-10-2004 at 10:07 AM.

    Last Post: 04-10-2002, 04:44 AM