floating point question

**Eric Cheong** · 09-09-2004

I have a floating point question. The following is the source code:

double sum, a[3];
int i;

sum = 0.0;

for (i=0; i < 3; i++)
sum += a[i];
printf("%e\n", sum);

sum = 0.0;
i = 3;
while (i--)
sum += a[i];
printf("%e\n", sum);

return 0;
}

As I understand it, floating point has a truncation problem when a value has many decimal places, so the sum should differ depending on whether the array is added from the first element to the last or from the last to the first. I have tried to find sample data that shows this, but with every data set I tried the two sums came out the same. Any suggestions are greatly appreciated.

Thanks.

Eric

**chrismiceli** · 09-09-2004

You never initialize a, you have no main function, you aren't using code tags.

**Eric Cheong** · 09-09-2004

The question is what values of a[i] make the two loops produce different results.

**itsme86** · 09-09-2004

You don't need an array. Just put sum += 1.0/3; in a loop and cycle through a half a million times or so...the sum should be different than 500000/3.
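To make that concrete, here is a minimal sketch in C. The names `sum_thirds` and `show_thirds` are made up for illustration; the idea is only to accumulate `1.0/3` repeatedly and compare against a single division. On a typical IEEE 754 implementation the two values usually differ in the last few digits, though the exact digits depend on the compiler and platform.

```c
#include <stdio.h>

/* Hypothetical helper: add 1.0/3 to a running total n times. */
double sum_thirds(unsigned long n)
{
    double sum = 0.0;
    unsigned long i;

    for (i = 0; i < n; i++)
        sum += 1.0 / 3;   /* each addition may round the running total */
    return sum;
}

/* Print the accumulated sum next to the directly computed quotient. */
void show_thirds(unsigned long n)
{
    double direct = (double)n / 3;

    printf("looped: %.17g\n", sum_thirds(n));
    printf("direct: %.17g\n", direct);
}
```

Calling `show_thirds(500000)` from `main` prints both values with 17 significant digits, so any difference in the trailing digits is visible.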

**Eric Cheong** · 09-09-2004

Originally Posted by itsme86

You don't need an array. Just put sum += 1.0/3; in a loop and cycle through a half a million times or so...the sum should be different than 500000/3.

Yup, that proves floating point has a truncation issue. But I need to show that the array's sum differs when it is added first-to-last versus last-to-first. I just want to find out what data the array should contain.

**quzah** · 09-09-2004

Well if you're trying to compare two values by going through the loop two different ways, shouldn't you be storing each of the two values in a different variable? That way you can compare them?

The issue isn't really with "truncation" per se, anyway. It's with the fact that floating point numbers, due to the way they're stored/work, can't accurately store every numeric possibility in the number of bits they're given. Perhaps you consider that truncation, but I don't. Anyway...

USE [code] TAGS WHEN YOU POST CODE!

Code:

float s1 = 0.0, s2 = 0.0;
double array[SIZE] = { 0.0 };
int x;

...fill it...

for( x = 0; x < SIZE; x++ )
{
    s1 += array[x];
    s2 += array[SIZE-(x+1)];
}

...display it...
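One way to fill in the array so that the two directions disagree is to mix one small value with two huge values that cancel. This is only a sketch, assuming IEEE 754 doubles; the function names and the data `{1.0, 1e100, -1e100}` are illustrative choices, not something from the original post.

```c
#include <stddef.h>
#include <stdio.h>

/* Add a[0], a[1], ..., a[n-1] in that order. */
double sum_forward(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Add a[n-1], ..., a[1], a[0] in that order. */
double sum_backward(const double *a, size_t n)
{
    double sum = 0.0;

    while (n--)
        sum += a[n];
    return sum;
}

/* Illustrative data: 1.0 is far below the spacing of doubles near 1e100. */
void compare_orders(void)
{
    double a[3] = { 1.0, 1e100, -1e100 };

    printf("%e\n", sum_forward(a, 3));   /* 1.0 is absorbed by 1e100 before the cancellation */
    printf("%e\n", sum_backward(a, 3));  /* the big values cancel first, so 1.0 survives */
}
```

Going forward, 1.0 + 1e100 rounds to 1e100, and adding -1e100 then leaves 0. Going backward, -1e100 + 1e100 is exactly 0, and adding 1.0 gives 1.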

Quzah.

**Dave Evans** · 09-10-2004

Originally Posted by Eric Cheong

As I understand, floating point has the truncation problem if there are a lot of decimal places. The sum will be different if it adds from the first element to the last element in the array and from the last one to the first one.

Eric

Your understanding is incomplete.

Floating point numbers are limited in the number of significant digits that they can use to represent a number (not the number of decimal places). The following discussion is based on 32-bit floats as used in Borland C, Microsoft Visual C++ and GNU gcc. If you don't get the same results, you can investigate further.

Now, a 32-bit float holds only about 7 significant decimal digits (a 24-bit significand), so most numbers that need more precision than that cannot be represented exactly as a float.

For example, if you have, say

Code:

float x;
x = 1000000000000.0; /* 1.0e12 */
printf("x = %f\n", x);

you will see something like

x = 999999995904.000000

Note that this is correct to 7 significant digits, as can be seen if you do the following:

Code:

printf("x (rounded to 7 significant digits) = %.6e\n", x);

Then you see

x (rounded to 7 significant digits) = 1.000000e+12

Now what does this have to do with correctness as a consequence of the order in which numbers are added?

Well if all of the numbers and all of the sums are exactly representable in the machine, there is no difference.

I am going to give an extreme example where we have a large number and a lot of small numbers to add.

Now, we can't expect to get exact results if the total has more significant digits than are exactly representable. The question might be, "Well, if we can't get an exact result, what's the best we can do?"

Try the following. Look at what's really happening. You will find that adding the small numbers first gives a result that is accurate to seven significant digits. Adding the small numbers last does not.
(I have included the same procedure with a double, to show the exact answer.)

The moral: If you have an array of floats, you may get better results if you sort the array by magnitude and add the smallest numbers first.
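A minimal sketch of that idea, assuming IEEE 754 single precision with no excess intermediate precision. The function names are hypothetical, the sort is done in place with the standard `qsort`, and NaN values are not handled.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator: order floats by increasing absolute value. */
static int by_magnitude(const void *p, const void *q)
{
    float x = *(const float *)p;
    float y = *(const float *)q;
    float ax = x < 0 ? -x : x;
    float ay = y < 0 ? -y : y;

    return (ax > ay) - (ax < ay);
}

/* Add the elements in the order they are stored. */
float sum_in_order(const float *a, size_t n)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Sort the array in place by magnitude, then add smallest first. */
float sum_smallest_first(float *a, size_t n)
{
    qsort(a, n, sizeof a[0], by_magnitude);
    return sum_in_order(a, n);
}
```

For example, with `1.0e8f` followed by a thousand `1.0f` values, the in-order sum stays at 100000000, because each 1.0 is less than half the spacing between adjacent floats near 1e8 (which is 8). Sorted smallest first, the ones add up to 1000 exactly before the large value is added, and the result is 100001000.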

Happy truncating:

Code:

#include <stdio.h>
int main(void)
{
  float x;
  float y;
  double z;

  double RelErrorX;
  double RelErrorY;

  unsigned i;


  x = 1.0e12; /* note truncation error for 32-bit floats */

  printf("x (before adding the little ones)        = %f\n", x);

  for (i = 0; i < 10000000; i++) {
    x += 1.0;
  }

  printf("x (after  adding the little ones)        = %f\n", x);
  printf("x after rounding to 7 significant digits = %.6e\n\n", x);

  y = 0.0;
  for (i = 0; i < 10000000; i++) {
    y += 1.0;
  }

  printf("y (before adding the big one)            = %f\n", y);

  y += 1.0e12;

  printf("y (after  adding the big one)            = %f\n", y);
  printf("y after rounding to 7 significant digits = %.6e\n\n", y);


  z = 1.0e12; /* note that z is a double, this is represented exactly */

  printf("z (before adding the little ones)        = %f\n", z);

  for (i = 0; i < 10000000; i++) {
    z += 1.0;
  }

  printf("z (after  adding the little ones)        = %f\n\n", z);


  /* printf takes a double for both %e and %f; no 'l' modifier is needed */
  RelErrorX = (z - x)/z;
  printf("The relative error of x is %13.6e (%f percent)\n",
          RelErrorX, 100.0*RelErrorX);

  RelErrorY = (z - y)/z;
  printf("The relative error of y is %13.6e (%f percent)\n",
          RelErrorY, 100.0*RelErrorY);
  return 0;
}

Regards,

Dave

**Dave Evans** · 09-10-2004

Originally Posted by quzah

The issue isn't really with "truncation" per se, anyway. It's with the fact that floating point numbers, due to the way they're stored/work, can't accurately store every numeric possibility in the number of bits they're given. Perhaps you consider that truncation, but I don't. Anyway...

Quzah.

But that's precisely what truncation error is, when applied to representation of a number with a finite number of digits:

truncation error: In the representation of a number, the error introduced when one or more digits are dropped.

The term "truncation error" when applied to an infinite series is used to designate the error introduced by stopping the summation after a finite number of terms (thus truncating the series).

In numerical analysis, they usually use "truncation error" to mean the error introduced by truncating an infinite series, and "roundoff error" to indicate errors introduced by the inexact representation of arithmetic values due to the limitations of a finite number of digits.
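The two kinds of error can be seen side by side in a small sketch. Assuming IEEE 754 doubles, the first function below cuts the series e = 1/0! + 1/1! + 1/2! + ... short, so the missing tail is truncation error in the numerical-analysis sense. The second shows roundoff: 0.1 + 0.2 does not compare equal to 0.3, because none of the three decimal values is stored exactly. Both function names are made up for this illustration.

```c
/* Partial sum of the series for e, using terms 1/0! through 1/(terms-1)!. */
double e_partial(int terms)
{
    double sum = 0.0;
    double term = 1.0;    /* 1/0! */
    int k;

    for (k = 0; k < terms; k++) {
        sum += term;
        term /= (k + 1);  /* next term: 1/(k+1)! */
    }
    return sum;
}

/* Returns 1 if 0.1 + 0.2 compares equal to 0.3 in double arithmetic. */
int tenths_add_exactly(void)
{
    double a = 0.1, b = 0.2;

    return a + b == 0.3;
}
```

With 11 terms the partial sum falls short of e by roughly 3e-8. That error comes from stopping the series and would remain even with exact arithmetic, whereas the 0.1 + 0.2 mismatch comes purely from the finite number of digits.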

Dave

**chrismiceli** · 09-10-2004

In-depth explanation (David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic"):
http://docs.sun.com/source/806-3568/ncg_goldberg.html
