Originally Posted by

**Eric Cheong**
As I understand, floating point has the truncation problem if there are a lot of decimal places. The sum will be different if it adds from the first element to the last element in the array and from the last one to the first one.

Eric

Your understanding is incomplete.

Floating point numbers are limited in the number of significant digits that they can use to represent a number (not the number of decimal places). The following discussion is based on 32-bit floats used in Borland C, Microsoft Visual C++ and Gnu gcc. If you don't get the same results, you can investigate further.

Now, if a number has more than 7 or 8 significant digits, it will not be representable exactly as a float.

for example if you have, say

Code:

float x;
x = 1000000000000.0; /* 1.0e12 */
printf("x = %f\n", x);

you will see something like

(note that this is correct to 7 significant digits,as can be seen if you do the following:

Code:

printf("x (rounded to 7 significant digits) = %.6e\n", x);

Then you see

x (rounded to 7 significant digits) = 1.000000e+12

Now what does this have to do with correctness as a consequence of the order in which numbers are added?

Well if all of the numbers and all of the sums are exactly representable in the machine, there is no difference.

I am going to give an extreme example where we have a large number and a lot of small numbers to add.

Now, we can't expect to get exact results if the total is has more significant digits than is exactly representable. The question might be, "Well, if we can't get an exact result, what's the best we can do?"

Try the following. Look at what's really happening. You will find that adding the small numbers first gives a result that is accurate to seven significant digits. Adding the small numbers last does not.

(I have included the same procedure with a double, to show the exact answer.)

The moral: If you have an array of floats, you *may* get better results if you sort the array and add the small numbers first.

Happy truncating:

Code:

#include <stdio.h>
int main()
{
float x;
float y;
double z;
double RelErrorX;
double RelErrorY;
unsigned i;
x = 1.0e12; /* note truncation error for 32-bit floats */
printf("x (before adding the little ones) = %f\n", x);
for (i = 0; i < 10000000; i++) {
x += 1.0;
}
printf("x (after adding the little ones) = %f\n", x);
printf("x after rounding to 7 significant digits = %.6e\n\n", x);
y = 0.0;
for (i = 0; i < 10000000; i++) {
y += 1.0;
}
printf("y (before adding the big one) = %f\n", y);
y += 1.0e12;
printf("y (after adding the big one) = %f\n", y);
printf("y after rounding to 7 significant digits = %.6e\n\n", y);
z = 1.0e12; /* note that z is a double, this is represented exactly */
printf("z (before adding the little ones) = %f\n", z);
for (i = 0; i < 10000000.0; i++) {
z += 1.0;
}
printf("z (after adding the little ones) = %f\n\n", z);
RelErrorX = (z - x)/z;
printf("The relative error of x is %13.6le ( %lf percent)\n",
RelErrorX, 100.0*RelErrorX);
RelErrorY = (z - y)/z;
printf("The relative error of y is %13.6le (%lf percent)\n",
RelErrorY, 100.0*RelErrorY);
return 0;
}

Regards,

Dave