# Thread: avoiding numeric/truncation errors

1. What is the best way to avoid truncation and rounding errors? I'm attempting to implement some math subprograms like finding the volume of a sphere, which uses the formula (4/3)PI(Rcubed), but the (4/3) is evaluated as 1 rather than the 1.33 I need, destroying my formula. Any advice? Thanks

Another part of the program takes an array of float values, sorts them into ascending order, and displays them along with their sum. The same truncation/rounding error causes me problems here too, because many of the float values have fractional parts that are not being represented on output.

2. If you do the calculations with a floating type, they'll be (more) precise. 4.0/3, for example, will give a value around 1.3333, as a double.

I'm not sure what you mean by decimal values "not being represented on output", but floating point is always going to be somewhat imprecise. You can use double, or long double, to try for more precision; but if values are wildly off, you're doing something wrong. Adding 1.4 and 1.4 should give a value close to 2.8, for example. If you get something like 2.799999999999999 that's fine; but if you get, say, 3, you're likely making a mistake somewhere.

3. What I meant was that I'm entering float values into an array, sorting them, and printing the array out. The values I'm using for input are .3476789, 1004008.67, .0000099, 1.3435678, 78.345678, 54321678.567, and 22.6. However, when displayed back in ascending order they are:
0.000010
0.000000
1.345678
22.600000
78.000000
1004008.000000
54321680.000000
Sum of the array is: 55325788.000000
The sum of the Array should be : 55325789.8739346

4. Since none of us are mind readers... please post the minimum compilable segment of code that demonstrates the problem.

5. I caught an error in my sorting algorithm: I had the temp value defined as int, which was obviously causing a problem. Fixing that repaired some of the problems, but there are still a few:
xArray[0]: 0.000010
xArray[1]: 0.347679
xArray[2]: 1.343568
xArray[3]: 22.600000
xArray[4]: 78.345680
xArray[5]: 1004008.687500
xArray[6]: 54321680.000000
Sum of the array is: 55325788.000000
The sum of the Array should be : 55325789.8739346

The value in xArray[6] should have been output as 54321678.567, and the two totals are 1.8739346 apart.

6. Code:
```c
float sumFloats(float xArray[], int numFloats){

    int j, k;
    float sumArray;
    for(j=0; j < numFloats - 1; j++){
        for(k = j + 1; k < numFloats; k++){
            if(*(xArray + j) >= *(xArray + k)){
                float temp = *(xArray +j);
                *(xArray + j) = *(xArray + k);
                *(xArray + k) = temp;
            }
        }
    }

    for(k = 0; k < numFloats; k++){
        printf("xArray[%d]: %f\n", k, xArray[k]);
        sumArray = sumArray + xArray[k];
    }

    printf("Sum of the array is: %f\n", sumArray);
    printf("The sum of the Array should be : 55325789.8739346\n\n");
}
```

7. Ok... one more time.... Please post your code!

8. Code:
```c
#include <stdio.h>
#include <stdlib.h>
void sumFloats(double xArray[], int numFloats){

    int j, k;
    double sumArray = 0.0;
    for(j=0; j < numFloats - 1; j++){
        for(k = j + 1; k < numFloats; k++){
            if(*(xArray + j) >= *(xArray + k)){
                double temp = *(xArray +j);
                *(xArray + j) = *(xArray + k);
                *(xArray + k) = temp;
            }
        }
    }

    for(k = 0; k < numFloats; k++){
        printf("xArray[%d]: %f\n", k, xArray[k]);
        sumArray = sumArray + xArray[k];
    }

    printf("Sum of the array is: %f\n", sumArray);
    printf("The sum of the Array should be : 55325789.8739346\n\n");
}
int main (void)
{
    double myarray[] = { .3476789, 1004008.67, .0000099, 1.3435678, 78.345678, 54321678.567, 22.6 };
    sumFloats(myarray, sizeof(myarray) / sizeof(*myarray));
    return 0;
}
#if 0
My output:
xArray[0]: 0.000010
xArray[1]: 0.347679
xArray[2]: 1.343568
xArray[3]: 22.600000
xArray[4]: 78.345678
xArray[5]: 1004008.670000
xArray[6]: 54321678.567000
Sum of the array is: 55325789.873935
The sum of the Array should be : 55325789.8739346
#endif
```
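Apart from a couple of small fixes (sumArray is now initialized to 0.0, and the function returns void since it never returned a value), notice that I'm using double instead of float here. The greater precision of a double seems to fix your problem. You might be pedantic and say that 55325789.873935 != 55325789.8739346, but it's damn close: %f prints six decimal places by default, and the printed value is the expected one rounded to six places. If you really need more precision than that, though, I recommend using GMP instead.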

9. I think it should be taught as part of the course that float gives about 7 significant decimal digits and double gives almost 16. That's basic knowledge to have before one consciously chooses float vs. double.

I don't think installing some third-party high precision math library is what is expected of the student to solve the problem.
