Mathematical Expression Produces Incorrect Result

**nul** · 03-01-2015

I've implemented the Euler method (Euler method - Wikipedia, the free encyclopedia), but what I'm actually having a problem with is evaluating the exact solution.

The following is the mathematical expression I'm trying to compute:

(t^6 - 37)^(1/3).

Here is the C function:

Code:

double exact_solution( double t ){ //exact solution function
   return pow(pow(t, 6) - 37, (1.0 / 3));
}

With an input of 3.0, the C function produces a result of: 8.783047
However a calculator produces: 8.845085422

What do you guys think?

EDIT: Okay. After testing just the expression, its not the problem. Here's the whole program.

Code:

#include <stdio.h>
#include <math.h>


double F( double t, double x ){ //slope function
   return pow(t,5) * 2 / (x * x);
}


double exact_solution( double t ){ //exact solution function
   return pow(pow(t, 6) - 37, (1.0 / 3));
}


int main(int argc, char *argv[]){


   double t = 2;           //independent var
   double delta_t = .01;   //step size
   double x_n1 = 3;        //x(0) = x_n1 ( initial value )
   double x_n2;
   double interval_size = 1;
   double exact = exact_solution(t);


   printf("step size -> %f\n\n", delta_t  );


   for(int i = 0;i < interval_size/delta_t; i++){
      x_n2 = x_n1 + delta_t * F(t, x_n1);
      x_n1 = x_n2;
      exact = exact_solution(t);
      t += delta_t;
      printf("%f %f %f %f\n", t, x_n2, exact, fabs(exact - x_n2) / exact * 100);
   }


   return 0;
}

**Salem** · 03-01-2015

Mmm, a for loop and many floating point operations.
Euler method - Wikipedia, the free encyclopedia

Also this
What Every Computer Scientist Should Know About Floating-Point Arithmetic

**nul** · 03-02-2015

Salem, thank you very much.

What has me scratching my head is that the approximation comes out exactly as printed in the back of the book. So I use the double variable "t" as the parameter for both functions; one produces the expected result, while the other doesn't.

My TA provided a spreadsheet to do the problems, but suffering from the not-invented-here syndrome, I really wanted to write my own C version.

**nul** · 03-03-2015

So I haven't looked at this since my last post. Does anybody have a possible solution I could implement?

I've thought of just calculating the exact value at pre-determined points using hard-coded constants to avoid the floating-point rounding error. I'd like to hear what you smart folks have to say.

**Nominal Animal** · 03-03-2015

nul, your issue is about understanding the mathematical method, not so much a programming issue.

That said, perhaps the following variant of the same code helps you undestand better:

Code:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <math.h>

/* Slope function f(t, x) = y'(t) where x = y(t) */
static double f(const double t, const double x)
{
    const double tt = t * t;
    return 2.0 * t * tt * (tt / (x * x)); /* 2 * t^5 / x^2 */
}

/* Known exact function y(t) */
static double known_y(const double t)
{
    const double t2 = t * t;
    const double t4 = t2 * t2;
    return cbrt(t2 * t4 - 37.0);
}

int main(int argc, char *argv[])
{
    const double t0 = 2.0;  /* Starting point */
    const double y0 = 3.0;  /* Function at starting point */
    const double t1 = 3.0;  /* Ending point */
    double       h, y;
    long steps, i;
    char dummy;

    if (argc != 2 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s STEPS\n", argv[0]);
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    if (sscanf(argv[1], " %ld %c", &steps, &dummy) != 1 || steps < 1L) {
        fprintf(stderr, "%s: Invalid number of steps.\n", argv[1]);
        return EXIT_FAILURE;
    }

    h = (t1 - t0) / (double)steps;
    y = y0;

    printf("# t f(t,y) y y(t) h\n%.9f %.9f %.9f %.9f %.9f\n", t0, f(t0, y0), y0, known_y(t0), h);
    for (i = 1L; i <= steps; i++) {
        const double phase = (double)i / (double)steps;
        const double t = (1.0 - phase) * t0 + phase * t1;
        const double fvalue = f(t, y);
        y += h * fvalue;
        printf("%.9f %.9f %.9f %.9f\n", t, fvalue, y, known_y(t));
    }

    fflush(stdout);
    fprintf(stderr, "Expected  y = %.16f\n", known_y(t1));
    fprintf(stderr, "Evaluated y = %.16f\n", y);

    return EXIT_SUCCESS;
}

The Wikipedia Euler method article has a pretty good description. Let me re-describe it, with variable names matching the above program and the article.

The problem at hand is an initial value problem. We know that

dy(t) / dt = f(t, y(t))
f(t, x) = 2 t⁵ / x²
y(2) = 3.0

and we want to find out

y(3) = ?

In other words, we don't know the function y(t), but we know its slope is f(t,x) where x=y(t). One way to approximate this is to apply the Euler method, where t=t₀+nh, and h is the number of steps between t₀=2 (the starting point) and t₁=3 (the ending point), and we know the initial value y(t₀)=y₀=3.

The above program has hardcoded the starting point (t0=2.0, t1=3.0, and y0=3.0), but the number of steps taken is supplied on the command line. The step size, h, is determined from those three (h = (t1 - t0) / steps).

Because we have an integer number of steps (iterations), I'm using a stable numeric trick to compute t at each step. The number of iterations (starting at 1) divided by the total number of steps gives us phase, a value beween 0 (exclusive) and 1 (inclusive), at very high precision. We can then calculate t = (1.0-phase)*t0 + phase*t1 to calculate t without cumulative errors. (Here, the difference to just adding h to t at each step would not make much of a difference, but if the start and end point were dozens of orders of magnitude different, then this method shines: it will yield reliable values even then.)

During each iteration, in the loop body, we precalculate t and fvalue = f(t, y), then update y=y+h*fvalue. This is equivalent to the Wikipedia article y_N+1=y_N+h×f(t_N,y_N). We use separate variables, so we can print the actual values used, as well as y_N+1, the point of the iteration.

After the loop is done, we have the approximation of y(t1) in y. To make it easier to see the effect the number of steps has on the approximation, the summary is printed to standard error (while all the intermediate values are printed to standard output). Note that the loop will run much faster, if only the end result were to be printed; printing floating-point values is kind-of slow, when you have lots of iterations.

I compiled and ran the above as euler, with different numbers of steps. The results, summarized, are as follows:

Code:

#  Steps        Approximation    h
       1  57.0000000000000000    1
       2  15.1173635453141522    0.5
       3  11.2860983089115550    0.3333333
       5  10.0388388035749934    0.2
      10   9.4014859985922055    0.1
      20   9.1159639917882256    0.05
      50   8.9518255220655405    0.02
     100   8.8981953539855372    0.01
     500   8.8556663104274111    0.002
    1000   8.8503733129637165    0.001
   10000   8.8456139814468280    0.0001
  100000   8.8451382755005721    0.00001
 1000000   8.8450907071763734    0.000001
10000000   8.8450859503674462    0.0000001

#   Exact  8.8450854218326640

nul, in your code delta_t = .01 = h, which translates to just 100 steps. As the Wikipedia article suggests, the error is roughly proportional to the step size, so you only get about two digits of precision. Using more steps (more values of t and smaller h (your delta_t) gives you more precision.

Questions?

**nul** · 03-03-2015

Thank you for the very detailed write-up/explanation along with code.

I do have a question: would you mind explaining in a little more detail why/how this method of calculating t is so accurate? Maybe a link to a resource covering it?

**Nominal Animal** · 03-03-2015

Originally Posted by nul

would you mind explaining in a little more detail why/how this method of calculating t is so accurate?

Well, it boils down to floating-point precision, and errors in floating-point operations, especially cumulative errors and domain cancellation errors. It is a method you can show has no cumulative error, and very small errors for any possible values. I claim it is precise, because I know it has very small (sources of) errors.

The numerical analysis guides and books available out there (a commonly recommended one is Numerical Recipes) explain the details. There is lots of interesting stuff there, starting from things like Kahan summation algorithm.

First, the phase math should be obvious:

0 ≤ phase ≤ 1
t(0) = t₀
t(1) = t₁
t(phase) = t₀ (1 - phase) + t₁ phase

Although mathematically

t(phase) = t₀ + phase (t₁ - t₀)

we don't want to use that, because the subtraction may lead to large domain cancellation error.

(Domain cancellation error occurs when summing or subtracting two floating-point values, with one having an exponent so much larger that the smaller one does not affect the result at all.)

Given 1 ≤ i ≤ N, where i and N are of int or long type, (double)i / (double)N yields a double-precision floating point within 0 and 1, inclusive, with minimum rounding error. Substracting that value from 1 is precise, in the sense that the two floating-point values do sum exactly to 1. Multiplying any double-precision number by a double-precision number between 0 and 1 yields a result with minimum rounding error.

The only source of possibly significant error is the addition: If the exponents in the two (multiplied) floating-point values differ enough, you get domain cancellation error: their sum is just the larger one (in magnitude). However, this is okay, as the earlier zero-to-one multiplier still gets us to the correct ballpark -- as close to the correct result as possible! Also, if the two exponents are similar, the error in the addition is neglible.

Simply put, this method or expression just minimises the sources of floating-point errors. It is precise, because there are no sources of cumulative error, and all other error sources are kept to a minimum for all possible inputs.

Sometimes the nearly-equivalent expression

Code:

const double t = ((double)(N - i) * t0 + (double)i * t1) / (double)N;

is used. I *think* it is not as precise as the phase method above, when the range is very wide (or one endpoint close to zero, and another large in magnitude), due to the (largeish) divisor. However, it does not have any sources for cumulative error, so even this one should be much more precise than using a fixed step size, and adding it to t each step, even if not as precise as the above phase method.

Now, let's compare the phase method to the fixed-step-size, additive method:

When you use a fixed-size step, any rounding error in the step size will accumulate in t instead. If you have a million steps, the error in the sum is million times the rounding error in the step size!

A practical example is always warranted.

Consider the case where the starting point is very large negative value, say t0 = -1e30 (-1,000,000,000,000,000,000,000,000,000,000,000,000, 000,000,000,000,000,000,000,000,000,000,000,000,00 0,000,000,000,000,000), and the end point is at zero, t1 = 0.0. Let's say we use a million steps. The step size is then (0--1e30)/1000000 = 1e30/1e6 = 1e24. However, that is not exactly representable by a double-precision floating-point number; the closest representable value is 999999999999999983222784. So, if you start at t0 and just add the step size a million times, your final result is 14176573061112791040 instead of zero!

Of course, with the "phase" method, the final result in this case is zero, 0*t0 + 1*t1 = 0*(-1e30)+1*0 = 0.

Here are the numerical values comparing the two methods for the above range, using 20 steps only:

Code:

Using fixed step size            Using zero-to-one phase method
-1000000000000000019884624838656 -1000000000000000019884624838656
 -950000000000000032964142432256  -950000000000000032964142432256
 -900000000000000046043660025856  -900000000000000046043660025856
 -850000000000000059123177619456  -850000000000000059123177619456
 -800000000000000072202695213056  -800000000000000072202695213056
 -750000000000000085282212806656  -749999999999999944544724451328
 -700000000000000098361730400256  -700000000000000098361730400256
 -650000000000000111441247993856  -649999999999999970703759638528
 -600000000000000124520765587456  -600000000000000054152021409792
 -550000000000000137600283181056  -549999999999999996862794825728
 -500000000000000150679800774656  -500000000000000009942312419328
 -450000000000000163759318368256  -450000000000000023021830012928
 -400000000000000176838835961856  -400000000000000036101347606528
 -350000000000000189918353555456  -350000000000000049180865200128
 -300000000000000202997871149056  -300000000000000027076010704896
 -250000000000000216077388742656  -250000000000000004971156209664
 -200000000000000229156906336256  -200000000000000018050673803264
 -150000000000000224644237885440  -150000000000000013538005352448
 -100000000000000220131569434624  -100000000000000009025336901632
  -50000000000000215618900983808   -50000000000000004512668450816
                -211106232532992                                0

The differences tend to show up near the end of the interval. The differences also increase as the number of steps increases. The second-to-last one has already two digits less precision the phase method yields, and obviously the final values differ significantly.

Many people do not care, because this only occurs if your ranges are "difficult"; one endpoint close to zero, another very large in magnitude. They say that such ranges are "pathological": exactly the kind that shows this problem; and the fix is to "not use such ranges".

Indeed, if you compare the 20-step ranges from 0 to 12345678987654321, which is not pathological or difficult at all, just a normal range, you get

Code:

Fixed step size            Phase method
                  0                   0
    617283949382716     617283949382716
   1234567898765432    1234567898765432
   1851851848148148    1851851848148148
   2469135797530864    2469135797530864
   3086419746913580    3086419746913580
   3703703696296296    3703703696296296
   4320987645679012    4320987645679012
   4938271595061728    4938271595061728
   5555555544444444    5555555544444444
   6172839493827160    6172839493827160
   6790123443209876    6790123443209876
   7407407392592592    7407407392592592
   8024691341975308    8024691341975309  <- the only difference!
   8641975291358024    8641975291358024
   9259259240740740    9259259240740740
   9876543190123456    9876543190123456
  10493827139506172   10493827139506172
  11111111088888888   11111111088888888
  11728395038271604   11728395038271604
  12345678987654320   12345678987654320

with only one difference: the exact correct value is 8024691341975308.65 (verified using Maple), which is represented in double precision by 8024691341975309.

**nul** · 03-04-2015

Extremely informative. Thank you so much!!

Thread: Mathematical Expression Produces Incorrect Result

Thread Tools

Search Thread

Display

Mathematical Expression Produces Incorrect Result

Similar Threads

Incorrect result in Bitwise operations

incorrect result in program

Evaluate mathematical expression

Incorrect result with my program

processing of user inputted mathematical expression