debugging multi-threaded code with OpenMP

**KBriggs** · 02-28-2017

OK, an update on results with a larger data set. This makes less and less sense.

on Windows 7, using x86_64-w64-mingw32-gcc (mingw64) to compile without any optimizations, my results DO depend on the number of threads. I'll spare you the details, but I get slightly different error counts using 8,7,4,2 and 1 thread. The 1-threaded parallel version matches the serial version, however. It is slightly better than it was before I started this process (1 event in about 5000 is affected, versus 1 in 3000 before).

On Linux, using gcc to compile, the results DO NOT depend on the number of threads (1,2, and 4 tested). The serial version also matches all of the parallel versions. Perhaps even weirder, the serial version on Linux does not match the serial version on Windows (?!) despite being literally identical code.

The line that is causing problems (main.c: 252) uses only either private or read only data. I am about as certain as I can be that there is no cross-talk possible between threads inside this function. And yet if I set up the config file so that this doesn't get used at all, all differences disappear on windows.

Basically: https://media.giphy.com/media/WM3HX2cZ3zTry/giphy.gif

Is it possible that I am dealing with a bug in the OpenMP implementation that ships with mingw64? Even that wouldn't explain differences between serial versions on each platform.

My next test will be to cross-compile for windows on my linux box and run the same tests. The cross-compiler is still mingw64 but maybe there are differences between the windows and linux packages. I'm pretty much at a loss.

**Salem** · 02-28-2017

My suggestion would be to rewrite this function in long hand.

Code:

int signum(double num)
{
    return (EPS<num)-(num<-EPS);
}

This is the kind of cheap trick which seems to work in theory, and on small numbers, but which has a nasty habit of failing miserably near the limits.

**KBriggs** · 02-28-2017

what do you mean by long hand? Something like:

Code:

int signum(double num)
{
    if (num > EPS)
    {
        return 1;
    }
    else if (num < -EPS)
    {
        return -1;
    }
    else
    {
        return 0;
    }
}

?

**Salem** · 02-28-2017

Yes, if that's what you expect it to do.

**KBriggs** · 02-28-2017

OK, I will test that as well.

Out of curiosity, where is the problem with the function as implemented before? (num>EPS) and (num<-EPS) are both just either 1 or 0, so I'm not sure where the floating point problems would arise.

EDIT: quick update - the cross-compiled results matched the windows 7 natively compiled results for a given number of threads, so that much at least makes sense, since they are the same compiler.

EDIT 2: the change to signum did not change the outcome

**KBriggs** · 03-01-2017

I think the fact that I get different results with the serial program using gcc vs mingw64 is probably telling. I am starting to suspect that the differences with different thread counts are a red herring and that I'm dealing with something on the level of the compiler (maybe these functions are defaulting to fopen() without throwing an error somehow?) leading to undefined behavior, presumably on the windows/mingw64 side only, since the linux/gcc gives the same results across thread counts. Do you agree?

**Salem** · 03-06-2017

So back to reading the code, I found another bug.

Code:

    /* Initialize diag. */
    if (!C->scale_diag)
        for (j = 0; j < n; j++)
            diag[j] = 1;

scale_diag is set to 1 in lm_control_double.
That pesky ! means the diag array isn't being initialised.

malloc doesn't fill the memory with zeros or anything, it's just random junk from whatever was there last time. It's usually zero the first time it's allocated, but if you get a block that was the result of a previous free(), then it's just how it was.

In the debugger

Code:

261	    /* Initialize diag. */
262	    if (!C->scale_diag)
263	        for (j = 0; j < n; j++)
264	            diag[j] = 1;
265	
266	    /***  Evaluate function at starting point64_t and calculate norm.  ***/
(gdb) p C->scale_diag
$1 = 1
(gdb) p n
$2 = 7
(gdb) p *diag@n
$3 = {1.3980432860952889e-76, 1.3980432860952889e-76, 1.3980432860952889e-76, 1.3980432860952889e-76, 
  1.3980432860952889e-76, 1.3980432860952889e-76, 1.3980432860952889e-76}
(gdb) n
268	    if (C->verbosity) {

With that fixed, I get identical results in the sequential, and the 1 to 8 threads versions.

**KBriggs** · 03-07-2017

nice find!

I also found a whole bunch of unsafe floating point comparisons (== and !=) in that file, which I fixed in a commit a couple of days ago. I'm testing at my end after removing the ! you indicated, and I'll let you know how it goes.

EDIT: I still get different results using gcc versus MinGW-w64 using the serial version. Only 1 event in 15,887 is affected, so the data set I gave you is probably too small to reveal it. Shame, I was excited, too. I'll look for more initialization problems.

I've reorganized main.c in the master branch as per your advice. It means I won't be able to merge the parallelized branch automatically, but I think I will restart that anyway in the new architecture once I've nailed down this bug.

EDIT 2: It turns out that changing all the doubles in stepfit.c and related files to long doubles changes the outcome even using the same compiler. It's possible that what this bug is, is floating point rounding errors. Nonlinear fitting is "chaotic" in the sense that it is extremely sensitive to initial conditions. Subtle differences in how the different compilers handle floating point calculations could compound into divergent results over time. Is that a realistic possibility?

**KBriggs** · 03-08-2017

Further investigation found a whole bunch of unsafe floating point operations in lmmin_int64.c. Stuff like

Code:

for (j = 0; j < n; j++) {
            temp3 = fjac[j*m+j];
            if (d_abs(temp3) > EPS) {
                sum = 0;
                for (i = j; i < m; i++)
                    sum += fjac[j*m+i] * wf[i];
                temp = -sum / temp3;
                for (i = j; i < m; i++)
                    wf[i] += fjac[j*m+i] * temp;

The cumulative sums regularly add tiny numbers which are below the resolution of floating point precision. I'm sure there are lots more. It appears to not matter most of the time, but fixing it properly would probably require rewriting that whole library. Floating point differences near the limits of resolution could certainly explain behavior differences between compilers. Not quite sure how different threads come into it, yet.

In further support of this theory, the one event that comes out different for me is a corner case which will by nature have very poor performance in non-linear fitting which would tend to make it more sensitive to floating point errors.

**Salem** · 03-08-2017

The first of many -f options for controlling floating point operations in GCC.

The following options control compiler behavior regarding floating-point arithmetic. These options trade
off between speed and correctness. All must be specifically enabled.

-ffloat-store
Do not store floating-point variables in registers, and inhibit other options that might change
whether a floating-point value is taken from a register or memory.

This option prevents undesirable excess precision on machines such as the 68000 where the floating
registers (of the 68881) keep more precision than a "double" is supposed to have. Similarly for the
x86 architecture. For most programs, the excess precision does only good, but a few programs rely on
the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying
them to store all pertinent intermediate computations into variables.

**KBriggs** · 03-09-2017

I thought that fixed it but later realized I had changed an important setting so the test was not valid. I am rerunning all my comparisons again, I will update shortly.

**Salem** · 03-09-2017

If you have
double var1 = some_calculation;
double var2 = some_calculation_with_var1;

Without -ffloat-store, var1 may (or may not) exist as an extended precision value in some floating point register when it comes to calculating var2.

With -ffloat-store, var1 will have exactly double precision when it's used in the calculation of var2.

> That does it. With -ffloat-store specified I get identical output between gcc and mingw.
Bear in mind that mingw is just a version of GCC.
Run 'gcc --version' on both machines to find out the respective versions.

**KBriggs** · 03-09-2017

So rerunning with the previous settings resulted in -ffloat-store not doing anything after all. What changed before was that I changed the value of EPS (defined in utils.h) to 1e-10 from 1e-30. From the perspective of lmmin_int64.c, that basically means taking a much more liberal definition of "zero" since for all of the branching logic in that file I fixed the floating point unsafe comparisons by comparing the absolute value of all of the parameters to EPS instead of zero. Basically this goes in the same direction - rounding errors causing issues in floating point calculations.

I'm playing with some other options at the moment (-mfpmath=sse -msse2 -march=corei7) to see what differences it causes, if any. But the EPS change is encouraging to me that this really is floating point issues rather than any sort of undefined behavior.

My linux box is using gcc 5.4.0, while my windows box uses 5.1.0.

**KBriggs** · 03-10-2017

The other compiler flags did not really do anything either, but using a more liberal definition of zero seems to make the difference.

I want to thank you for taking the time to help be understand this issue - you went above and beyond to help a complete stranger debug their code. Not a very satisfying conclusion and one that I probably should have tried sooner, but at least I now know it's not undefined behavior or some memory issue waiting to screw me up badly down the road. I led us on a wild goose chase with OpenMP, but I probably would never have even realized there was a problem if I hadn't.

Can I acknowledge you somehow in the code repo? I'd be happy to add an acknowledgements page or something along those lines if you would like.

**Salem** · 03-10-2017

> Can I acknowledge you somehow in the code repo?
You could put a URL to this thread in the docs if you want

> Only 1 event in 15,887 is affected, so the data set I gave you is probably too small to reveal it.
I was going to suggest this.

Code:

$ git diff
diff --git a/stepfit.c b/stepfit.c
index 6a03f30..fed4762 100644
--- a/stepfit.c
+++ b/stepfit.c
@@ -114,6 +114,21 @@ void step_response(event *current, double risetime, int64_t maxiters, double min
         control.patience = maxiters;
         lm_status_struct status;
 
+        if ( event->index == 0 ) {  // replace 0 with the magic number of the bad event
+            FILE *dump = fopen("dump.dat","wb");
+            fwrite(&n,sizeof(n),1,dump);
+            fwrite(par,sizeof(par[0]),n,dump);
+            fwrite(&length,sizeof(length),1,dump);
+            fwrite(time,sizeof(time[0]),length,dump);
+            fwrite(current->signal,sizeof(current->signal[0]),length,dump);
+            fwrite(&maxlength,sizeof(maxlength),1,dump);
+            fwrite(&maxstep,sizeof(maxstep),1,dump);
+            fwrite(&maxbaseline,sizeof(maxbaseline),1,dump);
+            fwrite(&risetime,sizeof(risetime),1,dump);
+            fwrite(&sign,sizeof(sign),1,dump);
+            fclose(dump);
+        }
+
         lmmin_int64(n, par, length, (const void*) &data, evaluate, &control, &status );

Then a much simplified main() to load dump.dat into appropriate buffers, then call lmmin_int64() would permit much quicker turn around on experiments and analysis.