I have a piece of signal processing code that I've been working on for the past 2 years or so whenever I have some downtime. It works pretty well, and I've been slowly optimizing it, more as a way to teach myself optimization techniques than because it really needs it.

My most recent optimization was to try to parallelize parts of it using OpenMP, and it went quite well, with one exception that I have no idea how to debug. I'll describe the problem as best I can. I know the following is a bit vague and I'm not expecting anyone to solve it for me, what I am hoping for is suggestions about general ways to approach this type of debugging problem.

Part of my project uses another code library (lmfit-6.1) which I found online under an open source license and tweaked slightly to get it to work with my own code. When I run the parallelized version, everything works well and gives the same result as the serial version, except when I include calls to this library. If the library code is called, then the outcome depends (very slightly) on the number of threads used (about 1 in 3000 signals are affected), and the parallel version forced to run on 1 thread gives slightly different results from the serial version.

Here's why this is tricky: all of the results are valid. When a signal is analyzed, there are about 10 reasons the program can fail to analyze it, each with its own error code. The differences I am seeing are just in the reason that a handful of signals fail to be analyzed - instead of error 1 I get error 2 - but all of the outputs are perfectly valid.

Here are my observations so far. Again, I don't expect anyone to actually be able to solve this with the information presented, I am just hoping that my describing some of the patterns I am seeing in the behavior that someone with more experience than me will be able to suggest a common newby parallelization error that would explain it.


1. The main() function makes exactly 1 call to this nonlinear fitting library. If I set the program up so that this library does not get used, then all the problems go away. However, enclosing calls to the library in a critical region does not fix the issue. This leads me to believe that I have parallelized correctly the rest of the code, and that the offending portion is isolated to the nonlinear fitting library and my wrapper functions that use it.

2. The issue appears to be deterministic (for a given number of threads, I get the same outcome every time). This suggests to me that it's not memory trashing between threads, since I would not expect that to be deterministic between runs, though I know it doesn't conclusively rule it out.

3. This library is called inside a parallel region, and all of its arguments are either private or shared read only.

4. The library uses a handful of global variables, which are all just read only config parameters and should cause no issues having multiple threads access them. Between 3. and 4. I am unable to find any place in the code where there could be cross-contamination between threads.

Taken together, it seems that there is a cross-talk problem which occurs in some tiny proportion of loop iterations, in such a way as to be deterministic. Because the output is still technically valid (just different) in every case it’s extremely difficult to track down what could be causing it. If anyone has general advice for how to approach a problem of this nature it would be appreciated. I realize this is a pretty vague question and I don't have anything resembling a MWE, but without even knowing where to start looking for the bug I wouldn't know how to put one together.
If anyone is curious, the project is here, and the nonlinear fitting library in question is lmmin_int64.c.