OK, an update on results with a larger data set. This makes less and less sense.

On Windows 7, using x86_64-w64-mingw32-gcc (mingw64) to compile without any optimizations, my results DO depend on the number of threads. I'll spare you the details, but I get slightly different error counts using 8, 7, 4, 2, and 1 threads. The 1-threaded parallel version matches the serial version, however. The error rate is slightly better than it was before I started this process (about 1 event in 5000 is affected, versus 1 in 3000 before).

On Linux, using gcc to compile, the results DO NOT depend on the number of threads (1, 2, and 4 tested). The serial version also matches all of the parallel versions. Perhaps even weirder, the serial version on Linux does not match the serial version on Windows (?!) despite being built from literally identical code.
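
One way to narrow down where the two serial builds first diverge is to log exact bit patterns of intermediate values on each platform and diff the logs, rather than comparing rounded decimal output. This is only a rough sketch: it assumes the quantities involved are IEEE 754 doubles (which may not be the case here), and `double_bits` and `log_exact` are hypothetical helper names, not anything in the existing code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the raw 64-bit pattern of a double (assumes IEEE 754 binary64). */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* well-defined way to reinterpret the bytes */
    return bits;
}

/* Hypothetical helper: write one labelled value to a log as 16 hex digits,
   so logs from two platforms can be compared with a plain text diff. */
void log_exact(FILE *log, const char *label, double x)
{
    assert(log != NULL);
    fprintf(log, "%s %016" PRIx64 "\n", label, double_bits(x));
}
```

Calling something like `log_exact(f, "event", value)` just before and after the suspect line on both machines should show the first event at which the two builds disagree, even when the difference is in the last bit.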

The line that is causing problems (main.c:252) uses only private or read-only data. I am about as certain as I can be that no cross-talk between threads is possible inside this function. And yet if I set up the config file so that this line doesn't get used at all, all differences disappear on Windows.
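
One way to have the compiler check the "private or read-only" claim is to put `default(none)` on the parallel construct, so that every variable used inside it must have its sharing listed explicitly. The sketch below is not the real code: `process_event`, `process_all`, and all parameter names are hypothetical stand-ins for whatever main.c:252 actually calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the work done at main.c:252:
   reads from a lookup table, writes nothing shared. */
double process_event(const double *table, size_t n_table, size_t event)
{
    return table[event % n_table] * 2.0;
}

/* Process all events in parallel; each iteration writes only out[i]. */
void process_all(const double *table, size_t n_table, double *out, long n_events)
{
    long i;
    assert(n_table > 0);
    /* default(none): compilation fails if any variable used in the region
       is missing from a data-sharing clause (the loop index is private). */
    #pragma omp parallel for default(none) shared(table, n_table, out, n_events)
    for (i = 0; i < n_events; i++) {
        out[i] = process_event(table, n_table, (size_t)i);
    }
}
```

This only checks variables named in the construct itself. It does not catch static or global state touched inside the called function, or non-reentrant library calls, so those would still need to be checked by hand.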

Basically: https://media.giphy.com/media/WM3HX2cZ3zTry/giphy.gif

Is it possible that I am dealing with a bug in the OpenMP implementation that ships with mingw64? Even that wouldn't explain the differences between the serial versions on the two platforms.

My next test will be to cross-compile for Windows on my Linux box and run the same tests. The cross-compiler is still mingw64, but maybe there are differences between the Windows and Linux packages. I'm pretty much at a loss.