Thread: debugging multi-threaded code with OpenMP

  1. #1
    Registered User
    Join Date
    Jun 2009
    Posts
    486

    debugging multi-threaded code with OpenMP

    I have a piece of signal processing code that I've been working on for the past 2 years or so whenever I have some downtime. It works pretty well, and I've been slowly optimizing it, more as a way to teach myself optimization techniques than because it really needs it.

    My most recent optimization was to try to parallelize parts of it using OpenMP, and it went quite well, with one exception that I have no idea how to debug. I'll describe the problem as best I can. I know the following is a bit vague and I'm not expecting anyone to solve it for me, what I am hoping for is suggestions about general ways to approach this type of debugging problem.

    Part of my project uses another code library (lmfit-6.1) which I found online under an open source license and tweaked slightly to get it to work with my own code. When I run the parallelized version, everything works well and gives the same result as the serial version, except when I include calls to this library. If the library code is called, then the outcome depends (very slightly) on the number of threads used (about 1 in 3000 signals are affected), and the parallel version forced to run on 1 thread gives slightly different results from the serial version.

    Here's why this is tricky: all of the results are valid. When a signal is analyzed, there are about 10 reasons the program can fail to analyze it, each with its own error code. The differences I am seeing are just in the reason that a handful of signals fail to be analyzed - instead of error 1 I get error 2 - but all of the outputs are perfectly valid.

    Here are my observations so far. Again, I don't expect anyone to actually be able to solve this with the information presented; I am just hoping that by describing some of the patterns I am seeing in the behavior, someone with more experience than me will be able to suggest a common newbie parallelization error that would explain it.


    1. The main() function makes exactly one call to this nonlinear fitting library. If I set the program up so that this library does not get used, then all the problems go away. However, enclosing calls to the library in a critical region does not fix the issue. This leads me to believe that I have correctly parallelized the rest of the code, and that the offending portion is isolated to the nonlinear fitting library and my wrapper functions that use it.

    2. The issue appears to be deterministic (for a given number of threads, I get the same outcome every time). This suggests to me that it's not memory trashing between threads, since I would not expect that to be deterministic between runs, though I know it doesn't conclusively rule it out.

    3. This library is called inside a parallel region, and all of its arguments are either private or shared read only.

    4. The library uses a handful of global variables, which are all just read-only config parameters and should cause no issues when multiple threads access them. Between 3 and 4, I am unable to find any place in the code where there could be cross-contamination between threads (a rough sketch of the call pattern from point 3 follows this list).
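    To make point 3 concrete, the call pattern is roughly like the sketch below. Every name in it (fit_signal, signal_t, NPARAMS and so on) is a placeholder I made up for illustration, not an actual identifier from my code:
    Code:
    #define NPARAMS 4

    typedef struct { double data[256]; int length; } signal_t;

    /* stand-in for my wrapper around the lmfit routine */
    extern int fit_signal(const signal_t *sig, double *params, const void *config);

    void analyze_all(const signal_t *signals, int *results, int nsignals, const void *config)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < nsignals; i++)      /* loop index is private by default */
        {
            double params[NPARAMS];             /* private output buffer per iteration */

            /* config is shared but only ever read; each iteration writes
               only to its own slot of results */
            results[i] = fit_signal(&signals[i], params, config);
        }
    }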

    Taken together, it seems that there is a cross-talk problem which occurs in some tiny proportion of loop iterations, in such a way as to be deterministic. Because the output is still technically valid (just different) in every case it’s extremely difficult to track down what could be causing it. If anyone has general advice for how to approach a problem of this nature it would be appreciated. I realize this is a pretty vague question and I don't have anything resembling a MWE, but without even knowing where to start looking for the bug I wouldn't know how to put one together.
    If anyone is curious, the project is here, and the nonlinear fitting library in question is lmmin_int64.c.
    C is fun

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    So I figured out that
    gcc -D_GNU_SOURCE -std=c99 -fopenmp -Wall *.c -lm
    will compile everything.

    Here are some things I suggest you look at.
    1. Make some functions out of that 350-line main() of yours. In particular, make the code sections you want to run in parallel into separate functions, so you know for sure the minimal set of local variables which need to be marked private in your #pragmas (there's a sketch of what I mean after this list).

    2. Before you even got to this point, you should have run the entire code through a memory checker such as valgrind to check for any out of bound accesses on all your allocated memory.

    3. There are over 20 omp pragmas in the code. This is too many for you to suddenly realise "it doesn't work". Start with one obvious case where the code can be made parallel and work from there.

    4. Are you printing any results inside parallelised sections? Because stdio (or FILE* based I/O) may not be thread safe.
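    To expand on point 1: once a parallel section lives in its own function, everything that needs to be private is just a local variable of that function and the pragma becomes trivial. A sketch of what I mean (process_chunk, chunk_t and friends are invented names, not taken from your code):
    Code:
    typedef struct { const double *samples; int length; int error_code; } chunk_t;

    /* Everything declared inside here is automatically private to the thread
       running it - no data-sharing clauses to get wrong. */
    static void process_chunk(chunk_t *chunk)
    {
        double local_sum = 0.0;                 /* thread-local by construction */
        for (int i = 0; i < chunk->length; i++)
            local_sum += chunk->samples[i];
        chunk->error_code = (local_sum == 0.0); /* each chunk writes only its own slot */
    }

    void process_all(chunk_t *chunks, int nchunks)
    {
        /* the pragma now only has to say how to divide up the loop */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < nchunks; i++)
            process_chunk(&chunks[i]);
    }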

    Regarding global variables.
    These are your globals at the moment.
    Code:
    $ for i in *.o ; do nm -A $i | egrep -v ' [UTt] ' ; done
    lmmin_int64.o:0000000000000000 R lm_control_double
    lmmin_int64.o:0000000000000060 R lm_control_float
    lmmin_int64.o:0000000000000000 D lm_infmsg
    lmmin_int64.o:0000000000000080 D lm_shortmsg
    lmmin_int64.o:00000000000000e8 d p1.4216
    lmmin_int64.o:0000000000000750 r __PRETTY_FUNCTION__.4283
    It would be an idea to turn those D variables (initialised data) into read-only data.
    Code:
    $ git diff
    diff --git a/lmmin_int64.c b/lmmin_int64.c
    index a78e580..485420e 100644
    --- a/lmmin_int64.c
    +++ b/lmmin_int64.c
    @@ -113,7 +113,7 @@ const lm_control_struct lm_control_float = {
     /*  Message texts (indexed by status.info)                                    */
     /******************************************************************************/
     
    -const char* lm_infmsg[] = {
    +const char* const lm_infmsg[] = {
         "found zero (sum of squares below underflow limit)",
         "converged  (the relative error in the sum of squares is at most tol)",
         "converged  (the relative error of the parameter vector is at most tol)",
    @@ -128,7 +128,7 @@ const char* lm_infmsg[] = {
         "stopped    (break requested within function evaluation)",
         "found nan  (function value is not-a-number or infinite)"};
     
    -const char* lm_shortmsg[] = {
    +const char* const lm_shortmsg[] = {
         "found zero",
         "converged (f)",
         "converged (p)",
    @@ -652,7 +652,7 @@ void lm_lmpar(const int64_t n, double* r, const int64_t ldr, const int64_t* Pivo
         int64_t i, iter, j, nsing;
         double dxnorm, fp, fp_old, gnorm, parc, parl, paru;
         double sum, temp;
    -    static double p1 = 0.1;
    +    static const double p1 = 0.1;
     
         /*** Compute and store in x the Gauss-Newton direction. If the Jacobian
              is rank-deficient, obtain a least-squares solution. ***/
    diff --git a/lmstruct_int64.h b/lmstruct_int64.h
    index a98a3c0..0ae00b1 100644
    --- a/lmstruct_int64.h
    +++ b/lmstruct_int64.h
    @@ -113,8 +113,8 @@ extern const lm_control_struct lm_control_float;
     
     /* Preset message texts. */
     
    -extern const char* lm_infmsg[];
    -extern const char* lm_shortmsg[];
    +extern const char* const lm_infmsg[];
    +extern const char* const lm_shortmsg[];
     
     __END_DECLS
     #endif /* LMSTRUCT_H */
    Which gives us
    Code:
    $ for i in *.o ; do nm -A $i | egrep -v ' [UTt] ' ; done
    lmmin_int64.o:0000000000000000 R lm_control_double
    lmmin_int64.o:0000000000000060 R lm_control_float
    lmmin_int64.o:00000000000003e0 R lm_infmsg
    lmmin_int64.o:0000000000000500 R lm_shortmsg
    lmmin_int64.o:0000000000000848 r p1.4216
    lmmin_int64.o:0000000000000850 r __PRETTY_FUNCTION__.4283
    This will ensure that any shared data is really read-only.
    The __PRETTY_FUNCTION__ is caused by your use of assert.


    What about this?
    Code:
    void free_filter(bessel *lpfilter)
    {
        free(lpfilter->dcof);
        free(lpfilter->ccof);
        #pragma omp parallel
        {
            free(lpfilter->temp[omp_get_thread_num()]);
        }
        free(lpfilter->temp);
        free(lpfilter);
    }
    Is a barrier implied before you get to free(lpfilter->temp) or not?
    If there isn't a barrier, then the code is broken.
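    For what it's worth, you don't need a parallel region to free per-thread buffers at all; a plain loop sidesteps the barrier question entirely and doesn't care how many threads the region happens to spawn. Something along these lines - the nthreads parameter is my invention for the sketch, substitute however the real code tracks the number of per-thread buffers:
    Code:
    /* Sketch only: thread-count-independent cleanup.  The nthreads parameter
       is invented for the sketch. */
    void free_filter(bessel *lpfilter, int nthreads)
    {
        free(lpfilter->dcof);
        free(lpfilter->ccof);
        for (int i = 0; i < nthreads; i++)  /* plain serial loop, no OpenMP needed */
        {
            free(lpfilter->temp[i]);
        }
        free(lpfilter->temp);
        free(lpfilter);
    }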

    > I don't have anything resembling a MWE, but without even knowing where to start looking for the bug I wouldn't know how to put one together.
    But you do have some test data files and config.txt (or whatever else is necessary to make a test run).
    Anything is better than nothing.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    Thank you for taking the time to put that together!

    I can certainly provide test data and config files if you would like (I honestly didn't expect anyone to try to compile and run it - I was just hoping for general advice - but if you would like to, I would be more than happy to provide them). The only issue is that because this bug only affects about 1 in 3000 events analyzed, the data file needs to be pretty big in order to actually see it. I'll see if I can put together a reasonably sized data set.

    You're right that main() is badly in need of a cleanup, and that I probably should have done that first before parallelizing anything. I did, however, take an incremental approach to parallelization - based on my testing I am confident that the threaded and serial versions give identical results up until main.c line 296, and that any bugs manifest after that point. The bug arises in the very last major parallel block, which was the latest one I started implementing.

    Good catch on free_filter(), that has been fixed. I have also made the suggested changes to the global variables.

    There are some printf()s inside parallel sections, though they are generally just warning messages printed before the program bails out, so in practice they don't get used. All FILE* I/O done in parallel sections has a separate pointer for each thread - threads do read from the same file, but each uses its own handle. There is an fprintf to one shared file by all threads in the last block, but it is inside a #pragma omp critical section.
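    For reference, that shared-file write is just the usual build-the-values-privately, serialize-only-the-write pattern, roughly like this (placeholder names, not the real identifiers):
    Code:
    #include <stdio.h>

    /* Sketch with made-up names: each thread computes its values privately and
       only the fprintf itself is serialized, so every row is intact but the row
       order depends on which thread gets there first. */
    void log_event(FILE *events_file, int event_index, double start_time, double duration)
    {
        #pragma omp critical(event_log)
        {
            fprintf(events_file, "%d,%g,%g\n", event_index, start_time, duration);
        }
    }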

    Is there a good valgrind alternative for Windows? For various reasons this is being developed on a Windows machine using Code::Blocks, which I don't think has a valgrind plugin. I could certainly port it into a Linux VM for testing if there isn't one available. EDIT: there seem to be a few options, I'll look into them vs just using a Linux VM.
    Last edited by KBriggs; 02-22-2017 at 09:03 AM.
    C is fun

  4. #4
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    Apologies for double post, but I think it is warranted here.

    I tried to find a small data set that would demonstrate the problem, but it seems to be a cumulative issue. Using a long data set triggers the error, but using any smaller subset of that data set does not. Whatever the issue, it seems to build over time. For that reason, the smallest amount of data I was able to find to demonstrate the issue was 6GB. I have compressed it and made it available here, along with a config file, but I completely understand if you don't want to download something that large.

    If you do decide to run it, here's how to go about it:

    First of all, your compilation settings are fine, and work for me on the Linux VM as well.

    Edit the first two settings:
    input_file=[my folder]\test-data.bin
    output_folder=[my output folder]

    the first one should point to the data file, and the second one should point to a folder somewhere on your machine (the folder must exist already). Inside that folder, make another folder called “events”.
    On my machine, the given config file takes ~10 minutes to run in serial mode, and requires about 600MB of RAM per thread.
    In the output folder after the run you will find a txt file called summary.txt. At the end of it is a summary of the analysis along with error codes, which looks like this:

    Code:
     Locating events...
     
    Read 720 seconds of good baseline
    Read 0 seconds of bad baseline
    
    
    ---------------------------------
    Event Summary: 3857 events detected
    
    Success: 91 %
    Failed: 9 %
    ---------------------------------
    Event Type	Count	Percentage
    
    0		413	10.7 %
    1		3097	80.3 %
    ---------------------------------
    2		8	0.207 %
    3		19	0.493 %
    4		260	6.74 %
    5		0	0 %
    6		0	0 %
    7		31	0.804 %
    8		9	0.233 %
    9		20	0.519 %
    ---------------------------------
    If I run with 8 threads vs 4 threads, for example, I get a different count for types 8 and 9, which is the symptom of the problem I'm hunting.

    To turn off the nonlinear fitting library for comparison purposes you must change the following settings:

    attempt_recovery=0
    stepfit_samples=0

    If you would like more information about what the program actually is meant to do just let me know and I would be happy to provide a more comprehensive explanation.

    On another note, I set up a Linux VM and I have the serial version of the code running through valgrind as I type this. It won't finish before I head home for the day, so I will post results of that tomorrow.
    C is fun

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    I managed to run the single-threaded version through valgrind.
    I changed the start of main like so, to allow run-time thread selection, but I've made no changes to your config.txt, except for the paths in the first two lines.
    Code:
        char *ename = "OMP_NUM_THREADS";
        char *envp = getenv(ename);
        if ( envp == NULL ) {
            int p = omp_get_num_procs();
            printf("Number of OMP procs=%d\n",p);
            omp_set_num_threads(p);
        } else {
            printf("Number of threads set by %s to %s\n",ename,envp);
        }
    All I saw was one use-after-free error.
    Code:
    $ export OMP_NUM_THREADS=1
    $ valgrind --vgdb=yes ../a.out 
    ==4879== Memcheck, a memory error detector
    ==4879== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
    ==4879== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
    ==4879== Command: ../a.out
    ==4879== 
    Number of threads set by OMP_NUM_THREADS to 1
    Using CUSUM version 3.1.3p
    
    Verifying config parameters
    
    No corrections
    
    Done config check
    
    ==4879== Warning: set address range perms: large range [0x5f4f040, 0x19d350c0) (defined)
    ==4879== Warning: set address range perms: large range [0x19d36040, 0x2db1c0c0) (defined)
    Locating events... 
    ==4879== Warning: set address range perms: large range [0x19d36028, 0x2db1c0d8) (noaccess)
    
    Read 720 seconds of good baseline
    Read 0 seconds of bad baseline
    We have 7714 edges after merge
    Processing 7714 edges
    Thread 0 has 7714 edges
    
    
    ---------------------------------
    Event Summary: 3857 events detected
    
    Success: 91 %
    Failed: 8.97 %
    ---------------------------------
    Event Type	Count	Percentage
    
    0		413	10.7 %
    1		3098	80.3 %
    ---------------------------------
    2		8	0.207 %
    3		19	0.493 %
    4		260	6.74 %
    5		1	0.0259 %
    6		0	0 %
    7		31	0.804 %
    8		7	0.181 %
    9		20	0.519 %
    ---------------------------------
    
    Cleaning up memory usage...
    ==4879== Invalid read of size 4
    ==4879==    at 0x40C442: main (main.c:384)
    ==4879==  Address 0x5b59248 is 7,192 bytes inside a block of size 7,352 free'd
    ==4879==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==4879==    by 0x40C3B7: main (main.c:373)
    ==4879==  Block was alloc'd at
    ==4879==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==4879==    by 0x40F1B7: calloc_and_check (utils.c:278)
    ==4879==    by 0x40B90C: main (main.c:52)
    ==4879== 
    ==4879== Warning: set address range perms: large range [0x5f4f028, 0x19d350d8) (noaccess)
    ==4879== 
    ==4879== HEAP SUMMARY:
    ==4879==     in use at exit: 6,048,240 bytes in 445 blocks
    ==4879==   total heap usage: 42,852 allocs, 42,407 frees, 1,171,789,408 bytes allocated
    ==4879== 
    ==4879== LEAK SUMMARY:
    ==4879==    definitely lost: 6,047,864 bytes in 442 blocks
    ==4879==    indirectly lost: 0 bytes in 0 blocks
    ==4879==      possibly lost: 0 bytes in 0 blocks
    ==4879==    still reachable: 376 bytes in 3 blocks
    ==4879==         suppressed: 0 bytes in 0 blocks
    ==4879== Rerun with --leak-check=full to see details of leaked memory
    ==4879== 
    ==4879== For counts of detected and suppressed errors, rerun with: -v
    ==4879== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

    The invalid read relates to this code.
    Code:
        free(config);  //!! freed here
        free(error_summary);
        #pragma omp parallel
        {
            free(rawsignal[omp_get_thread_num()]);
            free_baseline(baseline_stats[omp_get_thread_num()]);
        }
        free(rawsignal);
        free(baseline_stats);
    
        //!! used here (bad)
        if (config->usefilter || config->eventfilter)
        {
            free_filter(lpfilter);
        }
    I don't know yet about all the memory leaks. It took about 4 hours to do that one run.


    I checked out the main branch and ran the same tests as with the single thread run, and got the same set of results (which is a good thing).


    When running with 8 threads, things get really weird.

    I've been using git to track how the results files change from one run to another. I do rm -rf ./results ; mkdir -p ./results/events between runs to see this.

    The events/event_*.csv files are all over the place. Some files are overwritten, some are no longer written to, and others appear as new files.

    Example git status on the results directory.
    Code:
    	modified:   results/events/event_00000001.csv
    	deleted:    results/events/event_00000002.csv
    	deleted:    results/events/event_00000003.csv
    	deleted:    results/events/event_00000004.csv
    	deleted:    results/events/event_00000005.csv
    	modified:   results/events/event_00000006.csv
    	deleted:    results/events/event_00000007.csv
    	deleted:    results/events/event_00000008.csv
    	deleted:    results/events/event_00000009.csv
    	modified:   results/events/event_00000010.csv
    <<snipped>>
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
    	results/events/event_00000102.csv
    	results/events/event_00000103.csv
    	results/events/event_00000104.csv
    <<snipped>>
    But if I do cat event*.csv | sort -t, -n > merged.csv, the combined results compare identical to those from the unmodified main branch code. This I take to be a good thing.

    But the top-level events.csv and rate.csv seem completely out of sorts. I don't know what is relevant in either of these files to know if program errors are manifesting themselves in the result data.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    My valgrind run on the serial version revealed one memory leak, which I fixed (and certainly doesn't matter to the outcome). I am running it again to make sure I fixed it. My compilation and valgrind commands were

    Code:
    gcc -o CUSUM.out -D_GNU_SOURCE -O0 -g -std=c99 -fopenmp -Wall *.c -lm
    valgrind --leak-check=full ./CUSUM.out
    Code:
    ==22847== Invalid read of size 4
    ==22847==    at 0x40C5AD: main (main.c:289)
    ==22847==  Address 0x572ac58 is 7,192 bytes inside a block of size 7,352 free'd
    ==22847==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==22847==    by 0x40C578: main (main.c:283)
    ==22847==  Block was alloc'd at
    ==22847==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==22847==    by 0x40E0AC: calloc_and_check (utils.c:165)
    ==22847==    by 0x40B3C5: main (main.c:39)
    ==22847== 
    ==22847== Warning: set address range perms: large range [0x5b29028, 0x1990f0d8) (noaccess)
    sh: 1: pause: not found
    ==22847== 
    ==22847== HEAP SUMMARY:
    ==22847==     in use at exit: 116,320 bytes in 20 blocks
    ==22847==   total heap usage: 42,832 allocs, 42,812 frees, 1,171,666,400 bytes allocated
    ==22847== 
    ==22847== 116,320 bytes in 20 blocks are definitely lost in loss record 1 of 1
    ==22847==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==22847==    by 0x40E0AC: calloc_and_check (utils.c:165)
    ==22847==    by 0x40CB21: step_response (stepfit.c:83)
    ==22847==    by 0x40C292: main (main.c:251)
    ==22847== 
    ==22847== LEAK SUMMARY:
    ==22847==    definitely lost: 116,320 bytes in 20 blocks
    ==22847==    indirectly lost: 0 bytes in 0 blocks
    ==22847==      possibly lost: 0 bytes in 0 blocks
    ==22847==    still reachable: 0 bytes in 0 blocks
    ==22847==         suppressed: 0 bytes in 0 blocks
    ==22847== 
    ==22847== For counts of detected and suppressed errors, rerun with: -v
    ==22847== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
    The results differences you see are not a problem, I think - I should have added that between any two runs you need to manually empty the "events" folder. event_xxxx.csv is only overwritten if the event with index xxxx is successfully analyzed; otherwise any old results in that folder just stay there. Because different events get fed to different threads in the parallel version, the index used for a given event will be different every run, so it will look like files are changing when in fact they are just named in a different order.

    The differences you see in events.csv and rate.csv are due to ordering: the order in which lines are written to those files depends on which thread gets there first (the write is in a critical block so that only one thread writes at a time, but the order is not controlled explicitly), so between any two runs with more than 1 thread the rows of those files will be in a different order. That's also not a problem, unless I misunderstand your explanation. To compare those files between runs, sort the rows by the start_time_s column first and it should make more sense.

    The invalid read is due to this combination of lines at the end of main(), as you noted:
    Code:
        free(config);
        fr...
        if (config->usefilter || config->eventfilter)
        {
            free_filter(lpfilter);
        }
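    The fix is just a matter of ordering - config has to be freed after its last use - roughly like this (sketched here, not the exact committed code):
    Code:
    /* sketch of the reordered cleanup */
    if (config->usefilter || config->eventfilter)
    {
        free_filter(lpfilter);              /* config is used while still valid */
    }
    #pragma omp parallel
    {
        free(rawsignal[omp_get_thread_num()]);
        free_baseline(baseline_stats[omp_get_thread_num()]);
    }
    free(rawsignal);
    free(baseline_stats);
    free(error_summary);
    free(config);                           /* freed last, after its final use */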
    That has also been fixed and pushed in the serial version. It couldn't have affected the outcome of the run, though, since it happens after everything is finished processing and I'm just cleaning up the memory usage. My next steps will be to verify that these issues are actually fixed, and then to run the same tests on the parallel version (with helgrind, I guess?).

    It looks like your valgrind run showed more memory leaks than mine did. Not sure why, yet.
    Last edited by KBriggs; 02-23-2017 at 01:44 PM.
    C is fun

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Helgrind would be good, but could be a lot of effort. From the Valgrind manual:
    Runtime support library for GNU OpenMP (part of GCC), at least for GCC versions 4.2 and 4.3. The GNU OpenMP runtime library (libgomp.so) constructs its own synchronisation primitives using combinations of atomic memory instructions and the futex syscall, which causes total chaos in Helgrind since it cannot "see" those.

    Fortunately, this can be solved using a configuration-time option (for GCC). Rebuild GCC from source, and configure using --disable-linux-futex. This makes libgomp.so use the standard POSIX threading primitives instead. Note that this was tested using GCC 4.2.3 and has not been re-tested using more recent GCC versions. We would appreciate hearing about any successes or failures with more recent versions.
    Rebuilding an old compiler seems like work.

    Apart from differences in timing in rate.csv and events.csv, I see no differences at all in the summary.txt, baseline.csv or the merged events/*.csv files.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  8. #8
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    Ah. That does seem like it might be overkill at this point. I'll look into whether or not that problem applies to later versions of gcc - they mention 4.2 and 4.3, but I think I am using 5.x at the moment.

    It's very odd that you don't see differences. I need to do some differently-threaded runs on the Linux VM and compare to Windows, apparently. For that data set and config file I get differences in summary.txt on Windows, but I have not yet run the same tests on Linux since I've had valgrind going instead. There's nothing in the code that is OS-specific as far as I know, and the only even remotely non-standard functions are the fopen64() family...

    Getting OpenMP working on the Windows box was a bit of a pain (I finally got it through TDM-GCC MinGW Compiler download | SourceForge.net), so I am using slightly different compilers on the two OSes, and if I recall correctly their OpenMP implementations differ as well... I really hope that's not the issue.

    If I can't figure this out here the right move might be to back up and refactor the serial version of the code (clean up main(), mostly) before trying again.
    C is fun

  9. #9
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Memory leak updates.
    Code:
    // signal = calloc_and_check(nthreads, sizeof(double *), "Cannot allocate signal");  // line 120
    ==14908== 8 bytes in 1 blocks are definitely lost in loss record 2 of 8
    ==14908==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==14908==    by 0x40F097: calloc_and_check (utils.c:278)
    ==14908==    by 0x40C722: main._omp_fn.1 (main.c:120)
    ==14908==    by 0x514ECBE: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==14908==    by 0x40BBCC: main (main.c:113)
    ==14908== 
    
    // edge_array_head = calloc_and_check(nthreads, sizeof(edge *), "Cannot allocate head edge");  // line 123
    ==14908== 8 bytes in 1 blocks are definitely lost in loss record 3 of 8
    ==14908==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==14908==    by 0x40F097: calloc_and_check (utils.c:278)
    ==14908==    by 0x40C794: main._omp_fn.1 (main.c:123)
    ==14908==    by 0x514ECBE: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==14908==    by 0x40BBCC: main (main.c:113)
    ==14908== 
    
    // edge_array_current = calloc_and_check(nthreads, sizeof(edge *), "Cannot allocate current edge"); // line 124
    ==14908== 8 bytes in 1 blocks are definitely lost in loss record 4 of 8
    ==14908==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==14908==    by 0x40F097: calloc_and_check (utils.c:278)
    ==14908==    by 0x40C7BA: main._omp_fn.1 (main.c:124)
    ==14908==    by 0x514ECBE: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==14908==    by 0x40BBCC: main (main.c:113)
    ==14908== 
    
    // step_response(current_event, risetime, config->maxiters, config->cusum_minstep);  // line 332 in main
    // time = calloc_and_check(length, sizeof(double), "cannot allocate stepfit time array"); // line 83 in stepfit.c
    // you do have free(time); but this can be missed by the early return at line 137
    ==14908== 116,320 bytes in 20 blocks are definitely lost in loss record 7 of 8
    ==14908==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==14908==    by 0x40F097: calloc_and_check (utils.c:278)
    ==14908==    by 0x40D855: step_response (stepfit.c:83)
    ==14908==    by 0x40D0FD: main._omp_fn.3 (main.c:332)
    ==14908==    by 0x514ECBE: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==14908==    by 0x40C171: main (main.c:276)
    ==14908== 
    
    // cusum(current_event, config->cusum_delta, config->cusum_min_threshold, config->cusum_max_threshold, config->subevent_minpoints); // line 330 in main
    // cpos = calloc_and_check(length, sizeof(double),"Cannot allocate cpos");  // line 365 in detector
    // line 427 in detector has 3 free's, but misses cpos!
    ==14908== 5,931,520 bytes in 419 blocks are definitely lost in loss record 8 of 8
    ==14908==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==14908==    by 0x40F097: calloc_and_check (utils.c:278)
    ==14908==    by 0x403864: cusum (detector.c:365)
    ==14908==    by 0x40D07B: main._omp_fn.3 (main.c:330)
    ==14908==    by 0x514ECBE: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==14908==    by 0x40C171: main (main.c:276)
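    The usual way to make the early-return variety of leak impossible is a single cleanup label at the bottom of the function, something like this (names invented for the sketch, not your actual code):
    Code:
    #include <stdlib.h>

    /* Sketch of the single-exit cleanup pattern, with invented names. */
    int fit_step(int length)
    {
        int status = 0;
        double *time = calloc(length, sizeof *time);
        if (time == NULL)
            return -1;

        if (length < 2) {               /* the kind of early bail-out that leaked */
            status = -2;
            goto cleanup;               /* jump to the one place that frees */
        }

        /* ... do the actual fitting work here ... */

    cleanup:
        free(time);                     /* every exit path passes through here */
        return status;
    }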
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  10. #10
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    I have fixed and pushed all of those memory leaks, thanks. That last one is odd - if you look at the master branch in that spot you will see that cpos is free'd correctly. Must have been some careless editing at some point.

    I ran valgrind on my end on a 4-threaded run and picked up tons of memory leaks which don't make much sense to me. Some googling suggests that valgrind changes the scheduling of threads from a regular run, so is it possible that those memory errors are just artefacts of valgrind interfering with OpenMP? Currently running valgrind on the parallel branch with 1 thread to compare after fixing the memory leaks you pointed out.
    C is fun

  11. #11
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    A few more memory leaks, this time picked up using a 4-threaded run. I've been unable to figure out where the problems actually are, since the lines in main.c which valgrind indicates are problematic (39 and 61 in particular) don't make much sense to me. The last one (main.c 383 -> bessel.c 289) also doesn't make any sense to me. As far as I can tell, free_filter() free()s every possible pointer associated with the bessel struct.

    Code:
    ==5786== 
    ==5786== HEAP SUMMARY:
    ==5786==     in use at exit: 3,360 bytes in 8 blocks
    ==5786==   total heap usage: 42,880 allocs, 42,872 frees, 3,421,908,952 bytes allocated
    ==5786== 
    ==5786== 8 bytes in 1 blocks are still reachable in loss record 1 of 6
    ==5786==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==5786==    by 0x514B778: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x5154647: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x5149DE1: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x40104E9: call_init.part.0 (dl-init.c:72)
    ==5786==    by 0x40105FA: call_init (dl-init.c:30)
    ==5786==    by 0x40105FA: _dl_init (dl-init.c:120)
    ==5786==    by 0x4000CF9: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
    ==5786== 
    ==5786== 40 bytes in 1 blocks are still reachable in loss record 2 of 6
    ==5786==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==5786==    by 0x4C2FDEF: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==5786==    by 0x514B7C8: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x515341A: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x514ECB9: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x40B97E: main (main.c:61)
    ==5786== 
    ==5786== 176 bytes in 1 blocks are still reachable in loss record 3 of 6
    ==5786==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==5786==    by 0x514B778: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x5153A3A: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x514B9DC: omp_set_num_threads (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x40B896: main (main.c:39)
    ==5786== 
    ==5786== 192 bytes in 1 blocks are still reachable in loss record 4 of 6
    ==5786==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==5786==    by 0x514B778: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x5152ECA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x514ECB9: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x40B97E: main (main.c:61)
    ==5786== 
    ==5786== 864 bytes in 3 blocks are possibly lost in loss record 5 of 6
    ==5786==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==5786==    by 0x40136D4: allocate_dtv (dl-tls.c:322)
    ==5786==    by 0x40136D4: _dl_allocate_tls (dl-tls.c:539)
    ==5786==    by 0x536D2AE: allocate_stack (allocatestack.c:588)
    ==5786==    by 0x536D2AE: pthread_create@@GLIBC_2.2.5 (pthread_create.c:539)
    ==5786==    by 0x515299F: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x514ECB9: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x40B97E: main (main.c:61)
    ==5786== 
    ==5786== 2,080 bytes in 1 blocks are still reachable in loss record 6 of 6
    ==5786==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==5786==    by 0x514B778: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x5152489: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x514ECA5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
    ==5786==    by 0x402A92: free_filter (bessel.c:289)
    ==5786==    by 0x40C43E: main (main.c:383)
    ==5786== 
    ==5786== LEAK SUMMARY:
    ==5786==    definitely lost: 0 bytes in 0 blocks
    ==5786==    indirectly lost: 0 bytes in 0 blocks
    ==5786==      possibly lost: 864 bytes in 3 blocks
    ==5786==    still reachable: 2,496 bytes in 5 blocks
    ==5786==         suppressed: 0 bytes in 0 blocks
    ==5786== 
    ==5786== For counts of detected and suppressed errors, rerun with: -v
    ==5786== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
    Any ideas where those are coming from?
    C is fun

  12. #12
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    All those leaks are in the various libraries you use, for example GOMP_parallel()

    Leaks which have calloc_and_check() in the call chain are the ones you need to worry about, and all of those are now fixed.

    Since they all seem to be one-off reports, it suggests a library creating a table up front to store resources it allocates later on. If it were happening on a per-use basis, you would be seeing a lot more leaks.

    > I've been unable to figure out where the problems actually are, since the lines in main.c which valgrind indicates are problematic (39 and 61 in particular)
    Line 39 is the first call to the library
    Line 61 is the first use of a pragma
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  13. #13
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    So, basically, out of my control. Good to know.

    Thanks for your help fixing the memory issues. There's still the main bug to hunt down, but it's in a much better place to be able to diagnose than it was a week ago. I'll update if I find anything, otherwise I may step back and refactor the serial version before trying again. Obviously main.c needs a cleanup, do you have any other style/best practices advice based on the code you've seen that I should address?
    C is fun

  14. #14
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Update on the results.
    I'm now seeing a difference between the master and parallel branch code, which I think is in line with what you're seeing.

    Code:
    $ diff results results_1
    Common subdirectories: results/events and results_1/events
    diff results/rate.csv results_1/rate.csv
    1375c1375
    < 1373,9,224.901846,224.901865
    ---
    > 1373,8,224.901846,224.901865
    3154c3154
    < 3152,9,570.680001,570.680065
    ---
    > 3152,8,570.680001,570.680065
    diff results/summary.txt results_1/summary.txt
    1c1
    < Using CUSUM version 3.1.3
    ---
    > Using CUSUM version 3.1.3p
    94,95c94,95
    < 8		7	0.181 %
    < 9		20	0.519 %
    ---
    > 8		9	0.233 %
    > 9		18	0.467 %
    I also repeated the tests with 1 to 8 threads and got 8 identical sets of results.
    For me at the moment, the number of threads doesn't matter, but threaded is very slightly different to non-threaded.

    I also got the very same leak report as your post #11 when I added more leak flags to valgrind.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  15. #15
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    Yes, that's what I am seeing as well. I am rerunning the threaded comparisons (on a much bigger data set, to amplify errors if they are present) with the most recent code to see if the number of threads no longer matters for me. I will update later today with results.
    Last edited by KBriggs; 02-28-2017 at 08:21 AM.
    C is fun
