-
debugging
I'm stumped by a debugging problem; maybe you can help?
Here is a good description of the problem without code:
Process one creates a file of records. Each record has a record type and a size, and some records also carry variable-length data. The records are written out to a compressed stream using zlib (gzopen, gzwrite).
Another process transmits the file to another platform (totally different hardware and OS).
The third process is the one with the bug. It opens the file using gzopen and processes the records. However, occasionally it gets an unknown record type, i.e. one that is not defined and has a large value like 115 or 101, when there are only 9 distinct record types (1-9).
What is strange is that after a restart of the process it is able to read and interpret the file correctly.
So that leads me to believe the file is intact and something else is happening internally.
I've run the code through Valgrind, Helgrind, etc. with nothing to report. The process itself is multithreaded.
So, any ideas on how to find what's causing this?
Thanks Ken
-
You need a debugger. Step through the code that reads the file.
Look at variables and see if they reflect information that you would expect them to.
Look to make sure you're at the correct offset in the file, etc.
-
How are the file writing and reading synchronized? This sounds a bit like a synchronization bug. Can you repeat the sequence leading up to the bug? If so, does it always occur at the same point? Is the garbage value always the same? Does the bug occur more often if you generate some background load on the machine where the reading occurs?
-
The garbage value is either 101 or 115.
It generally happens early in the file, and sometimes in the same location.
But once the process stops and I restart it, it processes the file fine.
Yes, I've used a debugger on the process and everything looks OK, except that the record type is bad and everything else in the structure is bad too.
Yes it seems that if there is more background load that the error will happen more often.
Any ideas on how to track this down?
I really suspect some form of heap corruption, since the variable is declared static.
-
What about all the basics, like checking ALL status returns and such like?
If some file read function returns a length, do you blindly assume success, or do you actually use the length?
-
If you know which data is getting corrupted, then it's time to find the source of the corruption.
Use a memory breakpoint to break whenever the data gets "corrupted". You can, for example, use a known "good" value for debugging purposes and break if you get another value inside a member of the struct.
-
Thanks for your input everyone.
I think I've found at least part of the issue: a failed memory allocation. But after looking at the code in a debugger, I found that an endian conversion is possibly failing, or failing to execute, causing a very, very long length for the read.
I'm in the process of adding code to check for the error, and code to allow a breakpoint call once a memory error occurs at that location.
But I'm still unclear on why the conversion from 0x48000000 to 0x00000048 never occurred in the first place. I'll keep you posted on my findings.
Ken
-
Also, always do some extra-careful checking of synchronization if you have static variables in use.
Re-entrancy may easily bite you when static vars are used in multithreaded apps.