Debugging a rare / unreproducible bug



g4j31a5
07-21-2008, 09:44 PM
Dunno where else to ask this because it's not quite a programming question. Like the title says, how do I debug a rare / unreproducible bug? My application is behaving strangely. Sometimes it crashes back to Windows and displays the usual error message with the send / don't send buttons (you know what I'm talking about), yet sometimes it runs normally. The application needs to run for at least one whole day, so it also has to be robust. And there's one more bug that shows up when I test it for a whole day: the screen goes black. I'm quite sure that it's not a screen saver / hardware safe mode issue, because when I press CTRL+ALT+DEL the task manager shows up. So, can anybody help me here? How do you usually track down a bug that happens randomly like this? How do you usually test an application for robustness? Thanks in advance.

@nthony
07-21-2008, 10:01 PM
Code review. It's probably a buffer overrun or a null pointer dereference somewhere. Some may suggest a debugger, but I don't think you're at that point yet. Instead, try looking through your code for "off-by-one" errors, bad loop conditions, etc. Go through all the major constructs and functions and ask yourself "under what circumstances can this possibly go wrong?". And short of random bit-flipping caused by cosmic rays, try to code for every contingency.
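
As an illustration of the kind of thing to look for in a review, here is a minimal sketch of a loop with the classic off-by-one hazard. The function name sum_values is made up for this example and is not from the original poster's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sums the first `count` elements of `values`.
   The classic off-by-one bug here is writing `i <= count`, which reads
   one element past the end of the array. */
static long sum_values(const int *values, size_t count)
{
    long total = 0;
    size_t i;

    /* Code for the contingency: a NULL array is only valid when empty. */
    assert(values != NULL || count == 0);

    for (i = 0; i < count; i++)   /* '<', not '<=' */
        total += values[i];

    return total;
}
```

The out-of-bounds version often appears to work, because the extra element usually happens to be readable memory, which is exactly why this kind of bug can stay hidden for a long time.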

g4j31a5
07-21-2008, 10:19 PM
Have done that but can't see anything weird. BTW, one more thing I don't get: the conditions are exactly the same whether it crashes or not. There's no input from any external source, not even the keyboard. It's just the application running its own routines all day long, at least for now. Maybe the problem is a null pointer dereference like you said, but I don't know which one, because object creation / deletion is automated by the application itself with some sort of scheduler.

Salem
07-21-2008, 11:44 PM
Is this the debug build or the release build?

Even if it's the release build, compile it with debug information.

Install WinDbg (http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx) and just run the program from within the debugger (no breakpoints or anything, just run it).
If it does crash, at least you'll find yourself inside the debugger, and not at a meaningless dialog going nowhere.

jEssYcAt
07-21-2008, 11:47 PM
Make sure you are setting pointers to NULL once you free the memory they point to. If you do int *ptr = malloc(MAGIC_NUMBER); then later call free(ptr); and then dereference ptr before setting it to NULL or re-assigning it to some other chunk of memory, you are likely to experience the exact problem you are describing. ptr is a dangling pointer once you free it and before you re-assign it.

Once you call free, anything can happen to that block of memory, from nothing to the operating system reclaiming it for another process. If you get "lucky" it will be left alone, even after additional calls to malloc or new. For instance, malloc's algorithm could be skipping this block for some reason, so every time you dereference that dangling pointer, you happen to be referring to the old value and everything seems to work.

Later (the next few cycles, an hour later, however long it takes for you to call malloc enough that it arrives back at that particular block of memory) the memory is finally overwritten with something else. Now when you dereference ptr, it might give weird results, or it might cause a seg fault, or it might continue operating with no obvious effect. Since you have no way to know what the memory block was overwritten with, you have no way to know how or why it is acting that way (unless you use a debugger and look at the value of the block ptr is pointing to).

So, while you don't see any difference between the first run and the second, the algorithm malloc is using might be taking different paths, or the operating system might be shuffling memory around and that dangling pointer is left out of the loop since the memory it is pointing to is technically marked as available, etc.
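
A minimal sketch of the "set it to NULL after free" habit described above. The helper names free_and_null and read_or_default are hypothetical, invented for this example; they are not part of any library.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: frees the block and nulls the caller's pointer
   in one step, so a later use sees NULL instead of recycled memory. */
static void free_and_null(int **pp)
{
    assert(pp != NULL);   /* the helper needs a valid pointer-to-pointer */
    free(*pp);            /* free(NULL) is a no-op, so repeat calls are safe */
    *pp = NULL;
}

/* Guarded access: returns `fallback` if the pointer was already released. */
static int read_or_default(const int *ptr, int fallback)
{
    return ptr != NULL ? *ptr : fallback;
}
```

This only protects the one pointer variable you pass in. Any other copies of the same address elsewhere in the program are still dangling, so it is a mitigation rather than a cure.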

CornedBee
07-22-2008, 04:57 AM
Also, if you have access to a code linter, run it over the program and pay attention to its warnings.

matsp
07-22-2008, 05:16 AM
Adding logging to the code would also help, to show what the program is doing. Even if the log doesn't show where it actually goes wrong or why, it will be very helpful for understanding the steps needed to reproduce the problem. Recording each user action and/or data input would be useful: perhaps you will then notice something that differs between the crashing and non-crashing runs of the same steps.
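
A rough sketch of what such logging could look like in C. The function name log_event and the line format are assumptions made for this example, not something from the original poster's code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical logging helper: writes "[unix-time] message" on one line.
   Returns the number of characters in the formatted message itself. */
static int log_event(FILE *log, const char *fmt, ...)
{
    va_list args;
    int written;

    assert(log != NULL && fmt != NULL);

    fprintf(log, "[%ld] ", (long)time(NULL));

    va_start(args, fmt);
    written = vfprintf(log, fmt, args);
    va_end(args);

    fputc('\n', log);
    fflush(log);   /* flush now so the last line survives a crash */

    return written;
}
```

A call such as log_event(logfile, "deleted object %d", id) around each scheduled creation / deletion would leave a trail showing what ran last before a crash. The flush on every call costs some speed, but without it the final lines may still be sitting in a buffer when the program dies.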

--
Mats

medievalelks
07-22-2008, 06:48 AM
Have you run Bounds Checker or something similar on your program?

abachler
07-22-2008, 08:28 AM
Are you using threads?

Remember that under Windows, you have to call CloseHandle() on a thread's handle after the thread finishes, or you leak handles and eventually the OS will refuse to give you any more of them. A simple way to check if this is the case is to look in task manager under the performance tab and see if the handle count is gradually creeping up.

g4j31a5
07-22-2008, 08:44 PM
Lots of replies already. Thanks guys. I'll try to answer them all at once.

@Salem: It's the release build. Now that I think about it, the debug version won't run because it always shows an error. But the weird thing is, the release build doesn't show this particular error at all. FYI, the code that gives the error is actually from another programmer. He said it's fine as long as the release build doesn't show this error. And I just took his word for it.

@jEssYcAt: Maybe that's the problem. I actually have an idea who the culprit is (see the reply to Salem above). I admit, I haven't checked this code at all. I just assumed it worked based on what that programmer said. He also used this code for his own application, and it (seemed to be) working just fine.

@CornedBee: What's a code linter?

@matsp: Actually, that idea has occurred to me. But I didn't do it because I still need to do some other things.

@medievalelks: What's a Bounds Checker?

@abachler: Not that I know of. But I don't know if the other guy used a thread in his code.

Salem
07-22-2008, 11:11 PM
> He said it's fine as long as the release doesn't show this error.
He's an idiot!
If debug and release builds differ "AT ALL" in any respect except speed, then you've got a problem.

Sang-drax
07-23-2008, 02:24 AM
> Now that I think about it, the debug version won't run because it always shows an error.
That's an obvious place to start, isn't it?

matsp
07-23-2008, 02:30 AM
I'm with Sang-drax and Salem here: It is highly likely that the debug build is "correctly" pointing out something that is wrong, whilst the release build is missing it, and most of the time it's not making a lot of difference, but sometimes causes a crash. Typically, this is "out of bounds" on memory allocations or arrays.

--
Mats

g4j31a5
07-24-2008, 12:31 AM
Right, maybe I'll have a look at the code. Thanks.

dwks
07-24-2008, 11:11 AM
> @CornedBee: What's a code linter?
It's a program that examines your code, without compiling it or running it, and identifies bad practises or possible buffer overruns and that kind of thing. I've used splint before, and it works pretty well.
Wikipedia page: http://en.wikipedia.org/wiki/Splint_%28programming_tool%29
Download page: http://www.splint.org/
Windows download page: http://www.splint.org/win32.html


> @medievalelks: What's a Bounds Checker?
It's a program or a library that examines your program as it is run, detecting any buffer overruns or memory errors. The only one I've really used is Valgrind, which is fantastic, but unfortunately only runs under Linux. I've heard of Purify, Electric Fence, and dmalloc, but never tried any of them.
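
To give a feel for one technique such tools can use, here is a toy sketch of guard bytes placed after an allocation. Real checkers are far more thorough; the names guarded_alloc and guard_intact, and the fill value, are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 8
#define GUARD_BYTE 0xFD   /* arbitrary fill value chosen for this sketch */

/* Allocates `size` usable bytes followed by GUARD_SIZE guard bytes.
   Returns NULL if the allocation fails. */
static unsigned char *guarded_alloc(size_t size)
{
    unsigned char *block = malloc(size + GUARD_SIZE);

    if (block != NULL)
        memset(block + size, GUARD_BYTE, GUARD_SIZE);
    return block;
}

/* Returns 1 if the guard bytes are untouched, 0 if something wrote
   past the end of the usable region. */
static int guard_intact(const unsigned char *block, size_t size)
{
    size_t i;

    assert(block != NULL);
    for (i = 0; i < GUARD_SIZE; i++) {
        if (block[size + i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}
```

Checking guard_intact() just before each free() would flag a small overrun at the point the block is released. It does not say which line did the damage, which is where a tool that traps every access earns its keep.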

Elysia
07-24-2008, 01:00 PM
For a bounds checker, I really recommend using Visual Studio because it provides very nice debugging tools, including checks for overruns, both on the stack and the heap.
Another good idea when something goes wrong is to use Just-in-time (JIT) debugging, which can point you to the line of code that crashed the program. Visual Studio supports this.
Or you can attach a debugger to it and hope it crashes, at which point the debugger can catch the line it occurred at.

Anyway, as people have pointed out already, you should fix the debug errors first and they are very likely the problem.
As for debugging "unknown" problems, the first thing you have to do is figure out what type of error you get. Very common errors are double-free, using deleted pointers, buffer overruns and clobbering memory. Good debuggers and tools spot many of these common problems.

twomers
07-24-2008, 04:46 PM
>> Right, maybe I'll have a look at the code. Thanks.
For a program bug? Really? You sure that's the way to do it?

g4j31a5
07-27-2008, 08:20 PM
> It's a program or a library that examines your program as it is run, detecting any buffer overruns or memory errors. The only one I've really used is Valgrind, which is fantastic, but unfortunately only runs under Linux. I've heard of Purify, Electric Fence, and dmalloc, but never tried any of them.

AFAIK, valgrind is a profiler, CMIIW. I've used it before to check for a bottleneck when I was developing an application for Linux last year.

I've looked at Purify. It's commercial, right? Dunno if my boss would want to buy it. Is there an open source one?


>> Right, maybe I'll have a look at the code. Thanks.
> For a program bug? Really? You sure that's the way to do it?

Not really. But it's my only lead so far. :D

dwks
08-05-2008, 12:56 PM
> AFAIK, valgrind is a profiler, CMIIW. I've used it before to check for a bottleneck when I was developing an application for Linux last year.
Well, Valgrind is really an instrumentation framework that lets a program be executed in a sort of virtual machine, if you know what I mean. The most common use of this (that I can tell) is for memory error detection, since every malloc()/free() call and every memory access can be trapped and examined, but profiling tools are built on it as well.

Believe me, Valgrind can detect memory leaks and very strange memory errors. That's all I use it for, and I probably use it several times a week. :)


> I've looked at Purify. It's commercial, right? Dunno if my boss would want to buy it. Is there any open source one?
I believe you can download a trial or something. As Elysia said, though, you can just use Microsoft's built-in functions if you're using one of their compilers. http://www.codeproject.com/KB/applications/visualleakdetector.aspx

Other suggestions: http://www.thefreecountry.com/sourcecode/debugging.shtml