Toughest bug/war stories


  1. #1
    Registered User jeffcobb's Avatar
    Join Date
    Dec 2009
    Location
    Henderson, NV
    Posts
    875

    Toughest bug/war stories

    This is an offshoot of the previous question but what is the toughest bug that you yourself personally found and eliminated? How did it manifest and how did you find it? If the solution was interesting, add that too....

    Without thinking overly much on it one memorable bug was when I was working for this medical insurance billing firm in the early 90's. Once a month they had to print invoices and without fail after a few hundred thousand invoices it would crash the system. My team and I looked at the code for a week (it was a non-trivial amount of code) and could find no answers for the crash; it was all just math, no fooling with pointer math or intense system calls (Not that it matters but this was on Win31).....turned out the HP laser printer driver had a memory leak in it which was only revealed when we switched the printing of invoices to a dot-matrix printer (and the crashes went away). Printing single-document jobs of course would reveal no bad behavior but after a lot of prints the system would crash in unexpected ways. Because it was a printer that was designed to print in high volumes, swapping it out for another of a similar type was out of the question. While the finding of the bug wasn't true C++, the nature of the crash made us look hard at our own code thinking the issue was there....

    Another more code-oriented bug was one which was plaguing a firm I once worked for that ran a mixed environment (Win31, DOS, OS/2). Symptoms were that occasionally when an OS/2 windowed app was started and then later closed, it would also close/kill some other seemingly random application, even if it was a DOS or Windows app. The problem was a bug in the windowing library that was used (XVT): through some brain-dead logic, when it created a child window from a parent, it would create the child and then store the ID of what it assumed was the window just created, namely whichever window had focus. For those who don't see where this is heading: if the user started this app and, before it finished, started any other app in the mixed environment, the parent of the app using XVT would attach itself to a random child window (whatever had focus, under the mistaken idea that since it had just created a window, the child with focus was its own). Thus, when the parent window/app was closed it would take out the child window it had attached itself to, regardless of the child window's true parentage. Getting these symptoms reported by non-computer folk muddied the waters even further. That one was the source of a great deal of profanity....
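
    A minimal sketch of that failure mode, not XVT's actual code: `WindowSystem`, `adopt_child_buggy` and `adopt_child_fixed` are hypothetical names made up for illustration. The buggy version asks "who has focus?" after creating the child, the safer one keeps the ID returned by the creation call itself.

```cpp
#include <cassert>

// Hypothetical stand-in for a window system shared by several apps.
struct WindowSystem {
    int next_id = 1;
    int focused = 0;   // ID of the window that currently has focus

    // Creates a window, gives it focus, and returns its ID.
    int create_window() {
        focused = next_id++;
        return focused;
    }
};

// Buggy pattern: create the child, then look up the focused window later.
// If another app created a window in between, the wrong ID is recorded.
int adopt_child_buggy(WindowSystem& ws, bool other_app_starts) {
    ws.create_window();                         // our child
    if (other_app_starts) ws.create_window();   // unrelated app takes focus
    return ws.focused;                          // may not be our child
}

// Safer pattern: remember the ID handed back by the creation call.
int adopt_child_fixed(WindowSystem& ws, bool other_app_starts) {
    int child = ws.create_window();
    assert(child != 0);                         // creation is assumed to succeed
    if (other_app_starts) ws.create_window();
    return child;
}
```

    With no other app starting in between, both versions record window 1. When a second window appears before the lookup, the buggy version records window 2, which belongs to someone else.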


    I have more but that was one of the trickier ones to catch.....what is your personal best/worst?

    Peace
    C/C++ Environment: GNU CC/Emacs
    Make system: CMake
    Debuggers: Valgrind/GDB

  2. #2
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,189
    Toughest bug I ever faced ....
    [attached image]
    Until you can build a working general purpose reprogrammable computer out of basic components from radio shack, you are not fit to call yourself a programmer in my presence. This is cwhizard, signing off.

  3. #3
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,699
    SQL-related bug. Not. Fun.

  4. #4
    Captain Crash brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,239
    I'll bite. Back when I was working on one of the world's most successful PCL imaging and viewing tools, we were beginning to add support for Asian languages. One of our customers, Fuji Xerox, had agreed to help us test our first beta for the new language support. They were using the tool to bring up large invoice print jobs (hundreds of thousands of pages) and they were complaining that a certain piece of text in Chinese glyphs was being substituted by some unrelated piece of text on another page. (Yes, I know that Fuji Xerox is Japanese, but they were working with a Chinese client.)

    There were multiple challenges dealing with that, the biggest of which was my inability to "see" the problem because I couldn't read Chinese. To me, the document was thousands upon thousands of pages of indecipherable symbols. To a native Chinese reader, it was obvious that on page 99512, the word "Customer" had been replaced by the word "Feature" but I was clueless. Even more bizarre, if you went to page 99513 and then paged back, the page was correct. But if you went to page 99511 and paged forward, it was wrong.

    Normally, in such a situation, I would simply extract the contents of page 99512 using our software, and having narrowed the problem to a single page, begin working on it. Bizarrely, when I tried to do this, the resulting extracted page was correct again. In other words, the problem only manifested when the ENTIRE document was processed, and then, only if you paged through it in a particular sequence. One of the most powerful tools in my tool set wasn't working for me.

    Looking back on it now, it should have been obvious. Instead, it took two weeks, but I tracked the problem to an obscure bug in the page resource caching mechanism which handled things like font definitions which carry over from previous pages. This layer of code was well tested, or so I thought. But this document was doing something extremely strange -- it was redefining the same font, on each page, using the same font ID, but with the glyphs in different sequences. A simple logic error was causing the wrong font information to be retrieved from the cache, but the exact way in which it went wrong depended on what font was previously cached at that ID -- and this depended on the sequence in which the pages were visited.
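
    To illustrate the general shape of such a bug (the actual logic error is not described here, so this is only a guess at the category): a minimal sketch in which `glyph_table`, `StaleFontCache` and `FontCache` are hypothetical names. A cache keyed by font ID alone returns whatever definition was seen first, so the result depends on the order in which pages are visited.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical model: every page redefines font ID 1 with its own glyph order.
// glyph_table(page) stands in for parsing that page's font definition.
std::string glyph_table(int page) {
    return page % 2 == 0 ? "AB" : "BA";
}

// Buggy cache: keyed by font ID only, so the first page visited decides
// what every later page gets back.
struct StaleFontCache {
    std::map<int, std::string> by_id;
    const std::string& get(int font_id, int page) {
        auto it = by_id.find(font_id);
        if (it == by_id.end())
            it = by_id.emplace(font_id, glyph_table(page)).first;
        return it->second;
    }
};

// Safer cache: the key also records which definition (here, which page)
// the entry came from, so a redefinition never hits a stale entry.
struct FontCache {
    std::map<std::pair<int, int>, std::string> by_id_and_page;
    const std::string& get(int font_id, int page) {
        auto key = std::make_pair(font_id, page);
        auto it = by_id_and_page.find(key);
        if (it == by_id_and_page.end())
            it = by_id_and_page.emplace(key, glyph_table(page)).first;
        return it->second;
    }
};
```

    In this model, looking at page 2 on its own gives the right glyphs, while visiting page 1 first and then page 2 gives page 1's glyph order. That mirrors why extracting the single page made the problem vanish.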

    If I remember, the fix was a one-liner. Xerox was happy, and we continued forward with our Asian language support.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  5. #5
    Super Moderator VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,596
    Stack overflow bug caused by mis-matched calling conventions. It was hard to debug b/c it was easily missed in the source. Had to dig down into the assembly code to figure out what was going on.

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    Lack of real-world experience makes mine not very impressive. But whatever.

    Toughest error I've ever had to catch was very early in my programming days. A private program kept crashing. I despaired of debugging it and actually gave up the project over it. Months later, I returned to it, and found that VC++6 had miscompiled the thing, placing two global variables in the same memory location. Since then, whenever facing a non-trivial bug, first thing I do is force a clean recompile.

    Had a very tough problem recently that took me days to track down. It's easy to reproduce.
    1) Create a Qt4 program that you can drag files from:
    Code:
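    // Assumes this runs inside a QWidget member function, so `this` is the drag source.
    QMimeData *data = new QMimeData();
    QList<QUrl> urls;
    urls += QUrl::fromLocalFile(filename);
    data->setUrls(urls);
    QDrag *drag = new QDrag(this);   // QDrag needs a drag-source widget
    drag->setMimeData(data);         // QDrag takes ownership of the mime data
    drag->exec();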
    2) Drag from this program to Thunderbird's Compose window.
    3) Figure out why the name of the attachment is "Attached Object Part".

    Thank God for open source, is all I can say.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #7
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,046
    I seem to spend a long time on debugging elusive bugs every once in a while. I've never really had any bugs that took longer than about six hours to track down -- if it takes any longer I usually just end up re-writing the offending code. Problem fixed.

    ----

    A recent very annoying bug that I can think of: I was creating an OpenGL fractal program. It was working very well until I decided it would be neat to have the fractal generation run in a separate thread, so that my Qt GUI would continue to be responsive for the (sometimes very long) time it takes to generate a fractal. So I moved the relevant part of the code into a separate thread, and of course, it starts crashing randomly.

    At least from the beginning I knew it was a problem with my multithreading. (With a change of a preprocessor directive I was able to switch back to a working, but non-threaded, version.) Still, I couldn't figure out what was going wrong. I discarded my original pthread implementation, and re-wrote it using QThread -- no difference (hardly surprising, since I'm fairly sure Qt uses pthreads on that system).

    Then I tried running the program on a different system, and it worked perfectly. I wasn't able to reproduce the crash (even though it would take only about four or five seconds to crash on the original system.) I didn't have access to the original system any more, but it didn't take me long to conjecture that since the original system had two cores and the new one had one, that an extra core was part of the problem. I spent a very long time trying to figure out what would cause my program to crash on multiple cores but run just fine on one. At one point I was convinced that the system was messing up the synchronization of cached memory between one core and the other. But the problem was in my code, of course . . . .

    For efficiency, I wasn't using any mutexes or other similar structures. It shouldn't have been a problem, because only one thread was writing to the shared data, and the other was very circumspect in its actions. Nevertheless, I was well aware of the potential for disaster here, and as a result, I carefully combed through the three or four lines where shared data was accessed from a separate thread. I couldn't see anything wrong.

    Eventually I used trial and error to figure out exactly the line that was causing the problem. The main thread was occasionally querying the size of the fractal thread's vector to see how far along it was. It turns out that calling vector.size() isn't an atomic operation (if you look at the source you'll see that it computes end() - begin(), on that gcc at least). So every once in a while, end() would execute on one core, the vector would be resized by the other core, and begin() would execute on the original core. Instant segmentation fault.
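
    One way to avoid that kind of race is to stop querying the container from the other thread and publish progress through an atomic counter instead. A minimal sketch, assuming C++11 `std::atomic` and `std::thread` (which were not available when this program was written) and a buffer that is sized up front rather than grown; `FractalJob` and `run_and_poll` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// The worker fills a pre-sized buffer and publishes how many elements are ready.
struct FractalJob {
    std::vector<double> points;
    std::atomic<std::size_t> ready{0};   // the only member the GUI thread reads

    explicit FractalJob(std::size_t n) : points(n) {}

    // Runs on the worker thread; the vector is never resized here.
    void run() {
        for (std::size_t i = 0; i < points.size(); ++i) {
            points[i] = static_cast<double>(i) * 0.5;   // placeholder computation
            ready.store(i + 1, std::memory_order_release);
        }
    }

    // Safe to call from another thread while run() is executing.
    std::size_t progress() const {
        return ready.load(std::memory_order_acquire);
    }
};

// Starts the worker and polls its progress until all n elements are done.
std::size_t run_and_poll(std::size_t n) {
    FractalJob job(n);
    std::thread worker(&FractalJob::run, &job);
    std::size_t last = 0;
    while (last < n) {
        std::size_t now = job.progress();
        assert(now >= last);   // progress never goes backwards
        last = now;
    }
    worker.join();
    return last;
}
```

    The polling thread only ever touches `ready`, so it cannot observe the vector's internals in a half-updated state. If the buffer really has to grow while it is being watched, a mutex around both the growth and the query is the straightforward alternative.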

    I'm not sure why this never happened on the other single-core system. I mean, I was generating vectors with literally billions of elements (I blew through 3GB of memory in no time), so you'd think a context switch would occur at the wrong time at some point. But nope. I guess it goes to show how fickle multithreaded bugs can be.

    ----

    Here's another bug that took even longer for me to crack. A while earlier, I had been writing a graphical program for DOS (not mode 13h, but still, the same sort of idea). Having by now come to appreciate Linux and a modern development environment, I wrote an SDL layer around the DOS code and did all the development in Linux. It made me a lot more productive. Anyway, it meant that by the time I started writing a DOS mouse driver for the DOS version, I had thousands of lines of (C) code, a full windowing GUI system, and it was in general not very easy to debug.

    For efficiency, I had written the DOS mouse driver in assembly. When a mouse interrupt arrived, it would call my assembly driver, and the data would be placed in a circular queue where the C code would eventually get around to dealing with it. I also had a fast double-buffering routine written in assembly (using movsd to transfer double-words to the frame buffer was probably more efficient than the equivalent C).
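
    For readers who have not seen the pattern, the circular queue between an interrupt handler and the main loop can be sketched roughly as below. This is a portable modern C++ illustration, not the original assembly; `MouseEvent` and `EventQueue` are hypothetical names, and the atomics stand in for whatever interrupt-safe access the real driver used.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

struct MouseEvent { int dx, dy, buttons; };

// Single-producer (interrupt handler), single-consumer (main loop) ring buffer.
// One slot is always left empty so that head == tail means "empty".
template <std::size_t N>
class EventQueue {
    static_assert(N >= 2, "need at least one usable slot");
    std::array<MouseEvent, N> slots_{};
    std::atomic<std::size_t> head_{0};   // next slot to read  (consumer side)
    std::atomic<std::size_t> tail_{0};   // next slot to write (producer side)
public:
    // Called by the producer; returns false and drops the event if full.
    bool push(const MouseEvent& e) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[t] = e;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    // Called by the consumer; returns false if nothing is pending.
    bool pop(MouseEvent& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = slots_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

    The handler only ever writes `tail_` and the main loop only ever writes `head_`, which is what lets the two sides run without a lock. Dropping events when the queue is full is a design choice for this sketch; overwriting the oldest event would be another.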

    When the program was launched, it would work fine: the graphics were displayed, etc. But if you moved the mouse cursor around for a while, it would crash (sometimes with the keyboard working afterwards, usually not), or just hang (as if interrupts had been disabled).

    To make matters worse, the mouse driver worked just fine on most of my emulated DOS systems. I was emulating Windows 95 in qemu, as well as straight DOS, and FreeDOS (in qemu and natively in Linux). Strangely, FreeDOS worked but actual DOS did not; but that was because the mouse didn't work at all, in any application, not because of my code. Anyway, the strange bug didn't manifest itself except on an actual DOS system, so I was forced to do my debugging there.

    Debugging on this DOS system was a pain. GDB didn't work, so the only reliable debugging method was to log to files [printing text didn't work in this screen mode]. (Of course, fopen() was using the same DOS interrupt as the mouse driver, so that was another whole can of worms.) Every time the program crashed, a full reboot was necessary. The compile-run cycle probably took at least five minutes.

    Anyway, after perhaps eight hours of debugging, I found the problem. Turns out that the double-buffering routine was changing a few segment registers (for movsd), and when the mouse driver interrupt happened to execute in the middle of this routine, it would use the wrong segment register and crash.

    It was a pretty simple bug, but that wasn't what sticks in my mind. It was the sheer difficulty of debugging the code that was a nightmare. A difficult to reproduce problem, on a system without a debugger; the only way to report information being log files, which would sometimes introduce their own crashes due to interrupt collisions; a system that was so slow that just to change one line and test it would take five minutes at least. Aargh!

    I'm really, really glad I have Linux. And GDB. And grep. And a sane system with the tools I need to debug my programs properly. DOS is no fun.

    ----

    Enough of my stories for now. Hopefully they were entertaining.
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  8. #8
    train spotter
    Join Date
    Aug 2001
    Location
    near a computer
    Posts
    3,856
    I had an error where one of the trackside bearing acoustic scanners kept generating static. The static was worse at night.

    We tried everything we could remotely but eventually had to send someone to the site to replace the mic (3 day round trip).

    Turned out a big (3 cm) wasp had made its nest in the mic and the train passing made it mad (so it flapped its wings causing the buzz).


    I also found a bug in [name removed] locos' Automated Train Protection system. I was reverse engineering the file formats and could not tell why my code generated random low values.

    I confirmed the bug with data from a loco without my monitoring device.

    The locos are used in pairs or triples, controlled by the lead loco. In one case the lead loco asked the slave to confirm the speed it was doing, the slave replied 1 km/h, the lead said 'no we are doing 80 km/h' and applied the penalty brakes (which costs ~AU$5k and can only be removed with proper authority; meanwhile no trains can pass).

    I found that periodically (1-2 times per month) the ATP 'forgets' (resets) and returns to a default state. This lasts for ~20 sec but during this time the loco returns 0 or 1 for most values (i.e. speed, oil temp, revs etc).

    I spent six months emailing [name removed] engineers before they finally declared that I was wrong, the fault was in my software as theirs showed no fault at the times I listed.
    I pointed out that their viewer was not daylight saving compliant and they should look an hour earlier than the times in my viewer.

    [name removed] then refused to work with me ever again....

    My client asked me to test the patch. [name removed] had reduced the time the bug lasted for from ~20 sec to 2-6 sec but increased the frequency to 2-3 times per day.
    "Man alone suffers so excruciatingly in the world that he was compelled to invent laughter."
    Friedrich Nietzsche

    "I spent a lot of my money on booze, birds and fast cars......the rest I squandered."
    George Best

    "If you are going through hell....keep going."
    Winston Churchill

  9. #9
    & the hat of GPL slaying Thantos's Avatar
    Join Date
    Sep 2001
    Posts
    5,681
    The hardest bugs I've been facing at my new job aren't so much difficult to locate as difficult to fix. Mainly because the bug was due to an assumption that was valid for quite a while but recently became invalid. So to fix it I have to go through all of our files and try to find the spots that use that assumption and fix them.

  10. #10
    Super Moderator VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,596
    Quote Originally Posted by Thantos
    The hardest bugs I've been facing at my new job aren't so much difficult to locate as difficult to fix. Mainly because the bug was due to an assumption that was valid for quite a while but recently became invalid. So to fix it I have to go through all of our files and try to find the spots that use that assumption and fix them.
    And those are the hardest bugs because more bugs have been introduced due to the fact that they relied on the bug you are fixing. At that point I start creating new bug reports for each and every section of code that relied on the bug I am fixing. You could literally cross several departments with a bug like that. Not to mention that some of the code that relied on the old bug may be foreign to you and trying to fix it could introduce even more bugs since you do not understand exactly what is going on in the code.

  11. #11
    Captain Crash brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,239
    Quote Originally Posted by Bubba
    And those are the hardest bugs because more bugs have been introduced due to the fact that they relied on the bug you are fixing. At that point I start creating new bug reports for each and every section of code that relied on the bug I am fixing. You could literally cross several departments with a bug like that. Not to mention that some of the code that relied on the old bug may be foreign to you and trying to fix it could introduce even more bugs since you do not understand exactly what is going on in the code.
    Undocumented assumptions can be extremely difficult to untangle, especially when the code was written long ago and hasn't changed since it was first created. Any condition that, if it were false, would cause incorrect behavior, should be guarded with some kind of condition checking mechanism like an assert.

    Problem is, asserts are normally removed for release, and software rarely undergoes full regression testing in debug mode. This leaves your customers to be the ones to experience the full effect of the bugs that occur when the assumptions are violated.
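
    One common answer is a check that stays compiled in regardless of build mode. A minimal sketch, assuming that throwing an exception is an acceptable failure mode for the application; `require` and `average_amount` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: unlike assert(), this check does not depend on NDEBUG,
// so it is still present in release builds.
inline void require(bool condition, const char* what) {
    if (!condition)
        throw std::logic_error(std::string("assumption violated: ") + what);
}

// Example of guarding assumptions: the batch pointer is valid and non-empty.
double average_amount(const double* amounts, int count) {
    require(amounts != nullptr, "amounts != nullptr");
    require(count > 0, "count > 0");
    double sum = 0.0;
    for (int i = 0; i < count; ++i) sum += amounts[i];
    return sum / count;
}
```

    Calling `average_amount` with a count of zero throws instead of silently dividing by zero, in debug and release alike. The trade-off is a small runtime cost on every call, which is usually the reason asserts get stripped in the first place.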

  12. #12
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Portugal
    Posts
    7,438
    I don't remember when, but I discussed it in a different context before. Probably the toughest bug I came up against was something related to the code below. I can't quite remember the details of how I came across it. All I retained was the note I added to OneNote:

    Code:
    while (i < array_length)
        foo[i] = foo[i++];
    Undefined behavior is a b...
    Last edited by Mario F.; 12-28-2009 at 07:46 AM.
    The programmer’s wife tells him: “Run to the store and pick up a loaf of bread. If they have eggs, get a dozen.”
    The programmer comes home with 12 loaves of bread.


    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  13. #13
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,189
    and that stub basically does ... nothing???

    i++ doesn't increment i until after evaluation, so it would basically step through the array setting each value to itself, was it a timing loop? or was foo some class with an overloaded = operator?

    or was the bug the fact that it didn't move the array values towards the beginning, as it would if it used ++i ?

  14. #14
    Captain Crash brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,239
    Quote Originally Posted by abachler
    and that stub basically does ... nothing???
    It does something undefined. The post-increment happens some point after the evaluation of i on the right-hand-side. It is unknown whether it happens before or after the assignment, however.

    Modifying a variable with an increment/decrement operator and also reading it elsewhere in the same expression, with no sequence point in between, is explicitly undefined and one of the classic examples of undefined behavior.
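
    The usual cure is to move the increment into its own statement, so the result no longer depends on evaluation order. The intent of the original loop is not known, so the sketch below shows two possible readings; `walk` and `shift_left` are hypothetical names. (As far as I know, C++17 later defined the order for assignment, evaluating the right-hand side first, but C and earlier C++ standards leave the original expression undefined.)

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reading A: leave each element as it is and simply advance the index.
std::size_t walk(std::vector<int>& foo) {
    std::size_t i = 0;
    while (i < foo.size()) {
        foo[i] = foo[i];   // self-assignment, kept only to mirror the original
        ++i;               // increment in its own statement
    }
    return i;
}

// Reading B: shift every element one slot toward the front.
// The last element keeps its old value.
void shift_left(std::vector<int>& foo) {
    for (std::size_t i = 0; i + 1 < foo.size(); ++i)
        foo[i] = foo[i + 1];
}
```

    Either version behaves the same on every conforming compiler, because each statement modifies or reads `i` in a well-defined order.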

  15. #15
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,189
    No, that part is not undefined. The value of the index must be calculated before the assignment, as the assignment depends on the data at the memory location, hence the pointer must be resolved prior to assignment. The index into the array must be calculated before the pointer can be resolved, and i++ specifically uses the value prior to incrementation as the evaluated value, hence by the standard it must resolve to foo[i]. If it produces variant behavior then the implementation is non-compliant.

    What would be undefined however is foo[i] = foo[i++] + i, as there is no guarantee in which order the variables foo[i++] and i are evaluated, and hence it may add either i or i+1.
    Last edited by abachler; 12-28-2009 at 02:32 PM.
