Thread: Importing Many Files, How to Skip "Bad" Lines?

  1. #1
    Registered User
    Join Date
    Apr 2010
    Location
    New Jersey
    Posts
    14

    Importing Many Files, How to Skip "Bad" Lines?

    Hi everyone,

    I'm a moderately experienced C++ programmer working on code which must do the following:
    (a) Import data from a lot of little files
    (b) Load that data into various objects
    (c) Do stuff with that data

    The code I've written does (a), (b), and (c) pretty well, but I've noticed a problem with (a), which I want to ask you guys about.

    Suppose I have 1000 source files. My program successfully processes Files #1 through #500. But when it reaches File #501, my program chokes and segfaults, and I lose ALL the data I've collected. This is a big problem, because there is a LOT of data to process. It may take me three or four hours just to reach File #500.

    When the program reads a file, each individual line is loaded into a string called Line, which is then parsed for individual values. If I'm reading gdb right (output below), the parsing is causing the trouble. As for the line in the file where the crash happens, I don't see any format problems with the line itself. When I run the program multiple times, the exact same line causes the segfault every time.

    What would be awesome would be a way to tell the program, "if you see a line which confuses you, skip that line, don't just automatically crash!" Skipping the entire file would be okay too.

    Below is the code I'm using. Below that is the gdb analysis of why my program is choking. Any help or advice would be appreciated!

    Code:
    // Fragment: needs <fstream>, <sstream>, <string>, <vector> and namespace std
    vector<string> ListOfFiles;
    string Line;
    string Value;            // one comma-separated field of the current line
    vector<string> ValRow;   // all fields of the current line
    
    // Load all the file names into ListOfFiles
    
    for(size_t i=0; i<ListOfFiles.size(); i++)
      {
        Line.clear();
        ifstream In_Flows((ListOfFiles[i]).c_str());
        while (getline(In_Flows, Line))
          {
            istringstream linestream(Line);
            ValRow.clear();
            while(getline(linestream, Value, ','))
              { ValRow.push_back(Value); }
            // Load contents of ValRow into objects
          }
      }
    GDB Output
    Program received signal SIGSEGV, Segmentation fault.
    0xff056b20 in realfree () from /lib/libc.so.1
    (gdb) bt
    #0 0xff056b20 in realfree () from /lib/libc.so.1
    #1 0xff0573d4 in cleanfree () from /lib/libc.so.1
    #2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
    #3 0xff05641c in malloc () from /lib/libc.so.1
    #4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
    #5 0xff318fe4 in std::string::_Rep::_S_create ()
    from /usr/local/lib/libstdc++.so.6
    #6 0xff3196e0 in std::string::_M_mutate () from /usr/local/lib/libstdc++.so.6
    #7 0xff30bd28 in std::getline<char, std::char_traits<char>, std::allocator<char> > () from /usr/local/lib/libstdc++.so.6
    #8 0x000199ac in ReadTheFile (PtrFlowInfoFile=0xffbffc48,
    PtrPrefixInfoFile=0xffbffc38, PtrRouterObjLibrary=0x41808, ATAFlag=true)
    at ReadTheFile.h:119
    #9 0x0001a2f0 in main (argc=2, argv=0xffbffccc) at Main.cpp:60
    (gdb) up
    #1 0xff0573d4 in cleanfree () from /lib/libc.so.1
    (gdb) up
    #2 0xff05652c in _malloc_unlocked () from /lib/libc.so.1
    (gdb) up
    #3 0xff05641c in malloc () from /lib/libc.so.1
    (gdb) up
    #4 0xff337734 in operator new () from /usr/local/lib/libstdc++.so.6
    (gdb) up
    #5 0xff318fe4 in std::string::_Rep::_S_create ()
    from /usr/local/lib/libstdc++.so.6
    (gdb) up
    #6 0xff3196e0 in std::string::_M_mutate () from /usr/local/lib/libstdc++.so.6
    (gdb) up
    #7 0xff30bd28 in std::getline<char, std::char_traits<char>, std::allocator<char> > () from /usr/local/lib/libstdc++.so.6
    (gdb) up
    #8 0x000199ac in ReadTheFile (PtrFlowInfoFile=0xffbffc48,
    PtrPrefixInfoFile=0xffbffc38, PtrRouterObjLibrary=0x41808, Flag=true)
    at ReadTheFile.h:119
    119 while(getline(linestream, Value, ','))
    (gdb) print Line
    $1 = {static npos = 4294967295,
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
    _M_p = 0xd756c "DataPoint0,DataPoint1,DataPoint2,DataPoint3,DataPoint4,DataPoint5,DataPoint6,DataPoint7,DataPoint8,DataPoint9"}}
    (gdb)

  2. #2
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Are you closing each file after you open it? There is a limit to how many open files a process can have at once.
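    As far as the posted snippet shows, the ifstream is declared inside the body of the for loop, so it is destroyed (and the file closed) at the end of every iteration without an explicit close(). A minimal sketch of that pattern follows; the helper name CountLinesInAll is made up for illustration and is not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: counts the lines in every file, holding only one
// file open at any moment.
std::size_t CountLinesInAll(const std::vector<std::string>& files)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i < files.size(); ++i)
    {
        std::ifstream in(files[i].c_str());   // file is opened here
        if (!in)
            continue;                         // unreadable file: skip it

        std::string line;
        while (std::getline(in, line))
            ++total;
    }   // 'in' goes out of scope on each pass, which closes the file
    return total;
}
```

    If the stream were declared outside the loop, or allocated with new and never deleted, the open-file limit could become a real concern.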

  3. #3
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Tip - to get your program to fail sooner, start with file #501.

  4. #4
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    You're corrupting the heap -- it probably has nothing to do with a "bad" line in the file. I bet if you run file #501 individually it doesn't crash.

    Run your program under Valgrind. It will show you the problem instantly.
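    One common way to corrupt the heap is to index past the end of a container, for example reading ValRow[9] on a row that ended up with fewer fields. The sketch below validates the field count before anything touches the row, which also gives the "skip that line" behaviour asked for in the first post. It assumes each line should carry exactly ten comma-separated fields (the thread does not state this), and the names kExpectedFields, SplitLine and ParseRow are hypothetical. It does not repair corruption that happens elsewhere in the program, so a Valgrind run is still the way to find the real culprit.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Assumption for illustration: every valid line has exactly ten fields.
const std::size_t kExpectedFields = 10;

// Splits one line on commas, the same way the posted loop does.
std::vector<std::string> SplitLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::istringstream linestream(line);
    std::string value;
    while (std::getline(linestream, value, ','))
        fields.push_back(value);
    return fields;
}

// Returns true and fills 'row' only when the line has the expected number
// of fields; otherwise returns false and leaves 'row' untouched so the
// caller can skip the line.
bool ParseRow(const std::string& line, std::vector<std::string>& row)
{
    std::vector<std::string> fields = SplitLine(line);
    if (fields.size() != kExpectedFields)
        return false;
    row.swap(fields);
    return true;
}
```

    Inside the read loop this would be used as "if (!ParseRow(Line, ValRow)) continue;". Accessing the accepted row with ValRow.at(n) instead of ValRow[n] adds a bounds check that throws std::out_of_range rather than silently writing out of bounds.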

  5. #5
    Registered User jeffcobb's Avatar
    Join Date
    Dec 2009
    Location
    Henderson, NV
    Posts
    875
    OK, a couple of things:
    1. I would be writing these "small objects" to a binary file as you read/parse them so that if this happens (or a power outage or..) your program can just pick up where it left off.
    2. I would wrap problematic code blocks in try/catch statements which when used properly can produce the effect you are searching for (if *anything* goes bad, write that log name to an error file and keep going)
    3. 500 does seem like an arbitrary number to crash on. You could be corrupting (or more to the point, overrunning) your stack because you are opening hundreds of objects on the stack and not necessarily cleaning up after yourself (I have seen how this works vary from compiler to compiler). Try new/delete on linestream and see if your problem "moves".
    4. In GDB you don't have to keep going UP to get to a particular stack frame; f 8 (short for frame 8) would have taken you there directly.
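    Point 2 can be sketched roughly as follows: each file is processed inside its own try block, a failure is recorded, and the loop moves on to the next file. The function names LoadOneFile and LoadAll, the expectedFields parameter, and the idea of throwing on a malformed line are assumptions made for this illustration, not the original poster's code. Note that try/catch only intercepts C++ exceptions; a SIGSEGV caused by heap corruption is a signal, not an exception, so it would not be caught this way.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical per-file loader: returns the number of rows read and throws
// std::runtime_error if the file cannot be opened or a line has the wrong
// number of comma-separated fields.
std::size_t LoadOneFile(const std::string& name, std::size_t expectedFields)
{
    std::ifstream in(name.c_str());
    if (!in)
        throw std::runtime_error("cannot open file");

    std::size_t rows = 0;
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream linestream(line);
        std::string value;
        std::size_t count = 0;
        while (std::getline(linestream, value, ','))
            ++count;
        if (count != expectedFields)
            throw std::runtime_error("malformed line");
        ++rows;
    }
    return rows;
}

// Processes every file. A failure in one file is appended to 'failed'
// (file name plus reason) and the remaining files are still processed.
std::size_t LoadAll(const std::vector<std::string>& files,
                    std::size_t expectedFields,
                    std::vector<std::string>& failed)
{
    std::size_t totalRows = 0;
    for (std::size_t i = 0; i < files.size(); ++i)
    {
        try
        {
            totalRows += LoadOneFile(files[i], expectedFields);
        }
        catch (const std::exception& e)
        {
            failed.push_back(files[i] + ": " + e.what());
        }
    }
    return totalRows;
}
```

    Because the row count is only added after LoadOneFile returns, a file that fails halfway contributes nothing, which matches the "skipping the entire file would be okay" requirement. The entries in failed could be written to an error file at the end of the run.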

  6. #6
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > // Load all the file names into ListOfFiles
    Do you use any plain C-strings in this bit of code?

    Or do you call malloc / new at all anywhere prior to the code that fails?
