Thread: Iterate through text file extremely fast

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    Iterate through text file extremely fast

    Hi,

    I have some large files, each having a classic table format. I need to extract information from the first column, the second column, and the fourth column. Furthermore, I need only one number from the second column. The way I am doing it right now is:

    Code:
    fstream fs;
    fs.open (file.c_str(), ios::in);
    
    int number = 0;
    long unsigned pg = 0;
    string query, subject, tmp;
    
    while( !fs.eof()) {
    
        fs >> query;
        fs >> subject;
        fs >> tmp;
        fs >> number;
        if (cnt = sscanf(subject, "pg|%lu", &pg))!=1)continue;
      }
    1. Is there a better (faster) way to do this?
    2. If tmp is an int, then the code above might be problematic. Is there a way to just skip
    Code:
    fs >> tmp;
    that is, the third column?

    b

  2. #2
    Guest
    Guest
    Could you show a few lines of the data in the table? It might be easier to suggest possible alternatives then.

  3. #3
    Salem
    and the hat of int overfl
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > while( !fs.eof())
    See the FAQ on why using eof() to control a loop is bad.

    > if (cnt = sscanf(subject, "pg|%lu", &pg))!=1)continue;
    This doesn't even compile.

    Also, continue at the end of the loop adds no value anyway.

    Perhaps you should be looking at getline with a variable delimiter parameter.
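    Something along these lines, say (just a sketch, assuming '|'-separated fields and a hypothetical file name; adjust the delimiter to whatever your data actually uses):

    Code:
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    
    int main()
    {
        std::ifstream fs("testfile");          // hypothetical file name
        std::string line;
        while (std::getline(fs, line))         // loop on the read, not on eof()
        {
            std::istringstream ss(line);
            std::string field;
            while (std::getline(ss, field, '|'))   // '|' as the delimiter
                std::cout << field << '\n';
        }
        return 0;
    }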
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    example is:

    Code:
    pgi3ti|106|	pg|38406037|ti0|	92.44	357	
    pgi3ti|105|	pg|39300935|ti0|	89.08	357	
    pgi3ti|104|	pg|29400391|ti0|	78.61	360	
    pgi3ti|102|	pg|28808954|ti0|	na	357	
    pgi3ti|101|	pg|26408981|ti0|	63.97	358
    Last edited by baxy; 04-05-2015 at 03:32 PM.

  5. #5
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    Quote Originally Posted by Salem View Post
    > while( !fs.eof())
    See the FAQ on why using eof() to control a loop is bad.
    Touché

    I absolutely agree (habit, I guess)

    Quote Originally Posted by Salem View Post
    > if (cnt = sscanf(subject, "pg|%lu", &pg))!=1)continue;
    This doesn't even compile.

    Also, continue at the end of the loop adds no value anyway.

    Perhaps you should be looking at getline with a variable delimiter parameter.
    getline vs. get character is rather slow, and though this could be done pure C style, I am just wondering if maybe someone has a slick and fast way of doing it

    thnx

  6. #6
    Elysia
    Here comes C++! Break C!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    The absolute fastest way is to just read a chunk of the file into memory and do the processing from there. Use whatever APIs your OS provides to optimize the reading. E.g., Windows has a flag to indicate sequential reading, which greatly speeds up reads if you're not seeking randomly around the file.
    Of course, the fastest way is usually just to read the entire file into memory and then process it. Do it in chunks if it's too large.

    Also, be sure to disable the dumb syncing with C I/O that's on by default. It's a static member, so one call at the top of main covers the standard streams:

    std::ios_base::sync_with_stdio(false);

    This will also speed up reads by quite a bit.
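    For the read-the-whole-file part, a minimal sketch (error handling kept short; the function name is just for illustration):

    Code:
    #include <fstream>
    #include <stdexcept>
    #include <string>
    
    // Read the whole file into one buffer, then parse from memory.
    std::string ReadWholeFile(const std::string& path)
    {
        std::ifstream fs(path, std::ios::binary);
        if (!fs)
            throw std::runtime_error("Cannot open file: " + path);
    
        fs.seekg(0, std::ios::end);            // find the size...
        std::string data(static_cast<size_t>(fs.tellg()), '\0');
        fs.seekg(0, std::ios::beg);            // ...rewind, then slurp
        fs.read(&data[0], data.size());
        return data;
    }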
    Last edited by Elysia; 04-05-2015 at 05:39 PM.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  7. #7
    Guest
    Guest
    Quote Originally Posted by baxy View Post
    example is:
    Code:
    pgi3ti|106|	pg|38406037|ti0|	92.44	357	
    pgi3ti|105|	pg|39300935|ti0|	89.08	357	
    pgi3ti|104|	pg|29400391|ti0|	78.61	360	
    pgi3ti|102|	pg|28808954|ti0|	na	357	
    pgi3ti|101|	pg|26408981|ti0|	63.97	358
    If you can guarantee certain patterns in each line, then you might be able to write a faster extraction algorithm. You might be able to do blind reads for each column and check for the delimiters in your rows .|.|<tab>.|.|.|<tab>.<tab>.<newline>

    As Elysia suggested, consider reading your file into memory first. With this kind of data, your CPU can likely outrun your hard drive. If the file is very big (are we talking GiB, dozens of GiB?) and your memory is constrained, you could try having one thread read in chunks of raw data while other threads start extracting it. This way you'll start processing while reading is still in progress.
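    For instance, something like this per raw line (a rough sketch that assumes every row matches the layout above):

    Code:
    #include <cstdio>
    #include <cstring>
    
    int main()
    {
        const char line[] = "pgi3ti|106|\tpg|38406037|ti0|\t92.44\t357";
    
        const char* tab = strchr(line, '\t');   // end of the first column
        unsigned long pg = 0;
        if (tab && sscanf(tab + 1, "pg|%lu|", &pg) == 1)
            printf("%.*s -> %lu\n", (int)(tab - line), line, pg);
        return 0;
    }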

  8. #8
    Salem
    and the hat of int overfl
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > you could try having one thread read in chunks of raw data while other threads start extracting it.
    Probably not.
    terminology - CPU bound and I/O bound? - Stack Overflow

    Your idea would work where you were doing some intense processing on chunks of file data (say applying video effects to an MPEG stream), but simply tokenising a text file isn't going to make the CPU the bottleneck.

    > Of course, the fastest way is usually just to read the entire file into memory and then process it.
    > Do it in chunks if it's too large.
    It's vitally important to choose the chunk size wisely. If you make it too large, all you will succeed in doing is transferring the contents of the file (on disk) to the contents of the swap file (on disk).
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  9. #9
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    I give up. So here is my test code and the runtimes. The file is exactly as I described above. In the examples below I was interested in the first column and in the number in the second column.

    Runtime tests were performed on a 1 GB file


    Line reader:
    Code:
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <cstdio>      // sscanf
    #include <stdexcept>
    
    
    using namespace std;
    
    
    void ReadFile(const string file){
    
    
      fstream fs;
    
      fs.open (file.c_str(), ios::in);
      if ( !fs.is_open())
        throw runtime_error ("Cannot open file: " + file );
        
      string line;
      char t[128];   // first column
      int e = 0;     // the pg number from the second column
      
      while( getline(fs, line)) {   // loop on the read itself, not on eof()/good()
        // e.g. "pgi3ti|106|<tab>pg|38406037|ti0|<tab>92.44<tab>357"
        if( sscanf(line.c_str(), "%127s pg|%d|", t, &e) == 2){
          cout <<  t << " " << e <<  endl;
        }
      }
    fs.close();
    
    }
    
    int main (){
      
    try{
       ReadFile("testfile");
    }catch(runtime_error& e){
      cerr << e.what() << "\n";
    }
      return 0;
    }
    Runtime:

    real 0m45.838s
    user 0m30.918s
    sys 0m14.437s

    Mem reader:

    Code:
    #include <iostream>
    #include <string>
    #include <cstring>
    #include <stdexcept>
    #include <fcntl.h>     // open, posix_fadvise
    #include <unistd.h>    // read, close
    
    
    using namespace std;
    
    
    void ReadFile(const string file){
    
      int fd = open(file.c_str(), O_RDONLY);
      
      if ( fd == -1)
        throw runtime_error ("Cannot open file: " + file );
    
      static const auto BUFFER_SIZE = 16*1024;
      posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // hint: we read front to back
      char buf[BUFFER_SIZE + 1];
    
    
      while(ssize_t bytes_read = read(fd, buf, BUFFER_SIZE)){
            if(bytes_read == -1)
                throw runtime_error("read failed");
    
            for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p);  // end line detecting
        }
    
      close(fd);
    }
    
    
    
    int main (){
      
    try{
       ReadFile("testfile");
    }catch(runtime_error& e){
      cerr << e.what() << "\n";
    }
      return 0;
    }
    Runtime:

    real 0m0.280s
    user 0m0.100s
    sys 0m0.176s

    However, as soon as I start extracting information from the line (second solution), the runtime increases to almost that of the first one (about 5 seconds faster). Any suggestions on how to improve upon this?

    thnx

    PS

    OK, to be fair, if I add

    Code:
    for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p){
        cout << "write something " << endl;
    }
    Runtime:

    real 0m12.260s
    user 0m2.252s
    sys 0m9.985s
    Last edited by baxy; 04-06-2015 at 05:30 AM.

  10. #10
    Prelude
    Code Goddess
    Join Date
    Sep 2001
    Posts
    9,897
    Quote Originally Posted by baxy View Post
    Any suggestions on how to improve upon this
    It should be noted that adding console output to your algorithm will likely skew your timing results significantly. Dump the results to a file and I suspect you'll see a noticeable performance improvement in your tests.

    Another option I haven't seen mentioned yet is memory mapping. It works best for random access rather than sequential access as you're apparently doing, but still worth testing as you weigh options. As you've probably found, abstractions tend to slow things down, so if you're looking for sheer speed, drop down to the lowest level you can to eliminate abstractions (such as those offered by <iostream>).
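    A bare-bones POSIX sketch of the memory-mapped approach (assumes Linux/Unix and the same hypothetical "testfile" as above; error handling abbreviated):

    Code:
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>
    
    int main()
    {
        int fd = open("testfile", O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }
    
        struct stat sb;
        if (fstat(fd, &sb) == -1) { perror("fstat"); return 1; }
    
        // Map the whole file read-only; the kernel pages it in on demand.
        char* data = (char*) mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }
    
        // Same newline scan as the read() version, but over the mapping.
        size_t lines = 0;
        for (char* p = data;
             (p = (char*) memchr(p, '\n', (data + sb.st_size) - p));
             ++p)
            ++lines;
        printf("%zu lines\n", lines);
    
        munmap(data, sb.st_size);
        close(fd);
        return 0;
    }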
    My best code is written with the delete key.

  11. #11
    Registered User
    Join Date
    May 2010
    Posts
    4,632
    Also, mixing C stdio functions with C++ streams and std::strings is sure to cause problems.

    I'd start by reading a complete line with getline() and then use a stringstream to parse that line. If your file reading is still too slow, and you've profiled your code to determine where the actual problem lies, you could then try other alternatives. Something like this is where I would start.

    Code:
    #include <string>
    #include <sstream>
    #include <fstream>
    #include <vector>
    #include <iostream>
    
    struct information
    {
       std::string first;
       int second;
       int fourth;
    };
    
    /* I prefer to pass an open file stream and a container to hold the data.
       Let the calling function deal with opening and checking to ensure the file
       actually opened correctly.
    */
    size_t  ReadFile(std::istream& fin, std::vector<information>& info)
    {
       std::string line;
       size_t i = 0;
       std::stringstream sin;
       std::string trash;
       while(getline(fin,line))
       {
          sin.str(line);
          getline(sin, info[i].first, '|');
          sin >> info[i].second >> trash;
          getline(sin, trash, '|');
          sin >> info[i].fourth;
          ++i;
          if(!(i % 1000))
             info.resize(i + 1000);
    
       }
       return(i);
    }
    
    int main()
    {
       std::string test{"pgi3ti|106| pg|38406037|ti0|    92.44   357\n"
                        "pgi3ti|105| pg|39300935|ti0|    89.08   357\n"
                        "pgi3ti|104| pg|29400391|ti0|    78.61   360\n"
                        "pgi3ti|102| pg|28808954|ti0|    na  357\n"
                        "pgi3ti|101| pg|26408981|ti0|    63.97   358\n"};
    
       // Create a vector with an initial 1000 elements.
       std::vector<information> results(1000);
       // Using a stringstream to simulate a file.
       std::stringstream fin(test);
    
       size_t numRecords = ReadFile(fin, results);
       for(size_t i = 0; i < numRecords; ++i)
          std::cout << results[i].first << " " << results[i].second << " " << results[i].fourth << std::endl;
    }

    Jim
    Last edited by jimblumberg; 04-06-2015 at 07:35 AM.

  12. #12
    Guest
    Guest
    Quote Originally Posted by Salem View Post
    > you could try having one thread read in chunks of raw data while other threads start extracting it.
    Probably not.
    terminology - CPU bound and I/O bound? - Stack Overflow

    Your idea would work where you were doing some intense processing on chunks of file data (say applying video effects to an MPEG stream), but simply tokenising a text file isn't going to make the CPU the bottleneck.
    I was thinking that it might allow for faster overall speed, if "readline, extract, readline, extract..." is slower than a couple "read-chunk, extract, read-chunk, extract...", but that was an uneducated guess.

  13. #13
    brewbuck
    Officially An Architect
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Terminal output is dog slow. In your "slower" implementation, you are outputting to cout inside your innermost loop. Obviously that's going to be ruinous to performance.

    Also, I've seen some truly awful (performance-wise) C++ IO implementations. It might suck simply because you're using streams. I know some people are going to disagree with me on that, but it's my experience.
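    If you want to check whether streams are the culprit, the plain stdio version of the line reader is a quick test (a sketch, reusing the format string and hypothetical file name from the earlier post):

    Code:
    #include <cstdio>
    
    int main()
    {
        FILE* f = fopen("testfile", "r");
        if (!f) { perror("fopen"); return 1; }
    
        char line[1024], first[128];
        int pg = 0;
        while (fgets(line, sizeof line, f))
        {
            if (sscanf(line, "%127s pg|%d|", first, &pg) == 2)
            {
                // process first/pg here (don't cout inside the loop)
            }
        }
        fclose(f);
        return 0;
    }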
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  14. #14
    Elysia
    Here comes C++! Break C!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by brewbuck View Post
    Also, I've seen some truly awful (performance-wise) C++ IO implementations. It might suck simply because you're using streams. I know some people are going to disagree with me on that, but it's my experience.
    No, that just about sums it up. I have experienced the same, as well.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.
