Thread: Iterate through text file extremely fast

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    Iterate through text file extremely fast

    Hi,

    I have some large files, each having a classic table format. I need to extract information from the first column, the second column, and the fourth column. Furthermore, I need only one number from the second column. The way I am doing it right now is:

    Code:
    fstream fs;
    fs.open (file.c_str(), ios::in);
    
    int number = 0;
    long unsigned pg = 0;
    string query, subject, tmp;
    
    while( !fs.eof()) {
    
        fs >> query;
        fs >> subject;
        fs >> tmp;
        fs >> number;
        if (cnt = sscanf(subject, "pg|%lu", &pg))!=1)continue;
      }
    1. Is there a better (faster) way to do this?
    2. If tmp is an int, then the code above might be problematic. Is there a way to just skip
    Code:
    fs >> tmp;
    that is, the third column?

    b

  2. #2
    Guest
    Guest
    Could you show a few lines of the data in the table? It might be easier to suggest possible alternatives then.

  3. #3
    Salem
    and the hat of int overfl
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > while( !fs.eof())
    See the FAQ on why using eof() to control a loop is bad.

    > if (cnt = sscanf(subject, "pg|%lu", &pg))!=1)continue;
    This doesn't even compile.

    Also, continue at the end of the loop adds no value anyway.

    Perhaps you should be looking at getline with a variable delimiter parameter.
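    Something along these lines, say (just a sketch, assuming '|'-separated fields and a hypothetical file name; adjust the delimiter to whatever your data actually uses):

    Code:
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    
    int main()
    {
        std::ifstream fs("testfile");          // hypothetical file name
        std::string line;
        while (std::getline(fs, line))         // loop on the read, not on eof()
        {
            std::istringstream ss(line);
            std::string field;
            while (std::getline(ss, field, '|'))   // '|' as the delimiter
                std::cout << field << '\n';
        }
        return 0;
    }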
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    example is:

    Code:
    pgi3ti|106|	pg|38406037|ti0|	92.44	357	
    pgi3ti|105|	pg|39300935|ti0|	89.08	357	
    pgi3ti|104|	pg|29400391|ti0|	78.61	360	
    pgi3ti|102|	pg|28808954|ti0|	na	357	
    pgi3ti|101|	pg|26408981|ti0|	63.97	358
    Last edited by baxy; 04-05-2015 at 03:32 PM.

  5. #5
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    Quote Originally Posted by Salem View Post
    > while( !fs.eof())
    See the FAQ on why using eof() to control a loop is bad.
    Touché

    I absolutely agree (habit, I guess)

    Quote Originally Posted by Salem View Post
    > if (cnt = sscanf(subject, "pg|%lu", &pg))!=1)continue;
    This doesn't even compile.

    Also, continue at the end of the loop adds no value anyway.

    Perhaps you should be looking at getline with a variable delimiter parameter.
    getline vs. get character is rather slow, and though this could be done pure C style, I am just wondering if maybe someone has a slick and fast way of doing it

    thnx

  6. #6
    Elysia
    Here comes C++! Break C!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    The absolute fastest way is to just read a chunk of the file into memory and do the processing from there. Use whatever APIs your OS provides to optimize the reading. E.g., Windows has a flag to indicate sequential reading, which greatly speeds up reads if you're not seeking randomly around the file.
    Of course, the fastest way is usually just to read the entire file into memory and then process it. Do it in chunks if it's too large.

    Also, be sure to disable the dumb syncing with C I/O that's on by default. It's a static member, so one call at the top of main covers the standard streams:

    std::ios_base::sync_with_stdio(false);

    This will also speed up reads by quite a bit.
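    For the read-the-whole-file part, a minimal sketch (error handling kept short; the function name is just for illustration):

    Code:
    #include <fstream>
    #include <stdexcept>
    #include <string>
    
    // Read the whole file into one buffer, then parse from memory.
    std::string ReadWholeFile(const std::string& path)
    {
        std::ifstream fs(path, std::ios::binary);
        if (!fs)
            throw std::runtime_error("Cannot open file: " + path);
    
        fs.seekg(0, std::ios::end);            // find the size...
        std::string data(static_cast<size_t>(fs.tellg()), '\0');
        fs.seekg(0, std::ios::beg);            // ...rewind, then slurp
        fs.read(&data[0], data.size());
        return data;
    }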
    Last edited by Elysia; 04-05-2015 at 05:39 PM.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  7. #7
    Guest
    Guest
    Quote Originally Posted by baxy View Post
    example is:
    Code:
    pgi3ti|106|	pg|38406037|ti0|	92.44	357	
    pgi3ti|105|	pg|39300935|ti0|	89.08	357	
    pgi3ti|104|	pg|29400391|ti0|	78.61	360	
    pgi3ti|102|	pg|28808954|ti0|	na	357	
    pgi3ti|101|	pg|26408981|ti0|	63.97	358
    If you can guarantee certain patterns in each line, then you might be able to write a faster extraction algorithm. You might be able to do blind reads for each column and check for the delimiters in your rows .|.|<tab>.|.|.|<tab>.<tab>.<newline>

    As Elysia suggested, consider reading your file into memory first. With this kind of data, your CPU can likely outrun your hard drive. If the file is very big (are we talking GiB, dozens of GiB?) and your memory is constrained, you could try having one thread read in chunks of raw data while other threads start extracting it. This way you'll start processing while reading is still in progress.
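    For instance, something like this per raw line (a rough sketch that assumes every row matches the layout above):

    Code:
    #include <cstdio>
    #include <cstring>
    
    int main()
    {
        const char line[] = "pgi3ti|106|\tpg|38406037|ti0|\t92.44\t357";
    
        const char* tab = strchr(line, '\t');   // end of the first column
        unsigned long pg = 0;
        if (tab && sscanf(tab + 1, "pg|%lu|", &pg) == 1)
            printf("%.*s -> %lu\n", (int)(tab - line), line, pg);
        return 0;
    }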

  8. #8
    Salem
    and the hat of int overfl
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > you could try having one thread read in chunks of raw data while other threads start extracting it.
    Probably not.
    terminology - CPU bound and I/O bound? - Stack Overflow

    Your idea would work where you were doing some intense processing on chunks of file data (say applying video effects to an MPEG stream), but simply tokenising a text file isn't going to make the CPU the bottleneck.

    > Of course, the fastest way is usually just to read the entire file into memory and then process it.
    > Do it in chunks if it's too large.
    It's vitally important to choose the chunk size wisely. If you make it too large, all you will succeed in doing is transferring the contents of the file (on disk) to the contents of the swap file (on disk).
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  9. #9
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    I give up. So here is my test code and the runtimes. The file is exactly as I described above. In the examples below I was interested in the first column and in the number in the second column.

    Runtime tests were performed on a 1 GB file


    Line reader:
    Code:
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <cstdio>      // sscanf
    #include <stdexcept>
    
    
    using namespace std;
    
    
    void ReadFile(const string file){
    
    
      fstream fs;
    
      fs.open (file.c_str(), ios::in);
      if ( !fs.is_open())
        throw runtime_error ("Cannot open file: " + file );
        
      string line;
      char t[128];   // first column
      int e = 0;     // the pg number from the second column
      
      while( getline(fs, line)) {   // loop on the read itself, not on eof()/good()
        // e.g. "pgi3ti|106|<tab>pg|38406037|ti0|<tab>92.44<tab>357"
        if( sscanf(line.c_str(), "%127s pg|%d|", t, &e) == 2){
          cout <<  t << " " << e <<  endl;
        }
      }
    fs.close();
    
    }
    
    int main (){
      
    try{
       ReadFile("testfile");
    }catch(runtime_error& e){
      cerr << e.what() << "\n";
    }
      return 0;
    }
    Runtime:

    real 0m45.838s
    user 0m30.918s
    sys 0m14.437s

    Mem reader:

    Code:
    #include <iostream>
    #include <string>
    #include <cstring>
    #include <stdexcept>
    #include <fcntl.h>     // open, posix_fadvise
    #include <unistd.h>    // read, close
    
    
    using namespace std;
    
    
    void ReadFile(const string file){
    
      int fd = open(file.c_str(), O_RDONLY);
      
      if ( fd == -1)
        throw runtime_error ("Cannot open file: " + file );
    
      static const auto BUFFER_SIZE = 16*1024;
      posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // hint: we read front to back
      char buf[BUFFER_SIZE + 1];
    
    
      while(ssize_t bytes_read = read(fd, buf, BUFFER_SIZE)){
            if(bytes_read == -1)
                throw runtime_error("read failed");
    
            for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p);  // end line detecting
        }
    
      close(fd);
    }
    
    
    
    int main (){
      
    try{
       ReadFile("testfile");
    }catch(runtime_error& e){
      cerr << e.what() << "\n";
    }
      return 0;
    }
    Runtime:

    real 0m0.280s
    user 0m0.100s
    sys 0m0.176s

    However, as soon as I start extracting information from the line (second solution), the runtime increases to almost that of the first one (about 5 seconds faster). Any suggestions on how to improve upon this?

    thnx

    PS

    OK, to be fair, if I add

    Code:
    for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p){
        cout << "write something " << endl;
    }
    Runtime:

    real 0m12.260s
    user 0m2.252s
    sys 0m9.985s
    Last edited by baxy; 04-06-2015 at 05:30 AM.

  10. #10
    Prelude
    Code Goddess
    Join Date
    Sep 2001
    Posts
    9,897
    Quote Originally Posted by baxy View Post
    Any suggestions on how to improve upon this
    It should be noted that adding console output to your algorithm will likely skew your timing results significantly. Dump the results to a file and I suspect you'll see a noticeable performance improvement in your tests.

    Another option I haven't seen mentioned yet is memory mapping. It works best for random access rather than sequential access as you're apparently doing, but still worth testing as you weigh options. As you've probably found, abstractions tend to slow things down, so if you're looking for sheer speed, drop down to the lowest level you can to eliminate abstractions (such as those offered by <iostream>).
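    A bare-bones POSIX sketch of the memory-mapped approach (assumes Linux/Unix and the same hypothetical "testfile" as above; error handling abbreviated):

    Code:
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>
    
    int main()
    {
        int fd = open("testfile", O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }
    
        struct stat sb;
        if (fstat(fd, &sb) == -1) { perror("fstat"); return 1; }
    
        // Map the whole file read-only; the kernel pages it in on demand.
        char* data = (char*) mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }
    
        // Same newline scan as the read() version, but over the mapping.
        size_t lines = 0;
        for (char* p = data;
             (p = (char*) memchr(p, '\n', (data + sb.st_size) - p));
             ++p)
            ++lines;
        printf("%zu lines\n", lines);
    
        munmap(data, sb.st_size);
        close(fd);
        return 0;
    }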
    My best code is written with the delete key.

  11. #11
    Registered User
    Join Date
    May 2010
    Posts
    4,632
    Also, mixing C stdio functions with C++ streams and std::strings is sure to cause problems.

    I'd start by reading a complete line with getline() and then use a stringstream to parse that line. If your file reading is still too slow, and you've profiled your code to determine where the actual problem lies, you could then try other alternatives. Something like this is where I would start.

    Code:
    #include <string>
    #include <sstream>
    #include <fstream>
    #include <vector>
    #include <iostream>
    
    struct information
    {
       std::string first;
       int second;
       int fourth;
    };
    
    /* I prefer to pass an open file stream and a container to hold the data.
       Let the calling function deal with opening and checking to ensure the file
       actually opened correctly.
    */
    size_t  ReadFile(std::istream& fin, std::vector<information>& info)
    {
       std::string line;
       size_t i = 0;
       std::stringstream sin;
       std::string trash;
       while(getline(fin,line))
       {
          sin.str(line);
          getline(sin, info[i].first, '|');
          sin >> info[i].second >> trash;
          getline(sin, trash, '|');
          sin >> info[i].fourth;
          ++i;
          if(!(i % 1000))
             info.resize(i + 1000);
    
       }
       return(i);
    }
    
    int main()
    {
       std::string test{"pgi3ti|106| pg|38406037|ti0|    92.44   357\n"
                        "pgi3ti|105| pg|39300935|ti0|    89.08   357\n"
                        "pgi3ti|104| pg|29400391|ti0|    78.61   360\n"
                        "pgi3ti|102| pg|28808954|ti0|    na  357\n"
                        "pgi3ti|101| pg|26408981|ti0|    63.97   358\n"};
    
       // Create a vector with an initial 1000 elements.
       std::vector<information> results(1000);
       // Using a stringstream to simulate a file.
       std::stringstream fin(test);
    
       size_t numRecords = ReadFile(fin, results);
       for(size_t i = 0; i < numRecords; ++i)
          std::cout << results[i].first << " " << results[i].second << " " << results[i].fourth << std::endl;
    }

    Jim
    Last edited by jimblumberg; 04-06-2015 at 07:35 AM.

  12. #12
    Guest
    Guest
    Quote Originally Posted by Salem View Post
    > you could try having one thread read in chunks of raw data while other threads start extracting it.
    Probably not.
    terminology - CPU bound and I/O bound? - Stack Overflow

    Your idea would work where you were doing some intense processing on chunks of file data (say applying video effects to an MPEG stream), but simply tokenising a text file isn't going to make the CPU the bottleneck.
    I was thinking that it might allow for faster overall speed, if "readline, extract, readline, extract..." is slower than a couple "read-chunk, extract, read-chunk, extract...", but that was an uneducated guess.

  13. #13
    brewbuck
    Officially An Architect
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Terminal output is dog slow. In your "slower" implementation, you are outputting to cout inside your innermost loop. Obviously that's going to be ruinous to performance.

    Also, I've seen some truly awful (performance-wise) C++ IO implementations. It might suck simply because you're using streams. I know some people are going to disagree with me on that, but it's my experience.
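    If you want to check whether streams are the culprit, the plain stdio version of the line reader is a quick test (a sketch, reusing the format string and hypothetical file name from the earlier post):

    Code:
    #include <cstdio>
    
    int main()
    {
        FILE* f = fopen("testfile", "r");
        if (!f) { perror("fopen"); return 1; }
    
        char line[1024], first[128];
        int pg = 0;
        while (fgets(line, sizeof line, f))
        {
            if (sscanf(line, "%127s pg|%d|", first, &pg) == 2)
            {
                // process first/pg here (don't cout inside the loop)
            }
        }
        fclose(f);
        return 0;
    }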
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  14. #14
    Elysia
    Here comes C++! Break C!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by brewbuck View Post
    Also, I've seen some truly awful (performance-wise) C++ IO implementations. It might suck simply because you're using streams. I know some people are going to disagree with me on that, but it's my experience.
    No, that just about sums it up. I have experienced the same, as well.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.
