Hi,
a couple of days ago I asked the question "How to read a file when I have no clue how large it is". What I did not realize is how inefficient my solution would be once I wrote it down. Below are the cores of my two reading routines (C and C++), and I hope you can help me make the slow one faster.
C function that reads a file of unknown size:
Code:
while ((c = read(fd, buf, BUFSIZ)) > 0) {
    /* grow the buffer if this chunk will not fit */
    if (s->len + c > maxLen) {
        if (maxLen >= BUFSIZ)
            maxLen *= 2;
        else
            maxLen = BUFSIZ + 1;
        s->seq = (char *)realloc(s->seq, maxLen * sizeof(char));
    }
    /* append the chunk, skipping newlines */
    for (i = 0; i < c; i++) {
        if (buf[i] != '\n')
            s->seq[s->len++] = buf[i];
    }
}
read time:
time ./cread -i test > tmp
real 0m0.169s
user 0m0.120s
sys 0m0.030s
C++ function that reads a file of unknown size:
Code:
std::vector<std::string> lines;  // one entry per line of the file
while (std::getline(infile, line, '\n')) {
    lines.push_back(line);
}
time ./cppread -i test > tmp
real 0m4.692s
user 0m4.650s
sys 0m0.030s
Now, this was tested with gcc 4.4.3 on Ubuntu, on a file of 1,100,000 characters. The files I actually deal with are 100x larger than that, and I cannot afford 3 h of loading for 20 min of computing. Is there a faster way to load files of unknown size?
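For reference, one thing I was planning to try (but have not benchmarked yet) is reading the whole file in a single bulk read instead of calling getline() once per line. This is just a sketch with no error checking, and `read_whole_file` is a name I made up:

```cpp
#include <algorithm>
#include <fstream>
#include <string>

// Read an entire file into one string, stripping newlines
// (mirrors what the C loop above does with the '\n' check).
std::string read_whole_file(const char *path) {
    std::ifstream infile(path, std::ios::binary);

    // seek to the end to learn the size, so we allocate exactly once
    infile.seekg(0, std::ios::end);
    std::string data(static_cast<std::size_t>(infile.tellg()), '\0');
    infile.seekg(0, std::ios::beg);

    // one bulk read instead of one getline() per line
    infile.read(&data[0], static_cast<std::streamsize>(data.size()));

    // drop the newlines
    data.erase(std::remove(data.begin(), data.end(), '\n'), data.end());
    return data;
}
```

This avoids both the per-line allocations of the vector-of-strings approach and repeated buffer growth, since the size is known before the read.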
Please help
baxy
P.S.
there might be some spelling errors in the code.