Thread: extremely slow IO, how to make it faster

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    extremely slow IO, how to make it faster

    Hi,

    A couple of days ago I asked the question "How to read a file when I have no clue how large it is". But I did not know how inefficient my solution would be once I wrote it down. So in the following section I give you pseudo-code for two reading systems (C and C++) and hope that you can help me make it faster.

    C function that reads the file of unknown size:

    Code:
    while((c = read(fd,buf,BUFSIZ)) > 0){
        if(s->len + c > maxLen){
            if(maxLen >= BUFSIZ)
                maxLen *= 2;
            else
                maxLen = BUFSIZ + 1;
            s->seq = (char *)realloc(s->seq,maxLen*sizeof(char));
        }else if(buf[i] != '\n'){
            s->seq[s->len] = buf[i];
            s->len++;
        }
    }
    read time:
    time ./cread -i test > tmp

    real 0m0.169s
    user 0m0.120s
    sys 0m0.030s


    C++ function that reads the file of unknown size:

    Code:
    while (std::getline(infile, line, '\n')){
        string.push_back(line);
    }
    time ./cppread -i test > tmp

    real 0m4.692s
    user 0m4.650s
    sys 0m0.030s


    Now this was tested under gcc 4.4.3 on Ubuntu on a file that has 1100000 characters. What I am dealing with are files that are 100x larger than that, and I cannot invest 3h loading the file and 20 min computing. Is there a faster way to load files of unknown size?

    Please help

    baxy

    P.S.

    there might be some spelling errors in the code.
    Last edited by baxy; 11-30-2012 at 05:17 AM.

  2. #2
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    time ./cread -i test > tmp

    real 0m0.169s
    user 0m0.120s
    sys 0m0.030s


    time ./cppread -i test > tmp

    real 0m4.692s
    user 0m4.650s
    sys 0m0.030s

    Now this was tested under gcc 4.4.3 on Ubuntu on a file that has 1100000 characters. What I am dealing with are files that are 100x larger than that, and I cannot invest 3h loading the file and 20 min computing. Is there a faster way to load files of unknown size?
    Your readings are nowhere near 3 hours. If I simply scale the C++ time by 100, about 8 minutes are required (4.692 s x 100 is roughly 470 s).

    Standard methods do not get better than this though. You'll have to start looking into platform specific options, if you aren't pulling my leg.

  3. #3
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by baxy
    So in the following section I give you pseudo-code for two reading systems (C and C++) and hope that you can help me make it faster.
    How did you manage to time pseudocode execution?

    Perhaps you mean code snippets instead, but then you might as well provide the simplest and smallest compilable programs that we can examine and run for ourselves. At the moment your C code looks strange because you rely on a variable named i that is never changed in the code.

    Oh, and I note that your C "pseudocode" and your C++ code are not equivalent in functionality. Consequently, before asking how to optimise, you should state what your exact requirements are.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  4. #4
    Registered User
    Join Date
    May 2010
    Posts
    4,632
    To speed up the C++ "program" eliminate the push_back() calls, use resize() to create multiple elements at a time then just assign values to these existing elements.

    Jim

  5. #5
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    Quote Originally Posted by jimblumberg View Post
    To speed up the C++ "program" eliminate the push_back() calls, use resize() to create multiple elements at a time then just assign values to these existing elements.
    I have a hard time believing that resize() and assignment is faster than allowing the vector to do its own memory management. do you have evidence to support this?

    no matter how you do it, the whole vector is getting copied every time it gets resized. either way, the copy constructor or copy assignment operator is getting called for each insertion, unless move semantics are available, which means gcc 4.3 or newer, and only with the -std=c++0x or -std=gnu++0x compiler options. since the OP said that the compiler is gcc 4.4.3, the option should be available. that would be the first place I'd start looking for performance improvements in the C++ version.
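
    A minimal sketch of what moving each line into the vector could look like, assuming a compiler and standard library with C++11 rvalue-reference support. The names read_lines_moved and check_read_lines_moved are made up for illustration, and whether this actually helps on gcc 4.4.3 would need to be measured:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads every line of `in` into a vector, moving each
// line into place instead of copying it. Needs C++11 (-std=c++0x on old gcc).
std::vector<std::string> read_lines_moved(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        // push_back(std::string&&) can take over the line's buffer.
        // std::getline clears `line` before refilling it, so reusing the
        // moved-from string on the next iteration is safe.
        lines.push_back(std::move(line));
    }
    return lines;
}

// Small in-memory check so the sketch can be exercised without a file.
void check_read_lines_moved() {
    std::istringstream in("ACGT\nTTGA\n\nCC");
    const std::vector<std::string> lines = read_lines_moved(in);
    assert(lines.size() == 4);
    assert(lines[2].empty());
}
```

    With a real file you would pass a std::ifstream instead of the std::istringstream used in the check.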

  6. #6
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by jimblumberg View Post
    To speed up the C++ "program" eliminate the push_back() calls, use resize() to create multiple elements at a time then just assign values to these existing elements.
    That will probably only improve things if the number of elements allocated via the resize causes it to grow more than the subsequent push_back's would.

    I would do a test to see which takes more time, the getline or the push_back. Remove one or the other and see how they compare, doing whatever you can to ensure the bits you're trying to measure aren't optimised out.
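
    One way such a comparison might be set up, as a rough sketch that assumes C++11's std::chrono is available. Timing, time_getline_only and time_getline_and_push_back are hypothetical names. Each pass returns a character count so the compiler cannot discard the work being measured:

```cpp
#include <chrono>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type for one timed pass over the input.
struct Timing {
    double seconds;     // wall-clock time spent in the read loop
    std::size_t chars;  // characters seen; keeps the work observable
};

// Pass 1: std::getline only. Each line is dropped after its length is counted.
Timing time_getline_only(std::istream& in) {
    const auto start = std::chrono::steady_clock::now();
    std::size_t chars = 0;
    std::string line;
    while (std::getline(in, line)) {
        chars += line.size();
    }
    const std::chrono::duration<double> dt =
        std::chrono::steady_clock::now() - start;
    return {dt.count(), chars};
}

// Pass 2: std::getline plus push_back of every line into a vector.
Timing time_getline_and_push_back(std::istream& in) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    const std::chrono::duration<double> dt =
        std::chrono::steady_clock::now() - start;
    // Count after the clock stops so the stored lines are actually used.
    std::size_t chars = 0;
    for (const std::string& l : lines) {
        chars += l.size();
    }
    return {dt.count(), chars};
}
```

    Run each pass on a freshly opened stream over the same file. The difference between the two times is a rough estimate of what push_back costs on top of getline.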

    Also, you may find that push_back of an empty string and then swapping the back element with the string just read in, avoids the copy.
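
    A sketch of that swap idea, which works without any C++0x options. The function name read_lines_swapped is only for illustration:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: appends an empty string, then swaps the freshly read
// line into it, so the line's characters are never copied.
std::vector<std::string> read_lines_swapped(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(std::string());  // cheap: nothing to copy yet
        lines.back().swap(line);         // exchange contents in constant time
        // `line` now holds the empty string and is refilled by getline.
    }
    return lines;
}
```

    The vector may still reallocate as it grows, so this only removes the per-line copy, not the cost of growing the vector itself.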
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  7. #7
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by Elkvis View Post
    I have a hard time believing that resize() and assignment is faster than allowing the vector to do its own memory management. do you have evidence to support this?

    no matter how you do it, the whole vector is getting copied every time it gets resized. either way, the copy constructor or copy assignment operator is getting called for each insertion, unless move semantics are available, which means gcc 4.3 or newer, and only with the -std=c++0x or -std=gnu++0x compiler options. since the OP said that the compiler is gcc 4.4.3, the option should be available. that would be the first place I'd start looking for performance improvements in the C++ version.
    I suspect you've made Jim's arguments for him.

    Using push_back() repeatedly will involve at least one resize of the vector (with all the copying overhead that entails) unless the vector starts with reserved capacity that exceeds the final size.

    A single resize in advance will mean no subsequent need for resizing. Assuming that the final size can be determined in advance.

    Personally, I would test to compare the "resize in advance and assign" and "reserve in advance and use push_back()" approaches. But, as a rule of thumb, it is reasonable to expect that repeated reallocations of memory (eg by resizing or re-reserving) will be more expensive than a single resize() (or reserve()) and assignment. The main exceptions to that rule of thumb would be if default initialisation of the vector elements is expensive, or if repeated assignment is.
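
    The two approaches being compared could be sketched like this, assuming some estimate of the line count is available. The parameter expected_lines and both function names are hypothetical:

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// "Reserve in advance and use push_back()": capacity is allocated up front,
// but no elements are constructed until a line is actually stored.
std::vector<std::string> read_with_reserve(std::istream& in,
                                           std::size_t expected_lines) {
    std::vector<std::string> lines;
    lines.reserve(expected_lines);
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);  // reallocates only if the estimate was too low
    }
    return lines;
}

// "Resize in advance and assign": expected_lines empty strings are default
// constructed first, then overwritten one by one.
std::vector<std::string> read_with_resize(std::istream& in,
                                          std::size_t expected_lines) {
    std::vector<std::string> lines(expected_lines);
    std::size_t used = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (used == lines.size()) {
            lines.resize(lines.size() * 2 + 1);  // estimate was too low
        }
        lines[used++] = line;
    }
    lines.resize(used);  // drop the elements that were never filled
    return lines;
}
```

    Both return the same lines. Any difference would be in how much default construction and reallocation happens along the way, which is what a timing test would have to show.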
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  8. #8
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    Quote Originally Posted by grumpy View Post
    Assuming that the final size can be determined in advance.
    that's the only case where using resize() initially really helps. the input is coming from a file, so unless the file contains header information about its content, without reading the entire file at least once, it's impossible to know how many lines you should expect. since the OP makes no reference to such a header, I can only assume that we are talking about reading from a file of unknown length. under those conditions, allocating space in advance is pointless, because you either do too much and waste space, potentially impacting system performance if it starts to swap out to disk, or you don't do enough and wind up re-sizing automatically anyway.

    The main exceptions to that rule of thumb would be if default initialisation of the vector elements is expensive, or if repeated assignment is.
    obviously, it's not a big concern with std::string, but other types could make things more interesting.

  9. #9
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    One thing you're doing that's bound to cause problems is storing the entire file as one long string. Do you really need to do that? It's generally better to consume a file as you need the data. Applications that do require the whole file to be in memory often don't need it to be in contiguous memory. If you were to use multiple strings, or a different data structure entirely, there would be a speedup.
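
    As an illustration of consuming the file as you go, here is a sketch that processes fixed-size chunks and never holds the whole file in memory. Counting the non-newline characters is only a stand-in for the real computation, which the thread does not describe. The function name and the default chunk size are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical example: counts every character except '\n' by reading the
// stream in chunks of `chunk_size` bytes and discarding each chunk after use.
std::size_t count_non_newline(std::istream& in,
                              std::size_t chunk_size = 65536) {
    assert(chunk_size > 0);
    std::vector<char> buf(chunk_size);
    std::size_t total = 0;
    // read() fails on the final short chunk, but gcount() still reports how
    // many bytes were delivered, so that last chunk is processed as well.
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
           in.gcount() > 0) {
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < got; ++i) {
            if (buf[i] != '\n') {
                ++total;
            }
        }
    }
    return total;
}
```

    Memory use stays at one chunk no matter how large the file is, so a 100x larger input only costs more time, not more memory.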
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.
