Thread: reading text

  1. #1
    Registered User
    Join Date
    Sep 2007
    Posts
    6

    reading text

    I am developing a search engine that needs to search big text files, and I have just started with C++.

    I would like to know which of the following methods is faster for reading from big text files.

    1.reading line by line
    Code:
      ifstream myfile ("c:\\hello.txt");
     while (! myfile.eof() )
        {
          getline (myfile,line);
          cout << line << endl;
        }
    Or

    2.streaming everything

    Code:
      while(myfile >> line){
    	stringstream os(line);
            cout<<line;
       }
    I would go for the second method, but I'd like to get an opinion. Is there any faster (more efficient) way of reading text from a file than the methods mentioned above?

    Thanks.

  2. #2
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    Between the two they will probably be pretty close. However, they will provide different output. The getline function reads and ignores the newline character, but then you add a newline in your output, so the output will look the same. The operator>> ignores all whitespace, so your output will have all the text bunched together with no spaces, newlines or tabs.

    If this is an issue then you might want to stick with getline.

    You might also consider reading the entire file into a stringstream in memory (using rdbuf()) and parsing it from there if you need access to everything in the file. One big read is probably faster than many small ones.
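
    A minimal sketch of the rdbuf() idea, assuming the file fits comfortably in memory. The function name slurp is hypothetical; for a real file the argument would be a std::ifstream opened on the path.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: pull everything that is left in `in` into memory.
std::string slurp(std::istream& in)
{
    std::ostringstream buffer;
    buffer << in.rdbuf();   // copies the whole underlying stream buffer in one call
    return buffer.str();
}
```

    The returned string can then be wrapped in a std::istringstream and parsed with getline or operator>> without touching the disk again.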

    For the best performance you'd want to look at non-standard options that work on specific platforms.

  3. #3
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    You shouldn't use eof() as loop condition. See the faq.

  4. #4
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    I've done some performance testing with C++ lately, reading large binary files, converting them to text and writing them out, and I learned a few things along the way.

    Here's the short story:

    1) endl is NOT the same as '\n'. endl will give you a newline character, but it will also flush the output stream. This takes up time - major time with lots of data.

    2) I found it was faster to write to my own buffer, and then use ostream.write() instead of writing a line at a time. Much faster.

    Todd

  5. #5
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    So, conversely, I would lean towards istream.read() instead of reading a line at a time, and read a bunch of data at a time.

    Todd

  6. #6
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Todd Burch View Post
    1) endl is NOT the same as '\n'. endl will give you a newline character, but it will also flush the output stream. This takes up time - major time with lots of data.
    Yes, indeed. If you use '\n' instead of endl when writing lots of data to a file, you will save a significant amount of time because the file is flushed far less often.

  7. #7
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    Be careful using read with text mode processing. It is intended for binary files and doesn't convert newlines (and possibly other things as well).

  8. #8
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    It does convert newlines. Or at least it should, going by the spec.

  9. #9
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    It's possible I'm mistaken and that only happens if binary mode is set in the stream, but I seem to recall people posting problems that were caused by using read() with text mode file streams.

  10. #10
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    If by "big" you're not talking about gigabytes, try this:
    Code:
    ifstream inputFile( "input.txt" );
    string fileData( (istreambuf_iterator<char>( inputFile )),
                      istreambuf_iterator<char>() );
    Once it's all read into one big string, you can split up the lines if you want...
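
    One way the splitting could look, as a sketch only (splitLines is a hypothetical name, and it assumes lines end in a plain '\n'):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split an in-memory buffer on '\n'.
std::vector<std::string> splitLines(const std::string& data)
{
    std::vector<std::string> lines;
    std::string::size_type start = 0;
    while (start < data.size()) {
        std::string::size_type end = data.find('\n', start);
        if (end == std::string::npos)
            end = data.size();   // last line has no trailing newline
        lines.push_back(data.substr(start, end - start));
        start = end + 1;
    }
    return lines;
}
```

    This copies every line. If memory matters, storing start offsets and lengths into fileData instead of substrings avoids the second copy.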

  11. #11
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    >> It's possible I'm mistaken and that only happens if binary mode is set in the stream, but I seem to recall people posting problems that were caused by using read() with text mode file streams.
    The trouble with text files is that they have this concept of "newline". I think, if you happened to have the two characters in a binary file which represent a DOS-style newline, and tried to read that file with text mode under DOS, it would get converted and your binary read would be corrupted.

    Best just to go with binary files.
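
    The effect is easy to check on a given platform with a small experiment. The sketch below uses made-up helper names and writes a four-byte sample containing a DOS-style newline, then reads it back in whichever mode is requested.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: write the bytes 'a', '\r', '\n', 'b' exactly as given.
void writeCrLfSample(const char* path)
{
    std::ofstream out(path, std::ios::binary);
    out.write("a\r\nb", 4);
}

// Hypothetical helper: read a whole file using the given open mode.
std::string readWithMode(const char* path, std::ios_base::openmode mode)
{
    std::ifstream in(path, mode);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}
```

    Reading with std::ios::binary returns all four bytes everywhere. Reading with plain std::ios::in depends on the platform: Windows implementations typically collapse the "\r\n" pair into a single '\n', while on Linux and other POSIX systems text mode changes nothing.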

  12. #12
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    Code:
    string fileData( (istreambuf_iterator<char>( inputFile )),
                      istreambuf_iterator<char>() );
    Doesn't this read in character by character? I would imagine that would be slower than line by line. I would use rdbuf() or read() (or a non-standard solution), but I'm still not sure about the newline issue.

    >> The trouble with text files is that they have this concept of "newline".
    Yes, that was the point. The OP mentioned text files, though, so your suggestion might not be possible.
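
    For reference, a single read() into a pre-sized string might look like the sketch below. The name readWholeFile is hypothetical, and the file is deliberately opened in binary mode so that the size reported by tellg() matches what read() delivers, which sidesteps the newline question instead of answering it.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: size the string up front, then fill it with one read().
std::string readWholeFile(const char* path)
{
    // ate = start positioned at the end, so tellg() gives the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in)
        return std::string();
    const std::streamoff size = in.tellg();
    if (size <= 0)
        return std::string();
    std::string data(static_cast<std::string::size_type>(size), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&data[0], static_cast<std::streamsize>(size));
    // Keep only what was actually read, in case the read came up short.
    data.resize(static_cast<std::string::size_type>(in.gcount()));
    return data;
}
```

    Because the mode is binary, a file written on Windows keeps its "\r\n" pairs in the returned string, so any line splitting afterwards has to allow for the '\r'.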

  13. #13
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    Hmm... I just timed the 3 methods and it looks like the first is the fastest:
    Code:
    #include <iostream>
    #include <iterator>
    #include <fstream>
    #include <string>
    #include <ctime>
    
    using namespace std;
    
    
    void Test1( ifstream&  file )
    {
    	string line;
    	while ( !file.eof() )
    	{
    		getline( file, line );
    //		cout << line << endl;
    	}
    }
    
    void Test2( ifstream&  file )
    {
    	string line;
    	while ( file >> line )
    	{
    //		stringstream os( line );
    //		cout<<line;
    	}
    }
    
    void Test3( ifstream&  file )
    {
    	string fileData( (istreambuf_iterator<char>( file )),
    					  istreambuf_iterator<char>() );
    }
    
    typedef void (*TestFunc)( ifstream& );
    
    clock_t TimeFunc( TestFunc  func, const char*  filename )
    {
    	ifstream file( filename );
    	clock_t start = clock();
    	func( file );
    	clock_t end = clock();
    	return (end - start);
    }
    
    int main()
    {
    	for ( int i = 0; i < 10; ++i )
    	{
    		clock_t time1 = TimeFunc( &Test1, "E:/Test_10MB.txt" );
    		clock_t time2 = TimeFunc( &Test2, "E:/Test_10MB.txt" );
    		clock_t time3 = TimeFunc( &Test3, "E:/Test_10MB.txt" );
    
    		cout << endl << "Func1() time is: " << time1
    			 << endl << "Func2() time is: " << time2
    			 << endl << "Func3() time is: " << time3 << endl;
    	}
    
    	return 0;
    }
    Code:
    Func1() time is: 890
    Func2() time is: 1297
    Func3() time is: 1125
    
    Func1() time is: 843
    Func2() time is: 1266
    Func3() time is: 1125
    
    Func1() time is: 875
    Func2() time is: 1282
    Func3() time is: 1109
    
    Func1() time is: 875
    Func2() time is: 1281
    Func3() time is: 1125
    
    Func1() time is: 844
    Func2() time is: 1281
    Func3() time is: 1125
    
    Func1() time is: 860
    Func2() time is: 1265
    Func3() time is: 1141
    
    Func1() time is: 843
    Func2() time is: 1282
    Func3() time is: 1125
    
    Func1() time is: 875
    Func2() time is: 1265
    Func3() time is: 1110
    
    Func1() time is: 844
    Func2() time is: 1297
    Func3() time is: 1109
    
    Func1() time is: 859
    Func2() time is: 1266
    Func3() time is: 1125

  14. #14
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    Would you mind testing read() and rdbuf()?

    Also note that the third method fills an entire string with the data, while the other two keep overwriting the previous line/word.

  15. #15
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    line.reserve(nn) might be in order too, to cut down on potential reallocations.
