Thread: Read a random line from a text file

  1. #1
    Registered User
    Join Date
    Oct 2010
    Posts
    42

    Read a random line from a text file

    Hi everyone,

    I want to write a method that reads a random line from a text file(each line only containing one word).

    I was wondering is there a more memory efficient way to do this rather than storing all of the lines in a monstrous array and simply selecting a random element of such array. This would prove most difficult if say i was to open several text files each with a few thousand words each.

    Any educational direction is more than welcome.

    Thank you
    Ryan

  2. #2
    C++ Junkie Mozza314's Avatar
    Join Date
    Jan 2011
    Location
    Australia
    Posts
    174
    You could record the starting positions of each line using fstream::tellg(), then when you know how many lines there are, choose a random one, lookup the starting position of that line and use fstream::seekg() to go back to that part of the file.

    Alternatively, if you're happy for the lines to be weighted slightly differently, you could measure the size of your file, choose a random byte, move backwards until you find a '\n' character and then print out the line that follows it.

    Also, if you're on linux and want to quickly do it the inefficient way, you can always do this:
    Code:
    cat someFile.txt | head -n312 | tail -n1
    (prints out 312th line)

  3. #3
    Registered User
    Join Date
    Oct 2010
    Posts
    42
    I done some reading on the seekg and tellg methods, seems fitting.
    The code i've written seems to have counted the characters. how could i re-write to count the lines. also the random "line" produced is always equal to character y for some reason..
    Code:
    #include <iostream>
    #include <fstream>
    #include <string>
    using namespace std;
    
    int main() 
    {
      string selected = "";
      int lines = 0;
    
      ifstream inputFile ("noun.txt");
    
      if(inputFile.is_open() && inputFile != NULL)
    	{
    		inputFile.seekg(0, ios::end);
    		lines = inputFile.tellg();
    		inputFile.seekg(0, ios::beg);
    
    		inputFile.seekg(rand());
    
    		inputFile >> selected;
    
    		cout << selected << endl;
    		cout << lines << endl;
    	}
      else
      {
    	  cout << "Error with file" << endl;
      }
    		inputFile.close();
    
      return 0;
    }

  4. #4
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    That code will not serve you well.

    >> if(inputFile.is_open() && inputFile != NULL)
    You do not need to do both of these. I personally prefer is_open() because it is very clear about what is being tested, and it does the test right. If you want to do it the other way, please use the stream's name, which is converted to a Boolean in this context. It turns out that a (intermediate) pointer conversion is how a stream decides whether to be true or false, but you're relying on this hackish conversion instead of only the Boolean one.

    >> lines = inputFile.tellg();
    This would not tell you the number of lines.

    >> inputFile.seekg(rand());
    Even if the previous bit did tell you the number of lines, this does not take that into account. Brush up on random numbers -- http://eternallyconfuzzled.com/arts/jsw_art_rand.aspx

    >> inputFile >> selected;
    Wouldn't get you a line. Brush up on std::string -- Cprogramming.com - C++ Standard Library - String Class

    Summing up, you're bad at following Moz's advice.

    To help Moz's advice make sense, go through the file once and find all the newlines. When you find one, tellg() the place and shove it in a vector:

    Code:
    vector<streampos> lineplaces;
    
    vector.push_back( 0 );
    while ( inputFile.get(ch) ) {
       if ( ch == '\n' ) lineplaces.push_back( inputFile.tellg() + 1);
    }
    Now you know where all the lines are. Pick a random index in for vector and zoom to that place in the stream from the beginning. Pick up the entire line.

    This is a very disk intensive method though, so if the file is small enough, you may as well read the whole file into a container of strings and select a random index for that. If the file is very large, I'm really not sure what would be good. But my idea of large is like, on the order of hundreds of megabytes.
    Last edited by whiteflags; 02-21-2011 at 05:50 PM.

  5. #5
    C++ Junkie Mozza314's Avatar
    Join Date
    Jan 2011
    Location
    Australia
    Posts
    174
    If the file is very large, I'm really not sure what would be good. But my idea of large is like, on the order of hundreds of megabytes.
    I have actually wondered what would be a good method for this after responding to the OP. For a file which is *very* large, like you mention (no reason we couldn't do this with terabytes actually), it would be interesting to design a program which chooses a random line with a reasonably even distribution as fast as possible. My solution would be along the lines of sampling several hundred lines for their length to get an estimate of the distribution, then pick a line by the simple method of going to random bytes and navigating backwards to the newline character, and either accept this line or reject it and resample based on a calculated probability (based on the estimated distribution) that corrects the distribution so that every line length is equally likely to be selected.

    Having put the above technique into words, a technique that should suit the OP would be to choose a random point in the file and then return the line *after* that. If line length is independent of the length of adjacent lines, then this would choose a truly random line. Unfortunately in the general case you couldn't just assume that though.

    By the way RyanLeonard if you do decide to go with this method, if there isn't a line after the line that you jump to, make sure that you return the first line of the file. Otherwise the first line will have a 0% chance of getting picked.

    EDIT: I think I should could correct myself on a technicality here. Line length being independent of adjacent lines is not quite enough to make the technique produce a truly random line (i.e. with a completely uniform distribution). Problem is the line that owns the random byte you select is not the one that gets chosen, so there is a very small bias against large lines.
    Last edited by Mozza314; 02-22-2011 at 05:14 AM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Formatting the contents of a text file
    By dagorsul in forum C++ Programming
    Replies: 2
    Last Post: 04-29-2008, 12:36 PM
  2. Replies: 3
    Last Post: 03-04-2005, 02:46 PM
  3. Possible circular definition with singleton objects
    By techrolla in forum C++ Programming
    Replies: 3
    Last Post: 12-26-2004, 10:46 AM
  4. Read SPECIFIC line from text file
    By 13373ar5 in forum C++ Programming
    Replies: 5
    Last Post: 05-21-2004, 05:14 AM
  5. how do I read the last line of a text file?
    By Unregistered in forum C Programming
    Replies: 2
    Last Post: 04-12-2002, 05:34 PM