Thread: Searching a text file backwards

  1. #1
    Registered User
    Join Date
    Aug 2010
    Posts
    14

    Question Searching a text file backwards

    Hi all,

    I thought this was an easy problem, but I haven't been able to find an answer in the last hour or so of searching, so maybe someone can direct me.

    I have a text file that looks like this:
    Code:
    image 2318
    281253 6.13325e+06 35.8228 0.0812 
    -3.13765 0.0117648 -1.07704
    1150 1156 1164 1176 870 268 312 308 272 254 286 260 278 267 335 253 315...
    725 741 749 757 761 787 803 967 975 971 971 979 1002 980 2086 2130 2168...
    ...
    image 2319
    283453 6.13525e+06 36.9328 0.0812
    (repeating again)
    I want to find the number after the word "image". In a previous post, I was helped to do this. However, I would like to know how to search backwards in the text file, because it grows to a few MB pretty quickly and the set of data I require will be either the last set of data or close to last set before the end of the text file.

    I thought there was a flag or something I could go to such as eof() or something, but couldn't seem to find it (feof - C++ Reference doesn't seem to apply).

    I can't find a .find backwards option either.

    Perhaps I could read in the last section of code into a buffer big enough to hold a datalog, and then search that forwards? I don't really have much of an idea of how to do this though.

    Thanks in advance for any suggestions, particularly with code, as I am out of my depth (but love a challenge).

    Geek10

  2. #2
    Registered User
    Join Date
    Aug 2010
    Posts
    14

    string::rfind

    Aha! I just discovered a really good resource - past posts by users of this forum! I thought there would be more help on this sort of problem that Google would tell me.

    So I have found that string::rfind is the search tool I need. Now I just need some pointers as to how to use it to search backwards. Is loading a buffer a good idea? I think loading the whole text file to a string is inefficient (and probably impossible).

    Does anyone know how I can get the last chunk of a text file into a string?

    Thanks,

    Geek10

  3. #3
    Anti-Poster
    Join Date
    Feb 2002
    Posts
    1,401
    Part of the problem is that it's a text file. From any given position, you can't know where to seek to for the next line or the previous line. You have to read through the file to figure it out.

    Instead of trying to read the file backwards, I'd probably accomplish this with a sort of two-pass algorithm. First, get the file position of each image header, skipping past the rest of the data. If you know that each image takes up a certain number of lines, you can read and skip that many lines ahead; if not, you'll have to check each line for the "image" string. When you've completed this pass, you'll know to which position to seek for any image in the file. Second, seek back to the desired image and read all the data in.
    If I did your homework for you, then you might pass your class without learning how to write a program like this. Then you might graduate and get your degree without learning how to write a program like this. You might become a professional programmer without knowing how to write a program like this. Someday you might work on a project with me without knowing how to write a program like this. Then I would have to do you serious bodily harm. - Jack Klein

  4. #4
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    In most cases I'm with pianorain. Only in very specific, uncommon cases it's a good idea to do exactly what you want.

    If you still think you should do it the way you wanted, the best way is probably to get to the end of the file using seekg, then use seek to seek a number of characters back (eg. 1024), then read 1024 characters, find the places it has newlines, then compare those places with what you want to match. If it's not there, go back another 1024 characters and try again.

    You'll have to be careful about cases like where "image" spans across the 1024 character boundaries, so take that into account. All in all, pianorain's idea is easier.
    You could use a finite state machine to match "egami\n" ("\nimage" reversed), and run back through it as input for the state machine, but still pianorain's idea is easier and usually better.

  5. #5
    Anti-Poster
    Join Date
    Feb 2002
    Posts
    1,401
    Quote Originally Posted by EVOEx View Post
    You could use a finite state machine to match "egami\n" ("\nimage" reversed), and run back through it as input for the state machine
    This is possible, but I have to think it would be slower than reading the entire file through from beginning to end. Every seek basically trashes whatever caching mechanism your system uses to speed up file i/o, and you'll have to seek backwards one characters for every read. Your first suggestion doesn't seem like a bad way to do it, though.
    If I did your homework for you, then you might pass your class without learning how to write a program like this. Then you might graduate and get your degree without learning how to write a program like this. You might become a professional programmer without knowing how to write a program like this. Someday you might work on a project with me without knowing how to write a program like this. Then I would have to do you serious bodily harm. - Jack Klein

  6. #6
    Registered User
    Join Date
    Aug 2010
    Posts
    14

    Smile A couple of questions

    Quote Originally Posted by pianorain View Post
    Part of the problem is that it's a text file. From any given position, you can't know where to seek to for the next line or the previous line. You have to read through the file to figure it out.

    Instead of trying to read the file backwards, I'd probably accomplish this with a sort of two-pass algorithm. First, get the file position of each image header, skipping past the rest of the data. If you know that each image takes up a certain number of lines, you can read and skip that many lines ahead; if not, you'll have to check each line for the "image" string. When you've completed this pass, you'll know to which position to seek for any image in the file. Second, seek back to the desired image and read all the data in.
    I'm open for ideas. That's what I'm here for. But I do have a few questions about this technique above:
    1. Would I do this "image" header file position search every single time I want to read the text file? Remember that I am adding new data to it all of the time, and usually the latest or last couple of data logs will be the ones I want. Or wouldn't this search take very long? It's a 1.5MB file (and growing).

    2. How would I go about completing the pass? I'll learn a lot more if you don't write the code for me, but could you provide me with a rough plan and key c++ keywords I would probably use? Then I can look them up and slowly figure it all out.

    3. Would you ever load a 1.5MB file as an entire string? Would that take very long on an average computer today (which is what I am using)?

    I hope you don't mind being bombarded with questions!

  7. #7
    Registered User
    Join Date
    Aug 2010
    Posts
    14

    Post Sounds good...

    Quote Originally Posted by EVOEx View Post
    If you still think you should do it the way you wanted, the best way is probably to get to the end of the file using seekg, then use seek to seek a number of characters back (eg. 1024), then read 1024 characters, find the places it has newlines, then compare those places with what you want to match. If it's not there, go back another 1024 characters and try again.
    This seems like the more efficient way to me, and hopefully easier, as I am not entirely sure how you would go about the pass you are describing pianorain. I guess my case is more specific too, in that I know the lines I am searching for are going to be at the end of the file, so I don't really need to search the rest of it.

    But I'll wait and see whether pianorain can give me a bit more of a description, and whether my questions are easily answered.

  8. #8
    Anti-Poster
    Join Date
    Feb 2002
    Posts
    1,401
    Ah, that raises some questions. Are you constantly adding data to the same file? If so, you could write another file with the position of the image headers and the position of the end of the file (in other words, the position of the next image that would be added to the file). Then when you re-run the program, it could load the positions file, look for new headers from the end position, and save out any new image positions and the new end of file.

    Actually, I wouldn't make this program look for new headers every run. I'd only make it look for new headers in the data file if I asked it to retrieve an image for which it didn't have a header position yet. That allows you to retrieve all the images for which you have a header pretty quickly, even if the original data file has grown.

    Of course, that program may be more complicated than what you really need. If you know for certain you only need data at the very end of the file and you're not interested in getting data back from anywhere else, EVOEx's suggestion is sufficient.
    If I did your homework for you, then you might pass your class without learning how to write a program like this. Then you might graduate and get your degree without learning how to write a program like this. You might become a professional programmer without knowing how to write a program like this. Someday you might work on a project with me without knowing how to write a program like this. Then I would have to do you serious bodily harm. - Jack Klein

  9. #9
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    I dealt with a similar problem in my PDF library. The first step in parsing a PDF involves locating the string "%%EOF" near the end of the file. The problem is exacerbated by the fact that more than one %%EOF can occur in the file -- the one which is important is the last one to occur.

    In the end I took the stupid approach -- seeking to 5 characters before the end of the file, looking if %%EOF is there -- if not, seek backward one character and try again, and so on, until it is found.

    It is a single-shot operation and so the performance really didn't matter, especially because %%EOF almost always occurs very close to the actual end of the file, and it doesn't require too much seeking to find it.

    I considered a "better" algorithm which would load large blocks then search them backward for the given string, dealing with block crossings as mentioned by EVOEx. But it would have been a pain, difficult to understand, and to very little benefit. Sometimes the stupid solution is the better solution.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  10. #10
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Quote Originally Posted by pianorain View Post
    This is possible, but I have to think it would be slower than reading the entire file through from beginning to end. Every seek basically trashes whatever caching mechanism your system uses to speed up file i/o, and you'll have to seek backwards one characters for every read. Your first suggestion doesn't seem like a bad way to do it, though.
    I actually meant combining that with the X-character read (best value for X would probably be the block size), and then parsing it backways as input for the finite state machine.

  11. #11
    Registered User
    Join Date
    Aug 2010
    Posts
    14
    Quote Originally Posted by pianorain View Post
    If you know for certain you only need data at the very end of the file and you're not interested in getting data back from anywhere else, EVOEx's suggestion is sufficient.
    Yes. That's the case. So I think I will go with EVOEx's suggestion.

  12. #12
    Registered User
    Join Date
    Aug 2010
    Posts
    14

    Question

    Quote Originally Posted by EVOEx View Post
    If you still think you should do it the way you wanted, the best way is probably to get to the end of the file using seekg, then use seek to seek a number of characters back (eg. 1024), then read 1024 characters, find the places it has newlines, then compare those places with what you want to match. If it's not there, go back another 1024 characters and try again.
    So do I load the whole text file as one string?
    Code:
    #include <iostream>
    #include <string>
    #include <conio.h> //needed for getch()
    #include <sstream>
    using namespace std;
    
    int main()
    {
       ifstream inFile;
    	string searchString("image");
       //string lineOfText("image 236");
       inFile.open("C:\\datafile.txt", ifstream::in);
    So I read that ifstream::in "allows input operations on the stream." Do I enable this?

  13. #13
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    You do not strictly need to tell open () that you are using ios::in mode -- and it is ios::in -- in this case because the stream type is ifstream, whose objects can only read files. The mode parameter in open () would be useful when you want to read a binary file
    -- ios::in | ios::binary -- or do something in addition to opening for reading, so bear in mind that is the reason there are two ways to call open ().

    Also, consider re-arranging your input, if you can, so that more recent images are at the top.
    Last edited by whiteflags; 09-03-2010 at 10:31 PM.

  14. #14
    Anti-Poster
    Join Date
    Feb 2002
    Posts
    1,401
    Quote Originally Posted by Geek10 View Post
    So do I load the whole text file as one string?
    No. Use seekg to seek to chunk_size from the end. Read chunk_size bytes from the file stream, checking for your 'image' string. If it's not in there, seekg chunk_size * 2 from the end and try again.

    If 'image' is the only character string in the file, checking to see if it happens on a chunk boundary is pretty simple. All you're really looking for is a 'g'. As soon as you've got a 'g', you should be able to figure out where the image number is.
    If I did your homework for you, then you might pass your class without learning how to write a program like this. Then you might graduate and get your degree without learning how to write a program like this. You might become a professional programmer without knowing how to write a program like this. Someday you might work on a project with me without knowing how to write a program like this. Then I would have to do you serious bodily harm. - Jack Klein

  15. #15
    Registered User
    Join Date
    Aug 2010
    Posts
    14

    Talking My answer

    Thanks for all of the help, pianorain, whiteflags, EVOEx and brewbuck.

    For completeness, and for people searching the internet for an answer to a similar problem (like I was), the final code is pasted below.

    Code:
    #include <iostream>
    #include <string>
    #include <conio.h> //needed for getch()
    #include <sstream>
    #include <fstream>
    using namespace std;
    
    #define numLidarBeams 529	// Total LIDAR beams per scan
    
    struct IMUobject{	// IMU data
    	float yaw;
    	float pitch;
    	float roll;
    };
    
    struct GPSobject{	// GPS data
    	int easting;
    	float northing;
    	float altitude;
    	float speed;
    };
    
    struct Lidarobject{	// LiDAR
    	int Readings [numLidarBeams];
    };
    
    struct Dataobject{
    	IMUobject curIMU;
    	GPSobject curGPS;
    	Lidarobject curLidar;
    };
    
    int main()
    {
    	//Initialisation
    	int desiredImageNumber = 9;
    	int imgNum;				// Initialise the integer for the found image number value
    	int bufferSize = 500;
    	string searchString("image");
    	long int currentPtrPosn; // Where the pointer is up to searching
    	int eofPtrPosn;
    	Dataobject curData;
    	
    	ifstream inFile;	// Initiate the input file stream
    	inFile.open("C:\\datafile.txt", ifstream::in);
    
    	inFile.seekg (0, ios::end);	// Go to the end of the file
    	eofPtrPosn = inFile.tellg();	// Store the end of file address to an integer
    	eofPtrPosn = eofPtrPosn;
    	currentPtrPosn = inFile.tellg(); // IF ONLY I KNEW HOW TO DO THINGS INITIALLY THEN DO A LOOP
    	while (desiredImageNumber != imgNum){
    		char * buffer; 		// to store search search block of code
    		buffer = new char [bufferSize]; // allocate memory
    		// go back one buffer size in the text file (the 11 accounts for the possible overlap
    		// of "image #####" across the buffer ends except for the first read at the end of file
    		if(eofPtrPosn == currentPtrPosn){
    			currentPtrPosn = eofPtrPosn - bufferSize;
    		}else{
    			currentPtrPosn = currentPtrPosn - bufferSize + 11;
    		}
    		inFile.seekg (currentPtrPosn);
    		currentPtrPosn = inFile.tellg();
    		// read data into buffer
    		inFile.read (buffer,bufferSize);
    		
    		//display buffer read
    		cout << "The contents of the buffer are: " << endl;
    		cout << "________________________________________" << endl;
    		cout.write (buffer,bufferSize);
    		cout << endl << "________________________________________" << endl;
    		
    		// Put contents of buffer into a string
    		string bufferString(buffer);	
    		
    		// size_type imageStringAddr is the location of the first string matching searchString 
    		//IN the string bufferStringfound by .find
    		string::size_type imageStringAddr = bufferString.find(searchString, 0);
    		
    		if (imageStringAddr != string::npos){	// If the location of the first string is not at the end of the string
    			// This makes a string out of the characters following the space after "image"
    			string numberString = bufferString.substr(imageStringAddr + searchString.size() + 1);
    			// stringstream provides an interface to manipulate strings as if they were input/output streams.
    			stringstream numberStream(numberString);
    			if((numberStream >> imgNum).fail()){	// Convert stream to integer
    				cout << "ERROR: Failed string to integer conversion";
    			}
    			cout << "imgNum integer: " << imgNum << endl;
    			
    			// Find the number of digits in imgNum
    			int imgNumDigits = 0;
    			int step = 1;
    			while (step <= imgNum) {
    				imgNumDigits++;
    				step *= 10;
    			}		
    			// The "+2" accounts for buffer overlap, not sure why - "\n"? perhaps?
    			inFile.seekg (currentPtrPosn + imageStringAddr + searchString.size() + imgNumDigits + 2);
    		}
      		delete[] buffer;
    	}
    	inFile >> curData.curGPS.easting;
    	cout << "easting: " << curData.curGPS.easting << endl;
    	inFile >> curData.curGPS.northing;
    	inFile >> curData.curGPS.altitude;
    	inFile >> curData.curGPS.speed;
    	inFile >> curData.curIMU.roll;
    	inFile >> curData.curIMU.pitch;
    	inFile >> curData.curIMU.yaw;
    	
    	for(int i=0; i<numLidarBeams; i++){
    	  inFile >> curData.curLidar.Readings[i];
    	}
    	cout << "curData.curLidar.Readings[528]: " << curData.curLidar.Readings[528] << endl;
    	
    	inFile.close();
    	getch();
       return 0;
    }
    Many thanks to all those who helped with advice!

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 8
    Last Post: 05-05-2010, 02:43 PM
  2. Help with finding word frequency in a text file.
    By aeolusaether in forum C Programming
    Replies: 15
    Last Post: 04-04-2010, 09:59 PM
  3. Searching a text file for double quotation marks "
    By willie in forum C Programming
    Replies: 4
    Last Post: 11-23-2008, 02:00 PM
  4. Removing text between /* */ in a file
    By 0rion in forum C Programming
    Replies: 2
    Last Post: 04-05-2004, 08:54 AM

Tags for this Thread