Searching a text file backwards

**Geek10** · 09-02-2010

Hi all,

I thought this was an easy problem, but I haven't been able to find an answer in the last hour or so of searching, so maybe someone can direct me.

I have a text file that looks like this:

Code:

image 2318
281253 6.13325e+06 35.8228 0.0812 
-3.13765 0.0117648 -1.07704
1150 1156 1164 1176 870 268 312 308 272 254 286 260 278 267 335 253 315...
725 741 749 757 761 787 803 967 975 971 971 979 1002 980 2086 2130 2168...
...
image 2319
283453 6.13525e+06 36.9328 0.0812
(repeating again)

I want to find the number after the word "image". In a previous post, I was helped to do this. However, I would like to know how to search backwards in the text file, because it grows to a few MB pretty quickly and the set of data I require will be either the last set of data or close to last set before the end of the text file.

I thought there was a flag or something I could go to such as eof() or something, but couldn't seem to find it (feof - C++ Reference doesn't seem to apply).

I can't find a .find backwards option either.

Perhaps I could read in the last section of code into a buffer big enough to hold a datalog, and then search that forwards? I don't really have much of an idea of how to do this though.

Thanks in advance for any suggestions, particularly with code, as I am out of my depth (but love a challenge).

Geek10

**Geek10** · 09-02-2010

Aha! I just discovered a really good resource - past posts by users of this forum! I thought there would be more help on this sort of problem that Google would tell me.

So I have found that string::rfind is the search tool I need. Now I just need some pointers as to how to use it to search backwards. Is loading a buffer a good idea? I think loading the whole text file to a string is inefficient (and probably impossible).

Does anyone know how I can get the last chunk of a text file into a string?

Thanks,

Geek10

**pianorain** · 09-02-2010

Part of the problem is that it's a text file. From any given position, you can't know where to seek to for the next line or the previous line. You have to read through the file to figure it out.

Instead of trying to read the file backwards, I'd probably accomplish this with a sort of two-pass algorithm. First, get the file position of each image header, skipping past the rest of the data. If you know that each image takes up a certain number of lines, you can read and skip that many lines ahead; if not, you'll have to check each line for the "image" string. When you've completed this pass, you'll know to which position to seek for any image in the file. Second, seek back to the desired image and read all the data in.

**EVOEx** · 09-02-2010

In most cases I'm with pianorain. Only in very specific, uncommon cases it's a good idea to do exactly what you want.

If you still think you should do it the way you wanted, the best way is probably to get to the end of the file using seekg, then use seek to seek a number of characters back (eg. 1024), then read 1024 characters, find the places it has newlines, then compare those places with what you want to match. If it's not there, go back another 1024 characters and try again.

You'll have to be careful about cases like where "image" spans across the 1024 character boundaries, so take that into account. All in all, pianorain's idea is easier.
You could use a finite state machine to match "egami\n" ("\nimage" reversed), and run back through it as input for the state machine, but still pianorain's idea is easier and usually better.

**pianorain** · 09-02-2010

Originally Posted by EVOEx

You could use a finite state machine to match "egami\n" ("\nimage" reversed), and run back through it as input for the state machine

This is possible, but I have to think it would be slower than reading the entire file through from beginning to end. Every seek basically trashes whatever caching mechanism your system uses to speed up file i/o, and you'll have to seek backwards one characters for every read. Your first suggestion doesn't seem like a bad way to do it, though.

**Geek10** · 09-02-2010

Originally Posted by pianorain

Part of the problem is that it's a text file. From any given position, you can't know where to seek to for the next line or the previous line. You have to read through the file to figure it out.

Instead of trying to read the file backwards, I'd probably accomplish this with a sort of two-pass algorithm. First, get the file position of each image header, skipping past the rest of the data. If you know that each image takes up a certain number of lines, you can read and skip that many lines ahead; if not, you'll have to check each line for the "image" string. When you've completed this pass, you'll know to which position to seek for any image in the file. Second, seek back to the desired image and read all the data in.

I'm open for ideas. That's what I'm here for. But I do have a few questions about this technique above:
1. Would I do this "image" header file position search every single time I want to read the text file? Remember that I am adding new data to it all of the time, and usually the latest or last couple of data logs will be the ones I want. Or wouldn't this search take very long? It's a 1.5MB file (and growing).

2. How would I go about completing the pass? I'll learn a lot more if you don't write the code for me, but could you provide me with a rough plan and key c++ keywords I would probably use? Then I can look them up and slowly figure it all out.

3. Would you ever load a 1.5MB file as an entire string? Would that take very long on an average computer today (which is what I am using)?

I hope you don't mind being bombarded with questions!

**Geek10** · 09-02-2010

Originally Posted by EVOEx

If you still think you should do it the way you wanted, the best way is probably to get to the end of the file using seekg, then use seek to seek a number of characters back (eg. 1024), then read 1024 characters, find the places it has newlines, then compare those places with what you want to match. If it's not there, go back another 1024 characters and try again.

This seems like the more efficient way to me, and hopefully easier, as I am not entirely sure how you would go about the pass you are describing pianorain. I guess my case is more specific too, in that I know the lines I am searching for are going to be at the end of the file, so I don't really need to search the rest of it.

But I'll wait and see whether pianorain can give me a bit more of a description, and whether my questions are easily answered.

**pianorain** · 09-02-2010

Ah, that raises some questions. Are you constantly adding data to the same file? If so, you could write another file with the position of the image headers and the position of the end of the file (in other words, the position of the next image that would be added to the file). Then when you re-run the program, it could load the positions file, look for new headers from the end position, and save out any new image positions and the new end of file.

Actually, I wouldn't make this program look for new headers every run. I'd only make it look for new headers in the data file if I asked it to retrieve an image for which it didn't have a header position yet. That allows you to retrieve all the images for which you have a header pretty quickly, even if the original data file has grown.

Of course, that program may be more complicated than what you really need. If you know for certain you only need data at the very end of the file and you're not interested in getting data back from anywhere else, EVOEx's suggestion is sufficient.

**brewbuck** · 09-02-2010

I dealt with a similar problem in my PDF library. The first step in parsing a PDF involves locating the string "%%EOF" near the end of the file. The problem is exacerbated by the fact that more than one %%EOF can occur in the file -- the one which is important is the last one to occur.

In the end I took the stupid approach -- seeking to 5 characters before the end of the file, looking if %%EOF is there -- if not, seek backward one character and try again, and so on, until it is found.

It is a single-shot operation and so the performance really didn't matter, especially because %%EOF almost always occurs very close to the actual end of the file, and it doesn't require too much seeking to find it.

I considered a "better" algorithm which would load large blocks then search them backward for the given string, dealing with block crossings as mentioned by EVOEx. But it would have been a pain, difficult to understand, and to very little benefit. Sometimes the stupid solution is the better solution.

**EVOEx** · 09-03-2010

Originally Posted by pianorain

This is possible, but I have to think it would be slower than reading the entire file through from beginning to end. Every seek basically trashes whatever caching mechanism your system uses to speed up file i/o, and you'll have to seek backwards one characters for every read. Your first suggestion doesn't seem like a bad way to do it, though.

I actually meant combining that with the X-character read (best value for X would probably be the block size), and then parsing it backways as input for the finite state machine.

**Geek10** · 09-03-2010

Originally Posted by pianorain

If you know for certain you only need data at the very end of the file and you're not interested in getting data back from anywhere else, EVOEx's suggestion is sufficient.

Yes. That's the case. So I think I will go with EVOEx's suggestion.

**Geek10** · 09-03-2010

Originally Posted by EVOEx

If you still think you should do it the way you wanted, the best way is probably to get to the end of the file using seekg, then use seek to seek a number of characters back (eg. 1024), then read 1024 characters, find the places it has newlines, then compare those places with what you want to match. If it's not there, go back another 1024 characters and try again.

So do I load the whole text file as one string?

Code:

#include <iostream>
#include <string>
#include <conio.h> //needed for getch()
#include <sstream>
using namespace std;

int main()
{
   ifstream inFile;
	string searchString("image");
   //string lineOfText("image 236");
   inFile.open("C:\\datafile.txt", ifstream::in);

So I read that ifstream::in "allows input operations on the stream." Do I enable this?

**whiteflags** · 09-03-2010

You do not strictly need to tell open () that you are using ios::in mode -- and it is ios::in -- in this case because the stream type is ifstream, whose objects can only read files. The mode parameter in open () would be useful when you want to read a binary file
-- ios::in | ios::binary -- or do something in addition to opening for reading, so bear in mind that is the reason there are two ways to call open ().

Also, consider re-arranging your input, if you can, so that more recent images are at the top.

**pianorain** · 09-04-2010

Originally Posted by Geek10

So do I load the whole text file as one string?

No. Use seekg to seek to chunk_size from the end. Read chunk_size bytes from the file stream, checking for your 'image' string. If it's not in there, seekg chunk_size * 2 from the end and try again.

If 'image' is the only character string in the file, checking to see if it happens on a chunk boundary is pretty simple. All you're really looking for is a 'g'. As soon as you've got a 'g', you should be able to figure out where the image number is.

**Geek10** · 09-09-2010

Thanks for all of the help, pianorain, whiteflags, EVOEx and brewbuck.

For completeness, and for people searching the internet for an answer to a similar problem (like I was), the final code is pasted below.

Code:

#include <iostream>
#include <string>
#include <conio.h> //needed for getch()
#include <sstream>
#include <fstream>
using namespace std;

#define numLidarBeams 529	// Total LIDAR beams per scan

struct IMUobject{	// IMU data
	float yaw;
	float pitch;
	float roll;
};

struct GPSobject{	// GPS data
	int easting;
	float northing;
	float altitude;
	float speed;
};

struct Lidarobject{	// LiDAR
	int Readings [numLidarBeams];
};

struct Dataobject{
	IMUobject curIMU;
	GPSobject curGPS;
	Lidarobject curLidar;
};

int main()
{
	//Initialisation
	int desiredImageNumber = 9;
	int imgNum;				// Initialise the integer for the found image number value
	int bufferSize = 500;
	string searchString("image");
	long int currentPtrPosn; // Where the pointer is up to searching
	int eofPtrPosn;
	Dataobject curData;
	
	ifstream inFile;	// Initiate the input file stream
	inFile.open("C:\\datafile.txt", ifstream::in);

	inFile.seekg (0, ios::end);	// Go to the end of the file
	eofPtrPosn = inFile.tellg();	// Store the end of file address to an integer
	eofPtrPosn = eofPtrPosn;
	currentPtrPosn = inFile.tellg(); // IF ONLY I KNEW HOW TO DO THINGS INITIALLY THEN DO A LOOP
	while (desiredImageNumber != imgNum){
		char * buffer; 		// to store search search block of code
		buffer = new char [bufferSize]; // allocate memory
		// go back one buffer size in the text file (the 11 accounts for the possible overlap
		// of "image #####" across the buffer ends except for the first read at the end of file
		if(eofPtrPosn == currentPtrPosn){
			currentPtrPosn = eofPtrPosn - bufferSize;
		}else{
			currentPtrPosn = currentPtrPosn - bufferSize + 11;
		}
		inFile.seekg (currentPtrPosn);
		currentPtrPosn = inFile.tellg();
		// read data into buffer
		inFile.read (buffer,bufferSize);
		
		//display buffer read
		cout << "The contents of the buffer are: " << endl;
		cout << "________________________________________" << endl;
		cout.write (buffer,bufferSize);
		cout << endl << "________________________________________" << endl;
		
		// Put contents of buffer into a string
		string bufferString(buffer);	
		
		// size_type imageStringAddr is the location of the first string matching searchString 
		//IN the string bufferStringfound by .find
		string::size_type imageStringAddr = bufferString.find(searchString, 0);
		
		if (imageStringAddr != string::npos){	// If the location of the first string is not at the end of the string
			// This makes a string out of the characters following the space after "image"
			string numberString = bufferString.substr(imageStringAddr + searchString.size() + 1);
			// stringstream provides an interface to manipulate strings as if they were input/output streams.
			stringstream numberStream(numberString);
			if((numberStream >> imgNum).fail()){	// Convert stream to integer
				cout << "ERROR: Failed string to integer conversion";
			}
			cout << "imgNum integer: " << imgNum << endl;
			
			// Find the number of digits in imgNum
			int imgNumDigits = 0;
			int step = 1;
			while (step <= imgNum) {
				imgNumDigits++;
				step *= 10;
			}		
			// The "+2" accounts for buffer overlap, not sure why - "\n"? perhaps?
			inFile.seekg (currentPtrPosn + imageStringAddr + searchString.size() + imgNumDigits + 2);
		}
  		delete[] buffer;
	}
	inFile >> curData.curGPS.easting;
	cout << "easting: " << curData.curGPS.easting << endl;
	inFile >> curData.curGPS.northing;
	inFile >> curData.curGPS.altitude;
	inFile >> curData.curGPS.speed;
	inFile >> curData.curIMU.roll;
	inFile >> curData.curIMU.pitch;
	inFile >> curData.curIMU.yaw;
	
	for(int i=0; i<numLidarBeams; i++){
	  inFile >> curData.curLidar.Readings[i];
	}
	cout << "curData.curLidar.Readings[528]: " << curData.curLidar.Readings[528] << endl;
	
	inFile.close();
	getch();
   return 0;
}

Many thanks to all those who helped with advice!

Thread: Searching a text file backwards

Thread Tools

Search Thread

Display

Searching a text file backwards

string::rfind

A couple of questions

Sounds good...

My answer

Similar Threads

need help with writing to text file / reading from text file

Help with finding word frequency in a text file.

Searching a text file for double quotation marks "

Removing text between /* */ in a file

Tags for this Thread