Thread: what do you do to avoid loading large dataset again and again

  1. #1
    Registered User
    Join Date
    Dec 2008
    Posts
    48

    what do you do to avoid loading large dataset again and again

    hey, there,

At the beginning of my code, I first read a large data file and then do analysis of the dataset. I never change the dataset. The data file is large, >=300 MB. I frequently change the analysis part and run tests. Every time I need to load the large data file again, which has become unbearable.

In Matlab, I can read the data file at the beginning of the session and then do all kinds of analysis without needing to reload the data. So I am thinking of establishing a separate process which reads the data file and stays there all day responding to data requests from my analysis routine, but I am not sure how to do it in C++. Or maybe there is a better method.

Please kindly share with me how you cope with this situation. Thanks.

    Michael

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
How are you reading the file? How long does it take? My guess would be that reading the file is much less of a problem than your METHOD of reading the file (or some other processing that is part of reading the file). It shouldn't take very long to read 300MB of data if you do it in relatively large blocks.
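
For example, something along these lines (just a sketch, function name made up, no error handling) pulls the whole file into memory with one big read instead of many small ones, and you then parse the buffer in memory:
    Code:
    #include <fstream>
    #include <string>

    // Read the entire file into one buffer with a single large read,
    // then parse the buffer in memory instead of going back to the stream.
    std::string slurp(const char *filename)
    {
        std::ifstream in(filename, std::ios::binary);
        in.seekg(0, std::ios::end);
        std::string buf(static_cast<std::size_t>(in.tellg()), '\0');
        in.seekg(0, std::ios::beg);
        if (!buf.empty())
            in.read(&buf[0], buf.size());
        return buf;
    }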

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    The larch
    Join Date
    May 2006
    Posts
    3,573
    Run the tests with a smaller data set? (I assume you are testing changes to the program and it's not normal usage of the program to rewrite and recompile it...)
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  4. #4
    Registered User
    Join Date
    Nov 2006
    Posts
    519
    Quote Originally Posted by matsp View Post
It shouldn't take very long to read 300MB of data if you do it in relatively large blocks. Mats
Sometimes this is not so easy, for example if you recreate object sets using some serialization library.

Once I used boost::interprocess to hold the data set inside a designated process. The real reason, though, was to allow data processing by several processes without the need to load the data into every single process. But of course your problem would be solved too. Depending on your kind of data this method can be more or less tricky. For tree structures, for example, you have to take care of converting pointers and precisely manage the memory layout inside your processes. But Boost also has some support for that.
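
For a flat set of records it is not that hard; something like this rough sketch (segment and object names made up, sizes hard-coded, no error handling):
    Code:
    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/containers/vector.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>
    #include <ctime>

    namespace bip = boost::interprocess;

    struct Tick { std::time_t t; double bid; double ask; };  // plain data, no pointers

    typedef bip::allocator<Tick, bip::managed_shared_memory::segment_manager> TickAlloc;
    typedef bip::vector<Tick, TickAlloc> TickVec;

    // "Loader" process: parse the CSV once and keep this process alive all day.
    void load_once()
    {
        bip::managed_shared_memory seg(bip::create_only, "TickData", 400u * 1024 * 1024);
        TickVec *ticks = seg.construct<TickVec>("ticks")(TickAlloc(seg.get_segment_manager()));
        // ... parse the CSV here and push_back into *ticks ...
    }

    // Analysis process: attach to the already loaded data, nothing to re-read.
    void analyse()
    {
        bip::managed_shared_memory seg(bip::open_only, "TickData");
        const TickVec *ticks = seg.find<TickVec>("ticks").first;
        // ... run the analysis over *ticks ...
    }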

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by pheres View Post
Sometimes this is not so easy, for example if you recreate object sets using some serialization library.
True. But on the other hand, if the reason for it taking quite some time is that it's going through a serialization interface or some such, then perhaps that in itself is a bad design. All design decisions have to be measured as a compromise between performance, stability/robustness, ease of implementation and ability to extend/future-proof. Which is most important depends on the circumstances - but if the serialization protocol requires an inordinate amount of time compared to a more direct approach, and the file is normally very large, then perhaps using a more direct approach is the right way to go.

    I just wrote and read back 300MB of data, and it takes 7 seconds to write, and zero seconds to read - presumably because the read is done with all the data in the cache.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #6
    Registered User
    Join Date
    Dec 2008
    Posts
    48

    reply: how i read the file

    Quote Originally Posted by matsp View Post
How are you reading the file? How long does it take? My guess would be that reading the file is much less of a problem than your METHOD of reading the file (or some other processing that is part of reading the file). It shouldn't take very long to read 300MB of data if you do it in relatively large blocks.

    --
    Mats
It is a very simple CSV file with three fields:
    time, double, double

But I have 11,737,015 lines. Running "wc" on it takes more than 1 minute on my machine. The reader uses getline() and the following subroutine to decompose each line into fields.
    Code:
    void onecsvline(const string& linebuf, time_t& t, double& bid, double& ask) {
    	string partbuf;
    	istringstream s(linebuf);       // turn this line into a stream
    	std::getline(s, partbuf, ','); // first field: timestamp
    	istringstream ss(partbuf);
    	ss >> t;
    	std::getline(s, partbuf, ','); // second field: bid
    	istringstream sss(partbuf);
    	sss >> bid;
    	std::getline(s, partbuf, ','); // third field: ask
    	istringstream ssss(partbuf);
    	ssss >> ask;
    }
So for the whole file, I will call this routine 11,737,015 times. I don't quite get what you mean by reading in large blocks. Maybe that's exactly what I should do to improve efficiency. Please kindly show me an example. Thanks.

  7. #7
    Registered User
    Join Date
    Dec 2008
    Posts
    48
    Quote Originally Posted by anon View Post
    Run the tests with a smaller data set? (I assume you are testing changes to the program and it's not normal usage of the program to rewrite and recompile it...)
You are right. I have been testing it on a small set for the last few weeks. At this point, I need to test it over the large set more often, and it has become a pain in the neck.

  8. #8
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
1 minute is probably not that bad. I'm pretty sure, however, that your code would run a fair bit faster if you created fewer stringstreams and did a bit more "manual" parsing of the input.
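
Something like this (rough sketch, no error checking, assuming the time field is a plain number of seconds) avoids constructing four stringstreams per line:
    Code:
    #include <cstdlib>
    #include <ctime>

    // Parse one "time,bid,ask" line with strtol/strtod instead of
    // building an istringstream for every field.
    void onecsvline_fast(const char *line, std::time_t &t, double &bid, double &ask)
    {
        char *end;
        t   = static_cast<std::time_t>(std::strtol(line, &end, 10));
        bid = std::strtod(end + 1, &end);   // skip the ',' after the timestamp
        ask = std::strtod(end + 1, 0);      // skip the ',' after the bid
    }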

    Have you measured how much CPU-time your application is using during the processing of the file?

    One solution to "preload" the data is to use a memory mapped file (mmap in Linux/Unix, MapViewOfFile Function (Windows) in Windows).
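
On Linux the mapping itself looks roughly like this (sketch, function name made up, no error handling); the file contents then appear as ordinary read-only memory and stay in the OS cache between runs:
    Code:
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>

    // Map the whole file read-only; the OS keeps the pages cached between runs.
    const char *map_file(const char *filename, std::size_t &length)
    {
        int fd = open(filename, O_RDONLY);
        struct stat sb;
        fstat(fd, &sb);
        length = sb.st_size;
        void *p = mmap(0, length, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                 // the mapping stays valid after close()
        return static_cast<const char *>(p);
    }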

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  9. #9
    The larch
    Join Date
    May 2006
    Posts
    3,573
There's this programming challenges site, and it seems that C++ input routines may just be too slow for some problems. With large data sets one would use C I/O routines for extra speed (but then the input is known to be well-formatted in a simple way).
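
E.g. something like this (sketch; assumes the time field is a plain integer and the format never changes):
    Code:
    #include <cstdio>

    // Read "time,bid,ask" lines with fscanf instead of iostreams.
    void read_with_stdio(const char *filename)
    {
        std::FILE *fp = std::fopen(filename, "r");
        long t;
        double bid, ask;
        while (std::fscanf(fp, "%ld,%lf,%lf", &t, &bid, &ask) == 3) {
            // ... store or process one record ...
        }
        std::fclose(fp);
    }
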
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  10. #10
    Registered User
    Join Date
    Dec 2008
    Posts
    48
1 minute is not too bad, and I am not going to spend hours to squeeze a few seconds out of it. The problem is that I need to call the function many times (>=40?) during the day, and it is not fun to wait 1 minute or 50 seconds every time. Maybe the boost::interprocess mentioned earlier is a solution for me. Thanks guys!

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by patiobarbecue View Post
1 minute is not too bad, and I am not going to spend hours to squeeze a few seconds out of it. The problem is that I need to call the function many times (>=40?) during the day, and it is not fun to wait 1 minute or 50 seconds every time. Maybe the boost::interprocess mentioned earlier is a solution for me. Thanks guys!
    Sure, we all have that problem - it takes a while to compile, build and produce a ROM for me at work - probably 5-6 minutes "per go". And there's little I can do to improve that.

Rewriting the format of the file to make it easier to read may also help - just read the text file once and produce a binary "raw" data file that you can load straight into an array or vector - that should be a fair bit faster. It takes my machine about 7-9 seconds to read a 300MB binary file.
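
Roughly like this (sketch; struct name and layout made up to match your three columns, and the binary file is only meant to be read back on the same machine/compiler):
    Code:
    #include <fstream>
    #include <vector>
    #include <ctime>

    struct Tick { std::time_t t; double bid; double ask; };

    // One-off conversion: after parsing the CSV once, dump the records as raw bytes.
    void save_binary(const char *filename, const std::vector<Tick> &ticks)
    {
        std::ofstream out(filename, std::ios::binary);
        out.write(reinterpret_cast<const char *>(&ticks[0]),
                  ticks.size() * sizeof(Tick));
    }

    // Every later run: one big read straight into the vector, no parsing at all.
    void load_binary(const char *filename, std::vector<Tick> &ticks)
    {
        std::ifstream in(filename, std::ios::binary);
        in.seekg(0, std::ios::end);
        ticks.resize(static_cast<std::size_t>(in.tellg()) / sizeof(Tick));
        in.seekg(0, std::ios::beg);
        in.read(reinterpret_cast<char *>(&ticks[0]), ticks.size() * sizeof(Tick));
    }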

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    Registered User
    Join Date
    Dec 2008
    Posts
    48
Thanks, reading the binary file is much faster!

  13. #13
    Registered User
    Join Date
    Nov 2006
    Posts
    519
    Quote Originally Posted by matsp View Post
    One solution to "preload" the data is to use a memory mapped file (mmap in Linux/Unix, MapViewOfFile Function (Windows) in Windows).

    --
    Mats
Is it somehow possible to create a std fstream object on a memory-mapped file? That would solve the problem with the serialization library.
