Thread: what do you do to avoid loading large dataset again and again

  1. #1
    Registered User
    Join Date
    Dec 2008
    Posts
    48

    what do you do to avoid loading large dataset again and again

    hey, there,

At the beginning of my code, I first read a large data file and then do analysis of the dataset. I never change the dataset. The data file is large, >=300 MB. I frequently change the analysis part and run tests. Every time I need to load the large data file again, which has become unbearable.

In Matlab, I can read the data file at the beginning of the session and then do all kinds of analysis without needing to reload the data. So I am thinking of establishing a separate process which reads the data file and stays there all day responding to data requests from my analysis routine, but I am not sure how to do it in C++. Or maybe there is a better method.

Please kindly share with me how you cope with this situation. Thanks.

    Michael

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
How are you reading the file? How long does it take? My guess would be that reading the file is much less of a problem than your METHOD of reading the file (or some other processing that is part of reading the file). It shouldn't take very long to read 300MB of data if you do it in relatively large blocks.
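
For example, something along these lines (just a sketch, function name made up, no error handling) pulls the whole file into memory with one big read instead of many small ones, and you then parse the buffer in memory:
    Code:
    #include <fstream>
    #include <string>

    // Read the entire file into one buffer with a single large read,
    // then parse the buffer in memory instead of going back to the stream.
    std::string slurp(const char *filename)
    {
        std::ifstream in(filename, std::ios::binary);
        in.seekg(0, std::ios::end);
        std::string buf(static_cast<std::size_t>(in.tellg()), '\0');
        in.seekg(0, std::ios::beg);
        if (!buf.empty())
            in.read(&buf[0], buf.size());
        return buf;
    }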

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    The larch
    Join Date
    May 2006
    Posts
    3,573
    Run the tests with a smaller data set? (I assume you are testing changes to the program and it's not normal usage of the program to rewrite and recompile it...)
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  4. #4
    Registered User
    Join Date
    Nov 2006
    Posts
    519
    Quote Originally Posted by matsp View Post
It shouldn't take very long to read 300MB of data if you do it in relatively large blocks. Mats
Sometimes this is not so easy, for example if you recreate object sets using some serialization library.

Once I used boost::interprocess to hold the data set inside a designated process. The real reason, though, was to allow data processing by several processes without the need to load the data into every single process. But of course your problem would be solved too. Depending on your kind of data this method can be more or less tricky. For tree structures, for example, you have to take care of converting pointers and precisely manage the memory layout inside your processes. But Boost also has some support for that.
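
For a flat set of records it is not that hard; something like this rough sketch (segment and object names made up, sizes hard-coded, no error handling):
    Code:
    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/containers/vector.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>
    #include <ctime>

    namespace bip = boost::interprocess;

    struct Tick { std::time_t t; double bid; double ask; };  // plain data, no pointers

    typedef bip::allocator<Tick, bip::managed_shared_memory::segment_manager> TickAlloc;
    typedef bip::vector<Tick, TickAlloc> TickVec;

    // "Loader" process: parse the CSV once and keep this process alive all day.
    void load_once()
    {
        bip::managed_shared_memory seg(bip::create_only, "TickData", 400u * 1024 * 1024);
        TickVec *ticks = seg.construct<TickVec>("ticks")(TickAlloc(seg.get_segment_manager()));
        // ... parse the CSV here and push_back into *ticks ...
    }

    // Analysis process: attach to the already loaded data, nothing to re-read.
    void analyse()
    {
        bip::managed_shared_memory seg(bip::open_only, "TickData");
        const TickVec *ticks = seg.find<TickVec>("ticks").first;
        // ... run the analysis over *ticks ...
    }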

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by pheres View Post
Sometimes this is not so easy, for example if you recreate object sets using some serialization library.
True. But on the other hand, if the reason for it taking quite some time is that it's going through a serialization interface or some such, then perhaps that in itself is a bad design. All design decisions have to be measured as a compromise between performance, stability/robustness, ease of implementation and ability to extend/future-proof. Which is most important depends on the circumstances - but if the serialization protocol requires an inordinate amount of time compared to a more direct approach, and the file is normally very large, then perhaps using a more direct approach is the right way to go.

    I just wrote and read back 300MB of data, and it takes 7 seconds to write, and zero seconds to read - presumably because the read is done with all the data in the cache.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #6
    Registered User
    Join Date
    Dec 2008
    Posts
    48

    reply: how i read the file

    Quote Originally Posted by matsp View Post
How are you reading the file? How long does it take? My guess would be that reading the file is much less of a problem than your METHOD of reading the file (or some other processing that is part of reading the file). It shouldn't take very long to read 300MB of data if you do it in relatively large blocks.

    --
    Mats
It is a very simple CSV file with three fields:
    time, double, double

But I have 11,737,015 lines. Running "wc" on it takes more than 1 minute on my machine. The reader uses getline() and the following subroutine to decompose each line into fields.
    Code:
    void onecsvline(const string& linebuf, time_t& t, double& bid, double& ask) {
    	string partbuf;
    	istringstream s(linebuf);       // turn this line into a stream
    	std::getline(s, partbuf, ','); // first field: timestamp
    	istringstream ss(partbuf);
    	ss >> t;
    	std::getline(s, partbuf, ','); // second field: bid
    	istringstream sss(partbuf);
    	sss >> bid;
    	std::getline(s, partbuf, ','); // third field: ask
    	istringstream ssss(partbuf);
    	ssss >> ask;
    }
So for the whole file, I will call this routine 11,737,015 times. I don't quite get what you mean by reading in large blocks. Maybe that's exactly what I should do to improve efficiency. Please kindly show me an example. Thanks.

  7. #7
    Registered User
    Join Date
    Dec 2008
    Posts
    48
    Quote Originally Posted by anon View Post
    Run the tests with a smaller data set? (I assume you are testing changes to the program and it's not normal usage of the program to rewrite and recompile it...)
You are right. I have been testing it on a small set for the last few weeks. At this point, I need to test it over the large set more often, and it has become a pain in the neck.

  8. #8
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
1 minute is probably not that bad. I'm pretty sure, however, that your code would run a fair bit faster if you created fewer stringstreams and did a bit more "manual" parsing of the input.
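
Something like this (rough sketch, no error checking, assuming the time field is a plain number of seconds) avoids constructing four stringstreams per line:
    Code:
    #include <cstdlib>
    #include <ctime>

    // Parse one "time,bid,ask" line with strtol/strtod instead of
    // building an istringstream for every field.
    void onecsvline_fast(const char *line, std::time_t &t, double &bid, double &ask)
    {
        char *end;
        t   = static_cast<std::time_t>(std::strtol(line, &end, 10));
        bid = std::strtod(end + 1, &end);   // skip the ',' after the timestamp
        ask = std::strtod(end + 1, 0);      // skip the ',' after the bid
    }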

    Have you measured how much CPU-time your application is using during the processing of the file?

    One solution to "preload" the data is to use a memory mapped file (mmap in Linux/Unix, MapViewOfFile Function (Windows) in Windows).
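
On Linux the mapping itself looks roughly like this (sketch, function name made up, no error handling); the file contents then appear as ordinary read-only memory and stay in the OS cache between runs:
    Code:
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>

    // Map the whole file read-only; the OS keeps the pages cached between runs.
    const char *map_file(const char *filename, std::size_t &length)
    {
        int fd = open(filename, O_RDONLY);
        struct stat sb;
        fstat(fd, &sb);
        length = sb.st_size;
        void *p = mmap(0, length, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                 // the mapping stays valid after close()
        return static_cast<const char *>(p);
    }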

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  9. #9
    The larch
    Join Date
    May 2006
    Posts
    3,573
There's this programming challenges site, and it seems that C++ input routines may just be too slow for some problems. With large data sets one would use C I/O routines for extra speed (but then the input is known to be well-formatted in a simple way).
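
E.g. something like this (sketch; assumes the time field is a plain integer and the format never changes):
    Code:
    #include <cstdio>

    // Read "time,bid,ask" lines with fscanf instead of iostreams.
    void read_with_stdio(const char *filename)
    {
        std::FILE *fp = std::fopen(filename, "r");
        long t;
        double bid, ask;
        while (std::fscanf(fp, "%ld,%lf,%lf", &t, &bid, &ask) == 3) {
            // ... store or process one record ...
        }
        std::fclose(fp);
    }
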
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  10. #10
    Registered User
    Join Date
    Dec 2008
    Posts
    48
1 minute is not too bad, and I am not going to spend hours to squeeze a few seconds out of it. The problem is that I need to call the function many times (>=40?) during the day, and it is not fun to wait 1 minute or 50 seconds every time. Maybe the boost::interprocess mentioned earlier is a solution for me. Thanks guys!

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by patiobarbecue View Post
1 minute is not too bad, and I am not going to spend hours to squeeze a few seconds out of it. The problem is that I need to call the function many times (>=40?) during the day, and it is not fun to wait 1 minute or 50 seconds every time. Maybe the boost::interprocess mentioned earlier is a solution for me. Thanks guys!
    Sure, we all have that problem - it takes a while to compile, build and produce a ROM for me at work - probably 5-6 minutes "per go". And there's little I can do to improve that.

Rewriting the format of the file to make it easier to read may also help - just read the text file once and produce a binary "raw" data file that you can load straight into an array or vector - that should be a fair bit faster. It takes my machine about 7-9 seconds to read a 300MB binary file.
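
Roughly like this (sketch; struct name and layout made up to match your three columns, and the binary file is only meant to be read back on the same machine/compiler):
    Code:
    #include <fstream>
    #include <vector>
    #include <ctime>

    struct Tick { std::time_t t; double bid; double ask; };

    // One-off conversion: after parsing the CSV once, dump the records as raw bytes.
    void save_binary(const char *filename, const std::vector<Tick> &ticks)
    {
        std::ofstream out(filename, std::ios::binary);
        out.write(reinterpret_cast<const char *>(&ticks[0]),
                  ticks.size() * sizeof(Tick));
    }

    // Every later run: one big read straight into the vector, no parsing at all.
    void load_binary(const char *filename, std::vector<Tick> &ticks)
    {
        std::ifstream in(filename, std::ios::binary);
        in.seekg(0, std::ios::end);
        ticks.resize(static_cast<std::size_t>(in.tellg()) / sizeof(Tick));
        in.seekg(0, std::ios::beg);
        in.read(reinterpret_cast<char *>(&ticks[0]), ticks.size() * sizeof(Tick));
    }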

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    Registered User
    Join Date
    Dec 2008
    Posts
    48
Thanks, reading the binary file is much faster!

  13. #13
    Registered User
    Join Date
    Nov 2006
    Posts
    519
    Quote Originally Posted by matsp View Post
    One solution to "preload" the data is to use a memory mapped file (mmap in Linux/Unix, MapViewOfFile Function (Windows) in Windows).

    --
    Mats
Is it somehow possible to create a std fstream object on a memory-mapped file? That would solve the problem with the serialization library.
