Thread: How to read in files that are bigger than 3 gigabytes?

  1. #1
    Registered User
    Join Date
    Dec 2004
    Posts
    163

    How to read in files that are bigger than 3 gigabytes?

    I have an 18-gigabyte dataset, and I tried using

    ipFile.open()

    and

    getline()

    to read in the data, but it fails: I am using 32-bit Microsoft Visual Studio 2005, and 18 gigabytes exceeds the 2^32-byte address space. Do you have any ideas on how to solve this?

    Thank you.

  2. #2
    Cat without Hat CornedBee
    Join Date
    Apr 2003
    Posts
    8,895
    Read and process the file in chunks.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  3. #3
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    As CornedBee says.

    You could, in theory, get a machine with at least 20GB of memory, and a 64-bit OS, and you could read the whole file in. But that is probably fairly inefficient if [1] you do not absolutely need to have all elements at the same time.

    [1] The time it takes to load the file will be proportional to its size. Reading a small part at a time will take almost the same total time, and processing the chunks would be about the same amount of work as processing the whole file at once [most likely]. If you really want to do it well, use asynchronous (overlapped in Windows terms) IO to perform read and write operations whilst the processing is done, so that you have the requested data ready sooner.
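
    A minimal sketch of the chunked approach described above, assuming a C++11 compiler (newer than the Visual Studio 2005 mentioned in this thread). The name for_each_chunk and its callback signature are made up for illustration and do not come from any library. Only one buffer is held in memory, and the byte count is kept in a 64-bit integer. The overlapped IO refinement is not shown, and whether std::ifstream can open a file this large still depends on the runtime library.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Reads `path` sequentially in fixed-size chunks and hands each chunk to
// `process`. Returns the total number of bytes read, or -1 if the file
// could not be opened or `chunk_size` is zero. Memory use is bounded by
// `chunk_size`, not by the size of the file.
std::int64_t for_each_chunk(
    const std::string& path,
    std::size_t chunk_size,
    const std::function<void(const char*, std::size_t)>& process)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in || chunk_size == 0) return -1;

    std::vector<char> buffer(chunk_size);
    std::int64_t total = 0;  // 64-bit so the count survives files over 4 GB

    for (;;) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();  // short on the last chunk
        if (got <= 0) break;                      // nothing left to read
        process(buffer.data(), static_cast<std::size_t>(got));
        total += got;
    }
    return total;
}
```

    A call such as for_each_chunk("data.bin", 1 << 20, handler) would pass the file to handler in pieces of at most 1 MiB; the file name and handler are placeholders.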

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  4. #4
    Registered User
    Join Date
    Dec 2004
    Posts
    163
    But what if the file is given by someone else? How do I cut it up into chunks?

    I am reading up on this on the internet now, and it seems that Java is able to do this. But is C++ able to do this in a Linux environment,

    using fread64 and fwrite64 and setting the O_LARGEFILE flag in gcc?

    And does Windows XP allow this?

  5. #5
    The superhaterodyne twomers
    Join Date
    Dec 2005
    Location
    Ireland
    Posts
    2,273
    What do you have to do with the file?

    Consider a DVD. The media player you use only buffers so much of it and continuously updates what it'll need in the future. Do this if possible.

  6. #6
    Registered User C_ntua
    Join Date
    Jun 2008
    Posts
    1,853
    Of course it is possible. But the problem seems to be how a file pointer can store a position in such a big file when pointers are generally 4 bytes on a 32-bit system. Is there something I am missing?

  7. #7
    Cat without Hat CornedBee
    Join Date
    Apr 2003
    Posts
    8,895
    All modern OSs store file positions as at least a 64-bit number internally. You'd need a very large file to confuse such an OS.
    Of course, whether your file API and program can handle those numbers is a different question. However, as long as you only read the file sequentially in reasonably-sized chunks (a few kilobytes to a few hundred megabytes), that's not an issue.

  8. #8
    Banned master5001
    Join Date
    Aug 2001
    Location
    Visalia, CA, USA
    Posts
    3,685
    fopen64()... You have several OS specific solutions as well. Using system calls only, this is not actually too difficult of a task at all. However, since portability may be an issue for you, maybe you do not want such a platform specific option.

  9. #9
    Registered User
    Join Date
    Jul 2003
    Posts
    110
    Quote Originally Posted by master5001
    fopen64()... You have several OS specific solutions as well. Using system calls only, this is not actually too difficult of a task at all. However, since portability may be an issue for you, maybe you do not want such a platform specific option.
    Then you can derive your own streambuffer from std::streambuf and isolate the specific operations. There's no obvious need to stitch that stuff directly into the program. I believe you can still use iostreams to do this if your OS provides the basic tools like fopen64.

  10. #10
    Banned master5001
    Join Date
    Aug 2001
    Location
    Visalia, CA, USA
    Posts
    3,685
    True. I am not a huge fan of the STL methods of file handling... I tend to just use the standard C functions (everyone can say whatever they want, I am just kind of set in my ways... and I do use the STL file streams sometimes... It's just a matter of archaic preference).

  11. #11
    Registered User
    Join Date
    Dec 2004
    Posts
    163
    Standard C or C++ functions do not work for a file size of 13 GB.

    I use simple code, and it can't even read the first line of the file. Each line in the input file is at most 100 characters.
    Code:
        string line;
        ifstream ipfile1;
        ipfile1.open( result_file.c_str() );
        if (!ipfile1.is_open())
        {
            cout << "Input file cannot be opened \n";
            exit(1);    // non-zero status: the open failed
        }
        // Count the lines; a 64-bit counter is safe for a file this large
        long long row_num1 = 0;
        while (getline(ipfile1, line))
            row_num1++;
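
    For comparison, here is a sketch of the same line count done over binary chunks instead of getline(), assuming C++11. The name count_lines and the 1 MiB default chunk size are arbitrary choices. It avoids building a string per line and keeps the count in 64 bits, but it still goes through std::ifstream, so it will not help if the runtime library itself cannot open the file.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Counts lines in `path` by scanning fixed-size binary chunks for '\n'.
// Returns -1 if the file cannot be opened. A final line with no trailing
// newline is counted too, which matches what a getline() loop reports.
std::int64_t count_lines(const std::string& path,
                         std::size_t chunk_size = 1 << 20)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return -1;
    if (chunk_size == 0) chunk_size = 1;

    std::vector<char> buffer(chunk_size);
    std::int64_t lines = 0;
    char last = '\n';  // so that an empty file yields zero lines

    for (;;) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;
        lines += std::count(buffer.data(), buffer.data() + got, '\n');
        last = buffer[static_cast<std::size_t>(got) - 1];
    }
    if (last != '\n') ++lines;  // unterminated last line
    return lines;
}
```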

  12. #12
    and the Hat of Guessing tabstop
    Join Date
    Nov 2007
    Posts
    14,336
    I'll take a guess (I like guessing): you didn't strip off the newline at the end of result_file before you tried to use it to open ipfile1.

    (Or in other words: can you open a 100-byte file with the same code? If not, the problem isn't the size.)
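
    If the guess above turns out to be right, stripping the file name before opening could look like the sketch below. The helper name strip_trailing_newline is made up for illustration, and nothing in the thread confirms that result_file actually ends in a newline.

```cpp
#include <string>

// Returns `name` with any trailing '\n' or '\r' characters removed.
// A name made up only of those characters becomes the empty string.
std::string strip_trailing_newline(std::string name)
{
    const std::string::size_type end = name.find_last_not_of("\r\n");
    name.erase(end == std::string::npos ? 0 : end + 1);
    return name;
}
```

    The open call would then use strip_trailing_newline(result_file).c_str() instead of result_file.c_str().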

  13. #13
    Registered User
    Join Date
    Dec 2004
    Posts
    163
    Quote Originally Posted by tabstop
    I'll take a guess (I like guessing): you didn't strip off the newline at the end of result_file before you tried to use it to open ipfile1.

    (Or in other words: can you open a 100-byte file with the same code? If not, the problem isn't the size.)
    Sorry, I don't really understand what you mean. I can use this code to open files of a few hundred megabytes. The code counts the number of lines in result_file, where each line ends with a newline.

  14. #14
    Registered User
    Join Date
    Jul 2003
    Posts
    110
    Quote Originally Posted by franziss
    Sorry, I don't really understand what you mean. I can use this code to open files of a few hundred megabytes. The code counts the number of lines in result_file, where each line ends with a newline.
    I believe tabstop was suspecting that your string result_file had a newline appended to it, and the OS couldn't find that file.

    I think your suspicions are probably right though. The implementation of the C and C++ IO libraries for your platform seems to choke on files larger than 4GB, if this is correct.

    It's not a loss though, you can try to find an implementation that uses CreateFile, ReadFile, and WriteFile (Windows API functions) instead of the C standard library ones. If you can't, it's not that hard to write your own in a pinch if you know iostreams already. If you don't, is now a good time to learn it? If all else fails, maybe Boost has something, but you'd probably have to rewrite some of your program to use it.

  15. #15
    Cat without Hat CornedBee
    Join Date
    Apr 2003
    Posts
    8,895
    Boost.IOStreams with a file_source can probably fill the gap.
