Thread: How to Handle heavy input files (~500MB)?

  1. #1
    Devil™
    Join Date
    Oct 2007
    Location
    IIT-Kharagpur, India
    Posts
    104

    How to Handle heavy input files (~500MB)?

    I want to know how to handle heavy input files that are around 500MB.

    The normal school way of reading everything into memory won't be good,
    and it sometimes fails if the library can't go to that extent.

    How is this done in industry?

    Example:
    Write a program to scan a text file of ~500MB holding a sequence of numbers that can be read into an array, then use quicksort to generate the output. The algorithm reads and writes the same file.
    What is the efficient way to do it?
    Efficient in the sense that it won't get too heavy on memory.

    Can we force the paging of our program with some defined criteria?

    And as the last thing:
    what if the file format is not normal text, but a compressed file with its own spec,
    where it is not easy to get something meaningful from inside the file without going through everything from the start up to that point?
    C's Motto: who cares what it means? I just compile it!!

  2. #2
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> The normal school way of reading everything into memory won't be good
    That depends on what you mean by "normal school way". If you mean reading the entire file into memory, then that may indeed not be good for a 500MB file - depending on available resources.
    If you mean using C/C++ standard file I/O - that's only as good as the implementation. If you're using the MSCRT it shouldn't be a problem.

    >> and it sometimes fails if the library can't go to that extent
    I'm assuming you're talking about going beyond the maximum file size that the standard I/O functions can represent - typically 2^31-1 bytes, since the offsets are a signed 32-bit long - so we're OK there with a 500MB file.

    >> Efficient in the sense that it won't get too heavy on memory
    The easy solution is to simply think of the file itself as memory. Use fseek() to move your read/write pointer and perform the read or write. Since quicksort is a "divide and conquer" algorithm, you can probably work on "chunks" at a time.
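
    A minimal sketch of that idea, assuming the numbers are raw 4-byte ints in a file opened with fopen(path, "r+b") - the record layout is an assumption, purely for illustration:
    Code:
    #include <stdio.h>

    /* Fetch the i-th integer "from memory", i.e. from the file. */
    int get_at(FILE *f, long i)
    {
        int v = 0;
        fseek(f, i * (long)sizeof v, SEEK_SET);  /* move the read/write pointer */
        fread(&v, sizeof v, 1, f);
        return v;
    }

    /* Store the i-th integer back to the file. */
    void put_at(FILE *f, long i, int v)
    {
        fseek(f, i * (long)sizeof v, SEEK_SET);  /* the seek also satisfies C's rule */
        fwrite(&v, sizeof v, 1, f);              /* for switching between read/write */
    }

    With these two, a quicksort partition can compare and swap elements "on file" without ever holding the whole array in memory - slow, but very light on RAM.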

    >> what if the file format is not normal text, but a compressed file with its own spec?
    In the end - you got to do what you got to do (how's that for an answer?). You may be forced to uncompress the file(s) and even split them up into manageable pieces. Or figure out how to update the file in-place.

    >> How is this done in industry?
    Well, I typically look at "correctness" first, and then performance. Once you have something that works, you can then think about multiple threads - disk-bound threads for reading and writing, and CPU-bound threads for the quicksort. Your bottleneck is still going to be the disk, but at least you can do some sorting while reading and writing.
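
    A rough sketch of that pipeline, using POSIX threads purely for illustration (an assumption - on Windows you'd use CreateThread and an event, but the structure is the same). One buffer slot is handed back and forth between a disk-bound reader and a CPU-bound sorter:
    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK (1 << 20)                    /* ints per chunk */

    static struct {                            /* single hand-off slot */
        int *data;
        size_t count;                          /* 0 signals end of input */
        int full;
        pthread_mutex_t m;
        pthread_cond_t cv;
    } slot = { NULL, 0, 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    static void *reader(void *arg)             /* disk-bound thread */
    {
        FILE *in = arg;
        size_t n;
        do {
            int *buf = malloc(CHUNK * sizeof *buf);
            n = fread(buf, sizeof *buf, CHUNK, in);
            pthread_mutex_lock(&slot.m);
            while (slot.full)
                pthread_cond_wait(&slot.cv, &slot.m);
            slot.data = buf; slot.count = n; slot.full = 1;
            pthread_cond_signal(&slot.cv);
            pthread_mutex_unlock(&slot.m);
        } while (n > 0);
        return NULL;
    }

    int main(void)                             /* CPU-bound thread */
    {
        FILE *in = fopen("numbers.bin", "rb"); /* hypothetical input file */
        if (!in) return 1;
        pthread_t t;
        pthread_create(&t, NULL, reader, in);
        for (;;) {
            pthread_mutex_lock(&slot.m);
            while (!slot.full)
                pthread_cond_wait(&slot.cv, &slot.m);
            int *buf = slot.data;
            size_t n = slot.count;
            slot.full = 0;
            pthread_cond_signal(&slot.cv);
            pthread_mutex_unlock(&slot.m);
            if (n == 0) { free(buf); break; }
            qsort(buf, n, sizeof *buf, cmp_int);  /* sort while the disk reads ahead */
            /* ...write the sorted chunk out as a run for a later merge... */
            free(buf);
        }
        pthread_join(t, NULL);
        fclose(in);
        return 0;
    }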

    gg

  3. #3
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    1. Allocate a buffer of a pre-determined size
    2. Set filepos to 0
    3. Do a block read - length of buffer size
    4. Increment filepos by bytes read
    5. Do something with data
    6. Write out new data to new file
    7. Wash, rinse, repeat for length of file.
    8. Save old filename and delete file.
    9. Rename new data file to old filename.


    This is a simple caching scheme; a bare-bones version of the loop is sketched below. What you should do, though, depends on a lot of different variables. What are you doing with the data? What is the format of the data?
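
    A minimal sketch of those nine steps, assuming a plain binary file (the names and buffer size are made up; error checks trimmed):
    Code:
    #include <stdio.h>

    #define BUFSZ (4 * 1024 * 1024)            /* step 1: pre-determined buffer */

    int process_file(const char *oldname, const char *newname)
    {
        static unsigned char buf[BUFSZ];
        FILE *in  = fopen(oldname, "rb");      /* step 2: filepos starts at 0 */
        FILE *out = fopen(newname, "wb");
        if (!in || !out) return -1;

        size_t n;
        while ((n = fread(buf, 1, BUFSZ, in)) > 0) {   /* steps 3-4 */
            /* step 5: do something with buf[0..n-1] here */
            fwrite(buf, 1, n, out);                    /* step 6 */
        }                                              /* step 7: repeat to EOF */
        fclose(in);
        fclose(out);
        remove(oldname);                               /* step 8 */
        return rename(newname, oldname);               /* step 9 */
    }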

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    You can't sort the entire file content using quicksort unless the file content - all ~500MB of it - is read into memory. [Unless you do the quicksort "on file" - which is of course possible, but rather slow, because quicksort accesses each element of the data multiple times, which means reading and writing the data multiple times.]

    Of course, if it's a text file containing numbers, then each number most likely takes more space as text than it would in binary [unless ALL the numbers are really small integers]. A 32-bit integer [up to +/- 2000000000 and a bit] can be stored in 4 bytes, but takes up on average just under 11 bytes on disk as text, so a 500MB file will contain about 50M numbers, which makes 200MB of binary data - something a reasonable PC should be able to cope with.
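
    In code, that conversion is a one-pass preprocessing step - a sketch with made-up file names, turning ~11 bytes of text per number into 4 bytes of binary:
    Code:
    #include <stdio.h>

    int text_to_binary(const char *txtname, const char *binname)
    {
        FILE *in  = fopen(txtname, "r");
        FILE *out = fopen(binname, "wb");
        if (!in || !out) return -1;

        int v;
        while (fscanf(in, "%d", &v) == 1)   /* ~11 bytes of text... */
            fwrite(&v, sizeof v, 1, out);   /* ...become 4 bytes of binary */

        fclose(in);
        fclose(out);
        return 0;
    }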

    Having said that, since I suspect this is an assignment from school, the solution I suggest MAY NOT be what the teacher/professor/whatever has in mind - you may want to discuss your proposed solution with your teacher before you go and implement it.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Perhaps the object of the exercise is to sort the file in fragments, then merge the results together.
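
    A sketch of the first half - sorting in fragments ("runs") - with a made-up run size and file names, purely for illustration:
    Code:
    #include <stdio.h>
    #include <stdlib.h>

    #define RUN_INTS (8 * 1024 * 1024)      /* ~32MB of ints per run */

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Read as many numbers as fit in RAM, qsort them, and write each
       sorted run to its own temp file. Returns the number of runs. */
    int make_runs(const char *path)
    {
        FILE *in = fopen(path, "r");
        if (!in) return -1;
        int *buf = malloc(RUN_INTS * sizeof *buf);
        int run = 0;
        size_t n;
        do {
            for (n = 0; n < RUN_INTS && fscanf(in, "%d", &buf[n]) == 1; n++)
                ;
            if (n == 0) break;
            qsort(buf, n, sizeof *buf, cmp_int);
            char name[32];
            sprintf(name, "run%03d.tmp", run++);
            FILE *out = fopen(name, "w");
            for (size_t i = 0; i < n; i++)
                fprintf(out, "%d\n", buf[i]);
            fclose(out);
        } while (n == RUN_INTS);            /* a short read means EOF */
        free(buf);
        fclose(in);
        return run;
    }
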
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    How is this done in industry?
    It's done by using a Relational Database Management System such as SQL Server, Oracle etc. The data type would be a BLOB (Binary Large Object). Storing, retrieving and scanning the BLOB data would be done using ActiveX Data Objects (ADO). More specifically, AppendChunk would write the data to the BLOB and GetChunk would read the data from the BLOB.

    The equivalent can be done using MS Access, since the OLE Object type is a BLOB. This is a quick and dirty solution to your problem. I concur with matsp about presenting your proposed solution to your instructor for review prior to implementing it.

  7. #7
    Crazy Fool Perspective's Avatar
    Join Date
    Jan 2003
    Location
    Canada
    Posts
    2,640
    You don't need an RDBMS to do this. Salem has the right idea, you read chunks of the file in one at a time, sort it, then write it out to disk. Each chunk is typically called a "run". You then merge your runs into a final file. So long as you can buffer one item from each run in memory you can sort the file efficiently (otherwise you end up with a huge I/O overhead finding the max/min element at each merge step).
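
    The merge step, sketched to match the made-up run file names from the sketch a few posts up (an assumption): exactly one number per run is buffered in memory, and the smallest is emitted each time. A linear scan is used for clarity, with the run count capped at 64; with many runs a heap is the usual choice.
    Code:
    #include <stdio.h>

    int merge_runs(int nruns, FILE *out)
    {
        FILE *run[64];
        int head[64], alive[64];
        if (nruns > 64) return -1;          /* sketch limit */

        /* Buffer exactly one item from each run. */
        for (int i = 0; i < nruns; i++) {
            char name[32];
            sprintf(name, "run%03d.tmp", i);
            run[i] = fopen(name, "r");
            alive[i] = run[i] && fscanf(run[i], "%d", &head[i]) == 1;
        }
        for (;;) {
            int best = -1;                  /* find the smallest buffered item */
            for (int i = 0; i < nruns; i++)
                if (alive[i] && (best < 0 || head[i] < head[best]))
                    best = i;
            if (best < 0) break;            /* every run is exhausted */
            fprintf(out, "%d\n", head[best]);
            alive[best] = fscanf(run[best], "%d", &head[best]) == 1;
        }
        for (int i = 0; i < nruns; i++)
            if (run[i]) fclose(run[i]);
        return 0;
    }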

  8. #8
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    The "memory management for large file editors" tutorial is something that can be modified to do what you need it to do.

  9. #9
    Crazy Fool Perspective's Avatar
    Join Date
    Jan 2003
    Location
    Canada
    Posts
    2,640
    Here's what I'm talking about: http://en.wikipedia.org/wiki/External_sort

  10. #10
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    If the OP reviews the "memory management for large file editors" tutorial, he'll realize how simple it is to read these humongous files. If not, then he'll have to check out managing memory-mapped files with Win32.
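
    For the memory-mapped route, a minimal Win32 sketch (the file name is made up; error handling trimmed). Nothing is read until a page is touched, so the OS does the paging for you:
    Code:
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE file = CreateFileA("ikonos_scene.tif", GENERIC_READ,
                                  FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                  FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE) return 1;

        HANDLE map = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!map) { CloseHandle(file); return 1; }

        /* Map the whole file; a 500MB view fits in a 32-bit address
           space. For bigger files, map smaller views at offsets that
           are multiples of the allocation granularity (usually 64KB). */
        const unsigned char *view = MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
        if (view) {
            printf("first byte: 0x%02x\n", view[0]);  /* faults the page in */
            UnmapViewOfFile(view);
        }
        CloseHandle(map);
        CloseHandle(file);
        return 0;
    }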

  11. #11
    train spotter
    Join Date
    Aug 2001
    Location
    near a computer
    Posts
    3,868
    Quote Originally Posted by BobS0327 View Post
    The equivalent can be done using MS Access, since the OLE Object type is a BLOB. This is a quick and dirty solution to your problem. I concur with matsp about presenting your proposed solution to your instructor for review prior to implementing it.
    Access is a toy DB and would probably choke on a 500MB BLOB...


    BLOBs store the raw data; they don't process it into a usable format, so I'm not sure this satisfies the OP's requirements.

    Large / numerous BLOBs are not a good idea if you want to back the DB up often (in a reasonable timeframe).
    "Man alone suffers so excruciatingly in the world that he was compelled to invent laughter."
    Friedrich Nietzsche

    "I spent a lot of my money on booze, birds and fast cars......the rest I squandered."
    George Best

    "If you are going through hell....keep going."
    Winston Churchill

  12. #12
    Devil™
    Join Date
    Oct 2007
    Location
    IIT-Kharagpur, India
    Posts
    104
    Sorry for the late reply, I had exams.
    Well, I think I should have been clearer, so I'll explain my problem.

    I am trying to make a program that semi-automatically extracts roads, rivers, structures etc. from a high-resolution IKONOS satellite image using some algorithm.
    The satellite image is in TIFF format and is typically around 500MB.

    I didn't want to get my hands dirty digging into the file spec and writing my own imageload() etc.,
    so I used the default Windows libraries and free/open-source image manipulation libraries, with the program in C# on .NET 2.0.
    But they all ended the same way: the app stops responding - at least it didn't respond for ~8 hours, so I killed it.
    The same app works fine with a 10MB TIFF (cut from the original 500MB image).

    So I have no idea how this could be done to handle heavy files.

    I was wondering whether the libraries I used have some memory limit - all of them load the image into memory directly, so that could be the problem.

    The app has a GUI, so I need to read the file and store it in a form that makes the data easy to retrieve.
    The user clicks on the image, the program extracts the structure using the algorithm, builds a temporary map of it, and saves the map at the end in TIFF format.
    E.g.: if the user clicks on a road in the satellite image, the program scans the image to extend the road as far as it can find it, and writes a new TIFF file of the created map.

    So I need to handle two TIFF images of nearly equal size that must be easily accessible -
    something like the map browser in Google Maps.

    That's where I'm stuck: how can I possibly do this without getting everything into memory?
    I was thinking of some solution like paging control.
    It might be possible to control the virtual memory used by my app, so that I keep everything in VM and pull it back when I want with a simple array-like call.

    Reading an ASCII file is simple enough, but this file format seems complex, and I don't know whether all its data can be read by jumping to a particular section of the file.

    I can't use the SQL/RDBMS solution, as this app should be portable and light.
    Memory-mapped files seem promising - I will go through the docs given.

    I am the only one working on this, and I want to get it done as simply as possible - this is, after all, a training project.
    As for my trainer: he's no help at all, and the people there working on other projects are too busy figuring out how to make a program show an exit confirmation in VB. I can't expect anything from them either.

    Can anyone help me with this?
    C's Motto: who cares what it means? I just compile it!!

  13. #13
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Here are some options (option 2 is sketched below the list):
    1) Learn TIFF file format so you can read in and process reasonable sized chunks at a time.
    2) Find and use a TIFF library capable of handling large files.
    3) Find and use (or create) a TIFF image splitter to break down the image into multiple images which your .NET app can process.
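
    A minimal sketch of option 2 using libtiff, which reads one scanline at a time so only a single row of pixels is ever in memory. This assumes a strip-based rather than tiled TIFF, and the file name is made up:
    Code:
    #include <tiffio.h>

    int scan_image(void)
    {
        TIFF *tif = TIFFOpen("ikonos_scene.tif", "r");
        if (!tif) return -1;

        uint32 width = 0, height = 0;
        TIFFGetField(tif, TIFFTAG_IMAGEWIDTH,  &width);
        TIFFGetField(tif, TIFFTAG_IMAGELENGTH, &height);

        tdata_t buf = _TIFFmalloc(TIFFScanlineSize(tif));
        for (uint32 row = 0; row < height; row++) {
            TIFFReadScanline(tif, buf, row, 0);
            /* ...run the road/river extraction on this one row... */
        }
        _TIFFfree(buf);
        TIFFClose(tif);
        return 0;
    }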

    gg

  14. #14
    Devil™
    Join Date
    Oct 2007
    Location
    IIT-Kharagpur, India
    Posts
    104
    Quote Originally Posted by Codeplug View Post
    Here are some options:
    1) Learn TIFF file format so you can read in and process reasonable sized chunks at a time.
    2) Find and use a TIFF library capable of handling large files.
    3) Find and use (or create) a TIFF image splitter to break down the image into multiple images which your .NET app can process.

    gg
    Sounds good, but tough.
    Is there any easier way, at the cost of efficiency?

    One more question:
    can I manipulate the paging system for my program? i.e. force some pages to load/unload to memory based on my own requirements?
    C's Motto: who cares what it means? I just compile it!!

  15. #15
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> can I manipulate the paging system for my program?
    Sure.
    http://msdn2.microsoft.com/en-us/library/ms810613.aspx
    http://msdn2.microsoft.com/en-us/library/ms810627.aspx

    Although that implies that you'll be reading the TIFF yourself. So you'll have to come up to speed with the TIFF format and Win32 memory management.
    (2) and/or (3) would be easier.
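
    If you really do want to pin specific pages yourself, Win32's VirtualLock/VirtualUnlock is about as close as it gets - a tiny sketch. Treat it as a last resort: locking large ranges needs an enlarged working-set quota, and the mapped views in the articles above are usually enough.
    Code:
    #include <windows.h>

    void pin_window(void *addr, SIZE_T len)
    {
        VirtualLock(addr, len);     /* fault the pages in and keep them resident */
        /* ...work on the hot region... */
        VirtualUnlock(addr, len);   /* let the OS page them out again */
    }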

    gg
