Thread: How to Handle heavy input files (~500MB)?

  1. #1
    Devil™
    Join Date
    Oct 2007
    Location
    IIT-Kharagpur, India
    Posts
    104

    How to Handle heavy input files (~500MB)?

    I want to know how to handle heavy input files that are around 500MB.

    The normal school way of reading everything into memory won't be good,
    and it sometimes fails if the library can't go to that extent.

    How is this done in industry?

    Example:
    Write a program to scan a text file of ~500MB holding a sequence of numbers that can be read into an array, then use quicksort to generate the output. The algorithm reads and writes the same file.
    What is the efficient way to do it?
    Efficient in the sense that it won't get too heavy on memory.

    Can we force the paging of our program with some defined criteria?

    And as the last thing:
    what if the file format is not normal text, but a compressed file with its own spec,
    where it is not easy to get something meaningful from inside the file without going through everything from the start up to that point?
    C's Motto: who cares what it means? I just compile it!!

  2. #2
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> The normal school way of reading everything into memory won't be good
    That depends on what you mean by "normal school way". If you mean reading the entire file into memory, then that may indeed not be good for a 500MB file - depending on available resources.
    If you mean using C/C++ standard file I/O - that's only as good as the implementation. If you're using the MSCRT it shouldn't be a problem.

    >> and it sometimes fails if the library can't go to that extent
    I'm assuming you're talking about going beyond the maximum file size that the standard I/O functions can represent - typically 2^31-1 bytes, since the offsets are a signed 32-bit long - so we're OK there with a 500MB file.

    >> Efficient in the sense that it won't get too heavy on memory
    The easy solution is to simply think of the file itself as memory. Use fseek() to move your read/write pointer and perform the read or write. Since quicksort is a "divide and conquer" algorithm, you can probably work on "chunks" at a time.
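
    A minimal sketch of that idea, assuming the numbers are raw 4-byte ints in a file opened with fopen(path, "r+b") - the record layout is an assumption, purely for illustration:
    Code:
    #include <stdio.h>

    /* Fetch the i-th integer "from memory", i.e. from the file. */
    int get_at(FILE *f, long i)
    {
        int v = 0;
        fseek(f, i * (long)sizeof v, SEEK_SET);  /* move the read/write pointer */
        fread(&v, sizeof v, 1, f);
        return v;
    }

    /* Store the i-th integer back to the file. */
    void put_at(FILE *f, long i, int v)
    {
        fseek(f, i * (long)sizeof v, SEEK_SET);  /* the seek also satisfies C's rule */
        fwrite(&v, sizeof v, 1, f);              /* for switching between read/write */
    }

    With these two, a quicksort partition can compare and swap elements "on file" without ever holding the whole array in memory - slow, but very light on RAM.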

    >> what if the file format is not normal text, but a compressed file with its own spec?
    In the end - you got to do what you got to do (how's that for an answer?). You may be forced to uncompress the file(s) and even split them up into manageable pieces. Or figure out how to update the file in-place.

    >> How is this done in industry?
    Well, I typically look at "correctness" first, and then performance. Once you have something that works, you can then think about multiple threads - disk-bound threads for reading and writing, and CPU-bound threads for the quicksort. Your bottleneck is still going to be the disk, but at least you can do some sorting while reading and writing.
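
    A rough sketch of that pipeline, using POSIX threads purely for illustration (an assumption - on Windows you'd use CreateThread and an event, but the structure is the same). One buffer slot is handed back and forth between a disk-bound reader and a CPU-bound sorter:
    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK (1 << 20)                    /* ints per chunk */

    static struct {                            /* single hand-off slot */
        int *data;
        size_t count;                          /* 0 signals end of input */
        int full;
        pthread_mutex_t m;
        pthread_cond_t cv;
    } slot = { NULL, 0, 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    static void *reader(void *arg)             /* disk-bound thread */
    {
        FILE *in = arg;
        size_t n;
        do {
            int *buf = malloc(CHUNK * sizeof *buf);
            n = fread(buf, sizeof *buf, CHUNK, in);
            pthread_mutex_lock(&slot.m);
            while (slot.full)
                pthread_cond_wait(&slot.cv, &slot.m);
            slot.data = buf; slot.count = n; slot.full = 1;
            pthread_cond_signal(&slot.cv);
            pthread_mutex_unlock(&slot.m);
        } while (n > 0);
        return NULL;
    }

    int main(void)                             /* CPU-bound thread */
    {
        FILE *in = fopen("numbers.bin", "rb"); /* hypothetical input file */
        if (!in) return 1;
        pthread_t t;
        pthread_create(&t, NULL, reader, in);
        for (;;) {
            pthread_mutex_lock(&slot.m);
            while (!slot.full)
                pthread_cond_wait(&slot.cv, &slot.m);
            int *buf = slot.data;
            size_t n = slot.count;
            slot.full = 0;
            pthread_cond_signal(&slot.cv);
            pthread_mutex_unlock(&slot.m);
            if (n == 0) { free(buf); break; }
            qsort(buf, n, sizeof *buf, cmp_int);  /* sort while the disk reads ahead */
            /* ...write the sorted chunk out as a run for a later merge... */
            free(buf);
        }
        pthread_join(t, NULL);
        fclose(in);
        return 0;
    }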

    gg

  3. #3
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    1. Allocate a buffer of a pre-determined size
    2. Set filepos to 0
    3. Do a block read - length of buffer size
    4. Increment filepos by bytes read
    5. Do something with data
    6. Write out new data to new file
    7. Wash, rinse, repeat for length of file.
    8. Save old filename and delete file.
    9. Rename new data file to old filename.


    This is a simple caching scheme; a bare-bones version of the loop is sketched below. What you should do, though, depends on a lot of different variables. What are you doing with the data? What is the format of the data?
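
    A minimal sketch of those nine steps, assuming a plain binary file (the names and buffer size are made up; error checks trimmed):
    Code:
    #include <stdio.h>

    #define BUFSZ (4 * 1024 * 1024)            /* step 1: pre-determined buffer */

    int process_file(const char *oldname, const char *newname)
    {
        static unsigned char buf[BUFSZ];
        FILE *in  = fopen(oldname, "rb");      /* step 2: filepos starts at 0 */
        FILE *out = fopen(newname, "wb");
        if (!in || !out) return -1;

        size_t n;
        while ((n = fread(buf, 1, BUFSZ, in)) > 0) {   /* steps 3-4 */
            /* step 5: do something with buf[0..n-1] here */
            fwrite(buf, 1, n, out);                    /* step 6 */
        }                                              /* step 7: repeat to EOF */
        fclose(in);
        fclose(out);
        remove(oldname);                               /* step 8 */
        return rename(newname, oldname);               /* step 9 */
    }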

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    You can't sort the entire file content using quicksort unless the file content - all ~500MB of it - is read into memory. [Unless you do the quicksort "on file" - which is of course possible, but rather slow, because quicksort accesses each element of the data multiple times, which means reading and writing the data multiple times.]

    Of course, if it's a text file containing numbers, then each number most likely takes more space as text than it would in binary [unless ALL the numbers are really small integers]. A 32-bit integer [up to +/- 2000000000 and a bit] can be stored in 4 bytes, but takes up on average just under 11 bytes on disk as text, so a 500MB file will contain about 50M numbers, which makes 200MB of binary data - something a reasonable PC should be able to cope with.
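
    In code, that conversion is a one-pass preprocessing step - a sketch with made-up file names, turning ~11 bytes of text per number into 4 bytes of binary:
    Code:
    #include <stdio.h>

    int text_to_binary(const char *txtname, const char *binname)
    {
        FILE *in  = fopen(txtname, "r");
        FILE *out = fopen(binname, "wb");
        if (!in || !out) return -1;

        int v;
        while (fscanf(in, "%d", &v) == 1)   /* ~11 bytes of text... */
            fwrite(&v, sizeof v, 1, out);   /* ...become 4 bytes of binary */

        fclose(in);
        fclose(out);
        return 0;
    }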

    Having said that, since I suspect this is an assignment from school, the solution I suggest MAY NOT be what the teacher/professor/whatever has in mind - you may want to discuss your proposed solution with your teacher before you go and implement it.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Perhaps the object of the exercise is to sort the file in fragments, then merge the results together.
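
    A sketch of the first half - sorting in fragments ("runs") - with a made-up run size and file names, purely for illustration:
    Code:
    #include <stdio.h>
    #include <stdlib.h>

    #define RUN_INTS (8 * 1024 * 1024)      /* ~32MB of ints per run */

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Read as many numbers as fit in RAM, qsort them, and write each
       sorted run to its own temp file. Returns the number of runs. */
    int make_runs(const char *path)
    {
        FILE *in = fopen(path, "r");
        if (!in) return -1;
        int *buf = malloc(RUN_INTS * sizeof *buf);
        int run = 0;
        size_t n;
        do {
            for (n = 0; n < RUN_INTS && fscanf(in, "%d", &buf[n]) == 1; n++)
                ;
            if (n == 0) break;
            qsort(buf, n, sizeof *buf, cmp_int);
            char name[32];
            sprintf(name, "run%03d.tmp", run++);
            FILE *out = fopen(name, "w");
            for (size_t i = 0; i < n; i++)
                fprintf(out, "%d\n", buf[i]);
            fclose(out);
        } while (n == RUN_INTS);            /* a short read means EOF */
        free(buf);
        fclose(in);
        return run;
    }
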
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    How is this done in industry?
    It's done by using a Relational Database Management System such as SQL Server, Oracle etc. The data type would be a BLOB (Binary Large Object). Storing, retrieving and scanning the BLOB data would be done using ActiveX Data Objects (ADO). More specifically, AppendChunk would write the data to the BLOB and GetChunk would read the data from the BLOB.

    The equivalent can be done using MS Access, since the OLE Object type is a BLOB. This is a quick and dirty solution to your problem. I concur with matsp about presenting your proposed solution to your instructor for review prior to implementing it.

  7. #7
    Crazy Fool Perspective's Avatar
    Join Date
    Jan 2003
    Location
    Canada
    Posts
    2,640
    You don't need an RDBMS to do this. Salem has the right idea, you read chunks of the file in one at a time, sort it, then write it out to disk. Each chunk is typically called a "run". You then merge your runs into a final file. So long as you can buffer one item from each run in memory you can sort the file efficiently (otherwise you end up with a huge I/O overhead finding the max/min element at each merge step).
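
    The merge step, sketched to match the made-up run file names from the sketch a few posts up (an assumption): exactly one number per run is buffered in memory, and the smallest is emitted each time. A linear scan is used for clarity, with the run count capped at 64; with many runs a heap is the usual choice.
    Code:
    #include <stdio.h>

    int merge_runs(int nruns, FILE *out)
    {
        FILE *run[64];
        int head[64], alive[64];
        if (nruns > 64) return -1;          /* sketch limit */

        /* Buffer exactly one item from each run. */
        for (int i = 0; i < nruns; i++) {
            char name[32];
            sprintf(name, "run%03d.tmp", i);
            run[i] = fopen(name, "r");
            alive[i] = run[i] && fscanf(run[i], "%d", &head[i]) == 1;
        }
        for (;;) {
            int best = -1;                  /* find the smallest buffered item */
            for (int i = 0; i < nruns; i++)
                if (alive[i] && (best < 0 || head[i] < head[best]))
                    best = i;
            if (best < 0) break;            /* every run is exhausted */
            fprintf(out, "%d\n", head[best]);
            alive[best] = fscanf(run[best], "%d", &head[best]) == 1;
        }
        for (int i = 0; i < nruns; i++)
            if (run[i]) fclose(run[i]);
        return 0;
    }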

  8. #8
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    The "memory management for large file editors" tutorial is something that can be modified to do what you need it to do.

  9. #9
    Crazy Fool Perspective's Avatar
    Join Date
    Jan 2003
    Location
    Canada
    Posts
    2,640
    Here's what I'm talking about: http://en.wikipedia.org/wiki/External_sort

  10. #10
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    If the OP reviews the "memory management for large file editors" tutorial, he'll realize how simple it is to read these humongous files. If not, then he'll have to check out managing memory-mapped files with Win32.
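
    For the memory-mapped route, a minimal Win32 sketch (the file name is made up; error handling trimmed). Nothing is read until a page is touched, so the OS does the paging for you:
    Code:
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE file = CreateFileA("ikonos_scene.tif", GENERIC_READ,
                                  FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                  FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE) return 1;

        HANDLE map = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!map) { CloseHandle(file); return 1; }

        /* Map the whole file; a 500MB view fits in a 32-bit address
           space. For bigger files, map smaller views at offsets that
           are multiples of the allocation granularity (usually 64KB). */
        const unsigned char *view = MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
        if (view) {
            printf("first byte: 0x%02x\n", view[0]);  /* faults the page in */
            UnmapViewOfFile(view);
        }
        CloseHandle(map);
        CloseHandle(file);
        return 0;
    }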

  11. #11
    train spotter
    Join Date
    Aug 2001
    Location
    near a computer
    Posts
    3,868
    Quote Originally Posted by BobS0327 View Post
    The equivalent can be done using MS Access, since the OLE Object type is a BLOB. This is a quick and dirty solution to your problem. I concur with matsp about presenting your proposed solution to your instructor for review prior to implementing it.
    Access is a toy DB and would probably choke on a 500MB BLOB...


    BLOBs store the raw data; they don't process it into a usable format, so I'm not sure this satisfies the OP's requirements.

    Large / numerous BLOBs are not a good idea if you want to back the DB up often (in a reasonable timeframe).
    "Man alone suffers so excruciatingly in the world that he was compelled to invent laughter."
    Friedrich Nietzsche

    "I spent a lot of my money on booze, birds and fast cars......the rest I squandered."
    George Best

    "If you are going through hell....keep going."
    Winston Churchill

  12. #12
    Devil™
    Join Date
    Oct 2007
    Location
    IIT-Kharagpur, India
    Posts
    104
    Sorry for the late reply, I had exams.
    Well, I think I should have been clearer, so I'll explain my problem.

    I am trying to make a program that semi-automatically extracts roads, rivers, structures etc. from a high-resolution IKONOS satellite image using some algorithm.
    The satellite image is in TIFF format and is typically around 500MB.

    I didn't want to get my hands dirty digging into the file spec and writing my own imageload() etc.,
    so I used the default Windows libraries and free/open-source image manipulation libraries, with the program in C# on .NET 2.0.
    But they all ended the same way: the app stops responding - at least it didn't respond for ~8 hours, so I killed it.
    The same app works fine with a 10MB TIFF (cut from the original 500MB image).

    So I have no idea how this could be done to handle heavy files.

    I was wondering whether the libraries I used have some memory limit - all of them load the image into memory directly, so that could be the problem.

    The app has a GUI, so I need to read the file and store it in a form that makes the data easy to retrieve.
    The user clicks on the image, the program extracts the structure using the algorithm, builds a temporary map of it, and saves the map at the end in TIFF format.
    E.g.: if the user clicks on a road in the satellite image, the program scans the image to extend the road as far as it can find it, and writes a new TIFF file of the created map.

    So I need to handle two TIFF images of nearly equal size that must be easily accessible -
    something like the map browser in Google Maps.

    That's where I'm stuck: how can I possibly do this without getting everything into memory?
    I was thinking of some solution like paging control.
    It might be possible to control the virtual memory used by my app, so that I keep everything in VM and pull it back when I want with a simple array-like call.

    Reading an ASCII file is simple enough, but this file format seems complex, and I don't know whether all its data can be read by jumping to a particular section of the file.

    I can't use the SQL/RDBMS solution, as this app should be portable and light.
    Memory-mapped files seem promising - I will go through the docs given.

    I am the only one working on this, and I want to get it done as simply as possible - this is, after all, a training project.
    As for my trainer: he's no help at all, and the people there working on other projects are too busy figuring out how to make a program show an exit confirmation in VB. I can't expect anything from them either.

    Can anyone help me with this?
    C's Motto: who cares what it means? I just compile it!!

  13. #13
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Here are some options (option 2 is sketched below the list):
    1) Learn TIFF file format so you can read in and process reasonable sized chunks at a time.
    2) Find and use a TIFF library capable of handling large files.
    3) Find and use (or create) a TIFF image splitter to break down the image into multiple images which your .NET app can process.
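
    A minimal sketch of option 2 using libtiff, which reads one scanline at a time so only a single row of pixels is ever in memory. This assumes a strip-based rather than tiled TIFF, and the file name is made up:
    Code:
    #include <tiffio.h>

    int scan_image(void)
    {
        TIFF *tif = TIFFOpen("ikonos_scene.tif", "r");
        if (!tif) return -1;

        uint32 width = 0, height = 0;
        TIFFGetField(tif, TIFFTAG_IMAGEWIDTH,  &width);
        TIFFGetField(tif, TIFFTAG_IMAGELENGTH, &height);

        tdata_t buf = _TIFFmalloc(TIFFScanlineSize(tif));
        for (uint32 row = 0; row < height; row++) {
            TIFFReadScanline(tif, buf, row, 0);
            /* ...run the road/river extraction on this one row... */
        }
        _TIFFfree(buf);
        TIFFClose(tif);
        return 0;
    }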

    gg

  14. #14
    Devil™
    Join Date
    Oct 2007
    Location
    IIT-Kharagpur, India
    Posts
    104
    Quote Originally Posted by Codeplug View Post
    Here are some options:
    1) Learn TIFF file format so you can read in and process reasonable sized chunks at a time.
    2) Find and use a TIFF library capable of handling large files.
    3) Find and use (or create) a TIFF image splitter to break down the image into multiple images which your .NET app can process.

    gg
    Sounds good, but tough.
    Is there any easier way, at the cost of efficiency?

    One more question:
    can I manipulate the paging system for my program? i.e. force some pages to load/unload to memory based on my own requirements?
    C's Motto: who cares what it means? I just compile it!!

  15. #15
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> can I manipulate the paging system for my program?
    Sure.
    http://msdn2.microsoft.com/en-us/library/ms810613.aspx
    http://msdn2.microsoft.com/en-us/library/ms810627.aspx

    Although that implies that you'll be reading the TIFF yourself. So you'll have to come up to speed with the TIFF format and Win32 memory management.
    (2) and/or (3) would be easier.
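
    If you really do want to pin specific pages yourself, Win32's VirtualLock/VirtualUnlock is about as close as it gets - a tiny sketch. Treat it as a last resort: locking large ranges needs an enlarged working-set quota, and the mapped views in the articles above are usually enough.
    Code:
    #include <windows.h>

    void pin_window(void *addr, SIZE_T len)
    {
        VirtualLock(addr, len);     /* fault the pages in and keep them resident */
        /* ...work on the hot region... */
        VirtualUnlock(addr, len);   /* let the OS page them out again */
    }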

    gg
