Thread: Dealing with huge files

  1. #1
    Old Fashioned
    Join Date
    Nov 2016
    Posts
    137

    Question Dealing with huge files

    I've been getting into memory allocation lately. I wrote a couple of my own (basic) memory allocators and really enjoyed it.

    One topic of interest to me is handling large files. Many text/code editors out there, as well as other apps, have lackluster memory management. This is evident from the fact that, in the case of text editors, opening a 500MB text file, for example, may make the program hang or crash. In the case of other apps, Slack recently rewrote their chat app to consume FIFTY PERCENT LESS MEMORY with the exact same feature set. However, there are other text editors, such as 010 Editor, which are designed for large files.

    My question is, what considerations does a C programmer think about when designing a program to handle large file input? Say I wanted to make a text editor that could open files up to 10GB in size and have a smooth, seamless user interface and search feature. What would I need to do differently in this case versus designing a text editor that only needs to support 1-5MB files?

    My first thought for large files is using mmap(), but I also assume that the buffer size, chunk size, and so on are very influential in the performance of handling large amounts of data. In fact, malloc seems to use mmap() under the hood anyway, so I wonder if just using malloc would suffice, since malloc can technically allocate up to an exabyte or more on 64-bit systems.
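    Something like this is what I have in mind for the mmap() route (untested sketch, error handling trimmed, "huge.txt" is just a placeholder name):

    Code:
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("huge.txt", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; the OS pages it in lazily, so this doesn't
           pull 10GB into RAM up front. */
        char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Toy "search feature": count newlines to get a line count. */
        size_t lines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (data[i] == '\n')
                lines++;
        printf("%zu lines\n", lines);

        munmap(data, (size_t)st.st_size);
        close(fd);
        return 0;
    }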
    Last edited by Asymptotic; 08-18-2019 at 02:15 AM.
    If I was homeless and jobless, I would take my laptop to a wifi source and write C for fun all day. It's the same thing I enjoy now!

  2. #2
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,739
    Quote Originally Posted by Asymptotic View Post
    My question is, what considerations does a C programmer think about when designing a program to handle large file input? Say I wanted to make a text editor that could open files up to 10GB in size and have a smooth, seamless user interface and search feature. What would I need to do differently in this case versus designing a text editor that only needs to support 1-5MB files?
    First of all, loading the whole file into RAM is out of the question; you can't assume the user has 10GB of memory. Therefore you have to load the file in chunks, and keep unloading and reloading them as the user moves around. There are many things to consider. For example, if you make the chunks too large there will be huge lag spikes when loading/saving (possibly irritating the user), but if you make them too small even the act of scrolling around will cause many small lag spikes (probably irritating the user). And no, don't rely on paging; you'll make the whole system lag (definitely irritating the user).

    For the most part though, most text editors crash and burn not because of memory usage but because they do a lot of processing on that text, and those algorithms aren't very efficient (N² complexity or worse). For example, if an editor's approach to highlighting is to add metadata (like HTML tags) to the text, on a 500MB file it would have to do many, many, many string comparisons before displaying it.
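    To illustrate the chunking idea (a rough sketch, not a real editor buffer; the chunk size and struct are made up):

    Code:
    #include <stdio.h>

    #define CHUNK_SIZE (1 << 20)   /* 1 MiB: big enough to amortize the seek,
                                      small enough not to stall the UI */

    typedef struct {
        long offset;               /* where in the file this chunk starts */
        size_t len;                /* how many bytes are actually valid */
        char data[CHUNK_SIZE];
    } Chunk;

    /* Load the chunk containing byte `pos`; an editor would call this
       whenever the user scrolls somewhere that isn't loaded yet. */
    int load_chunk(FILE *f, long pos, Chunk *c)
    {
        c->offset = pos - (pos % CHUNK_SIZE);
        if (fseek(f, c->offset, SEEK_SET) != 0)
            return -1;
        c->len = fread(c->data, 1, CHUNK_SIZE, f);
        return 0;
    }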
    Devoted my life to programming...

  3. #3
    null pointer Structure's Avatar
    Join Date
    May 2019
    Posts
    338
    "without goto we would be wtf'd"

  4. #4
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,739
    Oh, and Structure just reminded me that in C, the FILE functions (ftell, fseek, etc.) return/take "long" as the file offset. Depending on your OS, "long" may be 32-bit even if your system is 64-bit.
    For instance, Windows uses LLP64 ("long long"s and pointers are 64-bit) while Linux uses LP64 (longs and pointers are 64-bit). So, if you want to be able to handle files larger than 2GB, you may have to use platform-specific functions.
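    A quick way to see which data model you're on (long prints as 4 on LLP64 Windows, 8 on LP64 Linux/macOS):

    Code:
    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(long)      = %zu\n", sizeof(long));
        printf("sizeof(long long) = %zu\n", sizeof(long long));
        printf("sizeof(void *)    = %zu\n", sizeof(void *));
        return 0;
    }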
    Devoted my life to programming...

  5. #5
    Old Fashioned
    Join Date
    Nov 2016
    Posts
    137
    @GReaper,

    Interesting. I suppose some concurrent programming, or at least using the cores efficiently, could potentially increase responsiveness as well for some of those user-mode tasks. But that also assumes that the user has a machine with a decent # of cores/threads.

    What I'm hearing from you is that once one is past finding the "sweet spot" of data accesses (reading chunks from memory/the file), the next biggest concern is all of the processing on that data. I immediately thought of, for example, a poorly-implemented Electron text editor vs. a fine-tuned editor written in C or C++. I suppose the Electron editor in most cases will be less performant due to garbage collection and interpreter overhead ("processing"), all other things being equal (assuming both authors use the same algorithms).
    If I was homeless and jobless, I would take my laptop to a wifi source and write C for fun all day. It's the same thing I enjoy now!

  6. #6
    Registered User
    Join Date
    May 2019
    Posts
    214
    My own favored solution to this problem is memory mapped file services. The interfaces differ on *Nix v Windows, but they are similar enough that a simple abstraction layer allows either OS to serve memory mapped files with a common application level interface.

    There are minor "gotchas", but they are not that difficult to deal with. In particular, mapping a file requires the file's size to match the mapped memory. When extending the file past the end, slightly more effort is required than merely writing out new data as with the standard file I/O functions, but it's not that much.

    Otherwise, however, the application treats the file as memory. It is rather efficient and powerful. Memory mapping allows the creation of views or windows between the file and RAM.
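    A very rough sketch of the kind of abstraction layer I mean (read-only mapping only, no error recovery; the function and struct names are made up):

    Code:
    #include <stddef.h>

    typedef struct {
        void  *addr;     /* start of the mapped view */
        size_t size;     /* length of the mapping */
    #ifdef _WIN32
        void  *file;     /* HANDLE from CreateFile */
        void  *mapping;  /* HANDLE from CreateFileMapping */
    #else
        int    fd;
    #endif
    } MappedFile;

    #ifdef _WIN32
    #include <windows.h>

    int map_file(const char *path, MappedFile *mf)
    {
        mf->file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (mf->file == INVALID_HANDLE_VALUE) return -1;

        LARGE_INTEGER sz;
        if (!GetFileSizeEx(mf->file, &sz)) return -1;
        mf->size = (size_t)sz.QuadPart;

        mf->mapping = CreateFileMappingA(mf->file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!mf->mapping) return -1;

        mf->addr = MapViewOfFile(mf->mapping, FILE_MAP_READ, 0, 0, 0);
        return mf->addr ? 0 : -1;
    }
    #else
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int map_file(const char *path, MappedFile *mf)
    {
        mf->fd = open(path, O_RDONLY);
        if (mf->fd < 0) return -1;

        struct stat st;
        if (fstat(mf->fd, &st) < 0) return -1;
        mf->size = (size_t)st.st_size;

        mf->addr = mmap(NULL, mf->size, PROT_READ, MAP_PRIVATE, mf->fd, 0);
        return mf->addr == MAP_FAILED ? -1 : 0;
    }
    #endif

    After this, the application reads the file contents through mf->addr exactly as if it were an ordinary array in memory.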

  7. #7
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    945
    Quote Originally Posted by GReaper View Post
    Oh, and Structure just reminded me that in C, the FILE functions (ftell, fseek, etc.) return/take "long" as the file offset. Depending on your OS, "long" may be 32-bit even if your system is 64-bit.
    For instance, Windows uses LLP64 ("long long"s and pointers are 64-bit) while Linux uses LP64 (longs and pointers are 64-bit). So, if you want to be able to handle files larger than 2GB, you may have to use platform-specific functions.
    fgetpos and fsetpos take a pointer to fpos_t (which should be a type that's large enough for a "large" file). These have been around since C89, so they're hardly platform-specific.

    There's also fseeko and ftello, which are defined on all reasonable platforms (which excludes Windows). Windows has _fseeki64 and _ftelli64 as equivalents.
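    Something like this is enough to paper over the difference (untested sketch; it assumes fseeko/ftello are visible, e.g. by compiling with -D_FILE_OFFSET_BITS=64 on Linux):

    Code:
    #include <stdio.h>
    #include <stdint.h>

    #ifdef _WIN32
    static int     seek64(FILE *f, int64_t off, int whence) { return _fseeki64(f, off, whence); }
    static int64_t tell64(FILE *f)                           { return _ftelli64(f); }
    #else
    static int     seek64(FILE *f, int64_t off, int whence) { return fseeko(f, (off_t)off, whence); }
    static int64_t tell64(FILE *f)                           { return (int64_t)ftello(f); }
    #endif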

  8. #8
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,739
    Quote Originally Posted by christop View Post
    fgetpos and fsetpos take a pointer to fpos_t (which should be a type that's large enough for a "large" file). These have been around since C89, so they're hardly platform-specific.
    I think you misunderstood the use of fgetpos/fsetpos. fpos_t isn't (or rather, may not be) a simple integer. The content of an fpos_t object is not meant to be read or written directly, but only to be used as an argument in a call to fsetpos. That means you can only fsetpos to locations you already got from fgetpos. As such, it's useless for random access to a file.
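    To make it concrete (trivial sketch; "data.bin" is just a placeholder):

    Code:
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("data.bin", "rb");
        if (!f) return 1;

        fpos_t start;
        fgetpos(f, &start);            /* remember a position we have already visited */

        char buf[128];
        fread(buf, 1, sizeof buf, f);  /* read forward a bit */

        fsetpos(f, &start);            /* legal: restore a saved position */
        /* There is no portable way to say "seek to byte 5,000,000,000"
           by constructing an fpos_t yourself. */

        fclose(f);
        return 0;
    }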
    Devoted my life to programming...

  9. #9
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    945
    Quote Originally Posted by GReaper View Post
    I think you misunderstood the use of fgetpos/fsetpos. fpos_t isn't (or rather, may not be) a simple integer. The content of an fpos_t object is not meant to be read or written directly, but only to be used as an argument in a call to fsetpos. That means you can only fsetpos to locations you already got from fgetpos. As such, it's useless for random access to a file.
    Yes, that's a fair point.

    I think it's a pity that there are no standard file positioning functions in C that take/return a 64-bit or wider integer type.

  10. #10
    Old Fashioned
    Join Date
    Nov 2016
    Posts
    137
    Yeah, what's the deal with that? Is there another update to C planned that could include something like this, or will we have to wait until 96-bit systems or whatever?
    If I was homeless and jobless, I would take my laptop to a wifi source and write C for fun all day. It's the same thing I enjoy now!
