Thread: Getting really, really big file sizes in C

  1. #1
    Registered User
    Join Date
    Jun 2009
    Posts
    486

    Getting really, really big file sizes in C

    I am trying to read enormous binary files (10-100GB) and parse their contents a bit at a time. As part of the process I need to get the size of the file in bytes. The simple solution
    Code:
    fseek(file,0,SEEK_END);
    size=ftell(file);
    fails because the file size overflows the long int type returned by ftell. I need a long long int.

    Is there a reasonably efficient way to do this? The good news is that it only needs to be done once. I suppose I could read it one character at a time until I hit the end and keep count, but that just seems inelegant...
    C is fun

  2. #2
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    have a look at this stackoverflow link
    What can this strange device be?
    When I touch it, it gives forth a sound
    It's got wires that vibrate and give music
    What can this thing be that I found?

  3. #3
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    I don't know why it follows that you need to get the size of the file if you are processing it in chunks, but anyway.

    I've not got my standard to hand, but if there isn't a standard longer version there is undoubtedly a system-specific equivalent you can use for larger files (eg ftello on *nix systems).
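
    As a rough sketch of the *nix route mentioned above, assuming a POSIX system where fseeko and ftello are available: the helper name stream_size is hypothetical, and the two feature macros are the usual way to request a 64-bit off_t and the POSIX declarations.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for the POSIX declarations (fseeko, ftello) */
#define _FILE_OFFSET_BITS 64     /* ask for a 64-bit off_t on 32-bit glibc targets */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: returns the size in bytes of an open, seekable stream,
   or -1 on error. The original file position is restored before returning. */
long long stream_size(FILE *fp)
{
    off_t start, end;

    assert(fp != NULL);
    start = ftello(fp);                      /* remember where we were */
    if (start == (off_t)-1)
        return -1;
    if (fseeko(fp, 0, SEEK_END) != 0)        /* jump to the end of the file */
        return -1;
    end = ftello(fp);                        /* offset of the end == size in bytes */
    if (fseeko(fp, start, SEEK_SET) != 0)    /* go back to the saved position */
        return -1;
    return (long long)end;
}
```

    The result is returned as long long so it can be printed with %lld regardless of how off_t is defined on the platform.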

  4. #4
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    I need the filesize for when processing the very last chunk - I need to make sure that I don't ask for too much at the end. Now that I say it like that, there's probably a more elegant way to avoid overflow.

    EDIT: fseeko64 and ftello64 apparently are what I was looking for.
    C is fun

  5. #5
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by KBriggs View Post
    I need the filesize for when processing the very last chunk - I need to make sure that I don't ask for too much at the end. Now that I say it like that, there's probably a more elegant way to avoid overflow.

    EDIT: fseeko64 and ftello64 apparently are what I was looking for.
    If you're reading from a file, and you ask for more than what's there, that's not an error. This is why fread or whatever you're using will tell you how many records were actually read.
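
    A minimal sketch of that idea, assuming a plain chunked read loop: the helper name consume_in_chunks and the 4096-byte buffer are arbitrary choices for illustration. The loop never needs to know the file size, because fread reports how much it actually got.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads a stream to the end in fixed-size chunks and
   returns the total number of bytes seen, or -1 if a read error occurred. */
long long consume_in_chunks(FILE *fp)
{
    unsigned char buf[4096];   /* chunk size is arbitrary */
    long long total = 0;
    size_t got;

    assert(fp != NULL);
    /* fread returns the number of items actually read, so the last chunk
       simply comes back shorter than the others. */
    while ((got = fread(buf, 1, sizeof buf, fp)) > 0) {
        /* process buf[0] .. buf[got - 1] here */
        total += (long long)got;
    }
    return ferror(fp) ? -1 : total;   /* tell a read error apart from EOF */
}
```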

  6. #6
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    I was trying to have a way to allocate exactly enough memory before running this:
    Code:
        union { //used to convert big-endian to little-endian
            double d;
            char   bytes[sizeof(double)];
        } ud;
        for (i = 0; i < maxlength*2; i++)
        {
            for (j = sizeof(double) - 1; j >= 0; j--)
            {
                fread(&ud.bytes[j], sizeof(char), 1, input); //switch the endian-ness
            }
            if (i%2 == 0) //discard the voltage
            {
                current[i/2] = ud.d;
            }
        }
        return current;
    which converts big-endian to little-endian and then discards half the input. I suppose I could just over-allocate and then return a count of successful reads instead.
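
    The "return a count of successful reads" variant could look roughly like the sketch below. It assumes, as the snippet above suggests, that the file is a sequence of (current, voltage) pairs of big-endian doubles and that the host is little-endian; the name read_currents and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads up to max_pairs (current, voltage) records,
   stores each byte-swapped current in out[], and returns how many were stored.
   A truncated record at the end of the file is ignored. */
size_t read_currents(FILE *input, double *out, size_t max_pairs)
{
    unsigned char rec[2 * sizeof(double)];   /* one current + one voltage */
    size_t n = 0;

    assert(input != NULL && out != NULL);
    while (n < max_pairs && fread(rec, sizeof rec, 1, input) == 1) {
        unsigned char swapped[sizeof(double)];
        size_t j;

        /* reverse the bytes of the first double only; the voltage is discarded */
        for (j = 0; j < sizeof(double); j++)
            swapped[j] = rec[sizeof(double) - 1 - j];
        memcpy(&out[n], swapped, sizeof(double));
        n++;
    }
    return n;
}
```

    Reading a whole record per fread call also avoids one library call per byte, and the caller can over-allocate out[] and trim it to the returned count afterwards.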
    Last edited by KBriggs; 12-05-2013 at 10:14 AM.
    C is fun

  7. #7
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    On windows, you can get the file size like this:
    Code:
    #include <stdio.h>
    #include <windows.h>
    
    /* Returns the file size in bytes, or -1 on failure.
       long is only 32 bits on Windows, so the result must be long long. */
    long long get_file_size(const char * filename) {
        LARGE_INTEGER largeInt;
        HANDLE hFile = CreateFileA(filename, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            return -1;
        if (!GetFileSizeEx(hFile, &largeInt))
            largeInt.QuadPart = -1;
        CloseHandle(hFile);
        return largeInt.QuadPart;
    }
    
    int main(void) {
        long long int size = get_file_size(
            "your_file_name.xyz");
        printf("%lld\n", size);
        return 0;
    }
    But it's better to process the file one number at a time, if possible.
    Read a number, convert it, write it, read the next, etc.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  8. #8
    Registered User
    Join Date
    Dec 2011
    Location
    Namib desert
    Posts
    94
    Read this from the GNU C-Library reference:
    14.9.2 Reading the Attributes of a File

    To examine the attributes of files, use the functions stat, fstat and lstat. They return the attribute information in a struct stat object. All three functions are declared in the header file sys/stat.h.
    — Function: int stat (const char *filename, struct stat *buf)
    The stat function returns information about the attributes of the file named by filename in the structure pointed to by buf.
    If filename is the name of a symbolic link, the attributes you get describe the file that the link points to. If the link points to a nonexistent file name, then stat fails reporting a nonexistent file.
    The return value is 0 if the operation is successful, or -1 on failure. In addition to the usual file name errors (see File Name Errors), the following errno error conditions are defined for this function:
    ENOENT: The file named by filename doesn't exist.
    When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is in fact stat64 since the LFS interface transparently replaces the normal implementation.
    — Function: int stat64 (const char *filename, struct stat64 *buf)
    This function is similar to stat but it is also able to work on files larger than 2^31 bytes on 32-bit systems. To be able to do this the result is stored in a variable of type struct stat64 to which buf must point.
    When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is available under the name stat and so transparently replaces the interface for small files on 32-bit

  9. #9
    Registered User
    Join Date
    Dec 2011
    Location
    Namib desert
    Posts
    94
    With stat() you can thus find the file size in bytes, and if you use the "low-level" read() function, you know exactly how many bytes you have read.

    Quote from GNU C-Library reference:
    13.2 Input and Output Primitives

    This section describes the functions for performing primitive input and output operations on file descriptors: read, write, and lseek. These functions are declared in the header file unistd.h.
    — Data Type: ssize_t
    This data type is used to represent the sizes of blocks that can be read or written in a single operation. It is similar to size_t, but must be a signed type.

    — Function: ssize_t read (int filedes, void *buffer, size_t size)
    The read function reads up to size bytes from the file with descriptor filedes, storing the results in the buffer. (This is not necessarily a character string, and no terminating null character is added.)
    If the return value of read() is 0, you've reached EOF. If it is less than 0, an error has occurred, and you have to check the value of errno to find out what actually happened.
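
    A hedged sketch of how those return values are typically handled, assuming a POSIX system: the helper name read_full is hypothetical. It keeps calling read() until the requested amount has arrived, the file ends, or a real error occurs.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for the POSIX declarations */
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: reads up to len bytes from fd into buf.
   Returns the number of bytes read (less than len only at end of file),
   or -1 on error with errno set by read(). */
ssize_t read_full(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n == 0)
            break;                  /* 0 means end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: just retry */
            return -1;              /* real error: errno tells what happened */
        }
        done += (size_t)n;          /* read() may return fewer bytes than asked */
    }
    return (ssize_t)done;
}
```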

  10. #10
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    For a portable library you can use APR and use the file functions to get the file size. Here is an example program

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <inttypes.h>
    #include "apr.h"
    #include "apr_file_io.h"
    #include "apr_file_info.h"
    
    int main()
    {
        apr_initialize();
        apr_file_t *file;
        apr_finfo_t finfo;
        apr_status_t rv;
        apr_pool_t *pool;
        if (apr_pool_create(&pool, NULL) != APR_SUCCESS) {
            exit(1);
        }
        /* APR calls return APR_SUCCESS (0) on success and an error status otherwise */
        if ((rv = apr_file_open(&file, "/foo/bar.txt", APR_FOPEN_READ, APR_OS_DEFAULT, pool)) != APR_SUCCESS) {
            exit(1);
        }
        if ((rv = apr_file_info_get(&finfo, APR_FINFO_NORM, file)) != APR_SUCCESS) {
            exit(1);
        }
    
        printf("File size in bytes: %"PRId64"\n", (int64_t)finfo.size);
        apr_file_close(file);
        apr_terminate();
        return 0;
    }

  11. #11
    Registered User
    Join Date
    Dec 2011
    Location
    Namib desert
    Posts
    94
    Why use external libraries when there is a wealth of standard C functions?

  12. #12
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Quote Originally Posted by ddutch View Post
    Why use external libraries when there is a wealth of standard C functions?
    What standard are you referring to?
    stat, stat64 and read are not standard C.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  13. #13
    Registered User
    Join Date
    Dec 2011
    Location
    Namib desert
    Posts
    94
    @oogabooga & C99tutorial: Oops, you're right; the stat family is not part of ANSI C. I'm so used to Linux ...
