Thread: Some strange behavior with file position

  1. #1
    Registered User
    Join Date
    Jun 2009
    Posts
    486

    Some strange behavior with file position

    I use the following function to read doubles from a binary file:

    Code:
    int64_t read_current(FILE *input, double *current, uint64_t position, uint64_t length)
    {
        printf("asked for %" PRIu64" samples\n",length);
        printf("reading from %" PRIu64 " to %" PRIu64 "\n",position,position+length);
        union{
            double d;
            uint64_t i;
        }ud;
    
        uint64_t i;
        int64_t read = 0;
    
        if (fseeko64(input,position*2*sizeof(double),SEEK_SET))
        {
            return 0;
        }
    
    
        for (i = 0; i < length; i++)
        {
            read += fscanf(input,"%8c%*8c",(char *)&ud.d);
            swapByteOrder(&ud.i);
            current[i] = ud.d;
        }
        printf("read %" PRId64 "\n",read);
        return read;
    }
    Simple enough. As long as position+length doesn't exceed the end of the file, all is well. However, if it does, weird things start to happen. In particular, the apparent length of the file (that is, the value of "read" after the function has finished) becomes dependent on "length".

    Is there some undefined behavior here? I know that "position" is never past the end of the file, so I am not fseeking to a nonexistent place and then asking it to read...
    C is fun

  2. #2
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    You shouldn't use formatted operations to process binary data.

    As to your question, why wouldn't that be the case? You've done nothing here to stop the loop from trying to "read" past the end of the file.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  3. #3
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    fseeko64 (and fseeko) are not part of the C standard. They do appear to be POSIX extensions, however. Googling for the documentation turned up this link, among others: File Positioning - The GNU C Library. Note the fseeko64 section:
    Function: int fseeko64 (FILE *stream, off64_t offset, int whence)
    Preliminary: | MT-Safe | AS-Unsafe corrupt | AC-Unsafe lock corrupt | See POSIX Safety Concepts.


    This function is similar to fseeko with the only difference that the offset parameter is of type off64_t. This also requires that the stream stream was opened using either fopen64, freopen64, or tmpfile64 since otherwise the underlying file operations to position the file pointer beyond the 2^31 bytes limit might fail.


    If the sources are compiled with _FILE_OFFSET_BITS == 64 on a 32 bits machine this function is available under the name fseeko and so transparently replaces the old interface.
    So a few things here. First, you should be using an off64_t type, not uint64_t. They are very likely the same, but no guarantee AFAIK. Also, you need to make sure to open with fopen64 et al. It would help if you told us the rough size of the file you're dealing with and any other relevant information. If you're beyond the 2^31 limit, it could definitely be trouble.

    Check the return code of all file related function calls and print a useful error message (e.g. perror or strerror+errno) if it fails (e.g. line 15). I'm suspicious of your seeking to position*2*sizeof(double). Also, if you do read += fscanf(), you can't properly check that fscanf succeeded. You should check the return value for success/fail, and add to read only if it succeeds. If it fails, error message and probably return immediately from the function.
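The point about `read += fscanf()` can be sketched as follows. This is a minimal illustration, not the original poster's code: the two function names are hypothetical, and both simply count 16-byte records using the same `"%8c%*8c"` format as the first post. Once the stream is at end of file, `fscanf` returns `EOF` (a negative value), so the unchecked version subtracts from its total on every extra iteration, which would make the result depend on `length`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: counts records, stopping at the first failed read. */
int64_t count_records_checked(FILE *input, uint64_t length)
{
    char buf[8];
    int64_t nread = 0;

    assert(input != NULL);
    for (uint64_t i = 0; i < length; i++)
    {
        /* fscanf returns the number of assignments made: 1 on success here */
        if (fscanf(input, "%8c%*8c", buf) != 1)
        {
            break;
        }
        nread++;
    }
    return nread;
}

/* Hypothetical helper: mirrors the "read += fscanf(...)" pattern from the
   first post. At end of file each call returns EOF, which is negative. */
int64_t count_records_unchecked(FILE *input, uint64_t length)
{
    char buf[8];
    int64_t nread = 0;

    assert(input != NULL);
    for (uint64_t i = 0; i < length; i++)
    {
        nread += fscanf(input, "%8c%*8c", buf);
    }
    return nread;
}
```

With a file holding three records and `length` set to 5, the checked version reports 3, while the unchecked one reports 3 plus two `EOF` values.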

  4. #4
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    Thanks,

    I realized that what was probably happening was that I was adding EOF to read repeatedly past the end of the file, but I guess there are more problems than that, even. In particular, I am not using fopen64 at the moment. I will implement all that tomorrow morning at work and get back to you.

    The files could be up to 100GB of 8 byte doubles (occurring in pairs, one of which I care about, hence the arguments of fscanf). So while most of the time it won't be past the 2^31 limit, there's no guarantee it won't be. In the present example I am using to debug, however, it is only about 1GB.
    C is fun

  5. #5
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Declare an array of 2 doubles and use fread to read 2 doubles into that array. Call swapByteOrder() on the first double and assign the result to current[i]. Ignore the second double. Repeat for length number of entries. You can then drop your union declaration. Note that the offset passed to fseeko64 is still a byte count, so the position*2*sizeof(double) scaling stays.
    Last edited by anduril462; 05-06-2014 at 04:08 PM.
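A rough sketch of the fread approach described above, under some assumptions: `swapByteOrder()` is never shown in the thread, so `swap_bytes_u64` below is a hypothetical stand-in, `read_first_of_pair` is likewise an invented name, and a `double` is assumed to be 8 bytes. It copies the bytes with `memcpy` rather than casting a `double *` to a `uint64_t *`, which sidesteps aliasing concerns.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the thread's swapByteOrder():
   reverses the 8 bytes of a 64-bit value. */
static uint64_t swap_bytes_u64(uint64_t v)
{
    v = ((v & 0x00000000FFFFFFFFULL) << 32) | (v >> 32);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    v = ((v & 0x00FF00FF00FF00FFULL) << 8) | ((v >> 8) & 0x00FF00FF00FF00FFULL);
    return v;
}

/* Reads one 16-byte record (a pair of doubles) and stores the byte-swapped
   first double in *out; the second double is ignored.
   Returns 1 on success, 0 on end of file or error. */
int read_first_of_pair(FILE *input, double *out)
{
    unsigned char record[2 * sizeof(double)];
    uint64_t bits;

    assert(input != NULL && out != NULL);
    if (fread(record, sizeof record, 1, input) != 1)
    {
        return 0;
    }
    memcpy(&bits, record, sizeof bits);   /* first 8 bytes of the pair */
    bits = swap_bytes_u64(bits);
    memcpy(out, &bits, sizeof *out);
    return 1;
}
```

Calling `read_first_of_pair` in a loop and stopping at the first 0 gives a count of complete records, with no negative values mixed into the total.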

  6. #6
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    would I need fseeko64 at all, then? if the file pointer is advanced by fread then it should be reflected in the caller as well, correct?

    Is there any problem using the regular fopen to open big files if I am not seeking?
    C is fun

  7. #7
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Quote: Originally posted by KBriggs
    would I need fseeko64 at all, then? if the file pointer is advanced by fread then it should be reflected in the caller as well, correct?
    Correct. If you only ever process the file sequentially, you don't need to seek. However, if you ever jump around elsewhere or reset to the beginning, etc, then you need to seek to the right place.
    Quote: Originally posted by KBriggs
    Is there any problem using the regular fopen to open big files if I am not seeking?
    I believe it depends on your implementation. You may need to #define some feature test macros to affect this. Read the man pages and documentation.

  8. #8
    Registered User
    Join Date
    May 2010
    Posts
    4,632
    You probably only need the "64" versions if you are compiling/using a 32-bit program on a 64-bit operating system. But again, you should check your operating system's and compiler's documentation to be sure.

    Jim

  9. #9
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    According to the documentation I need the 64 versions if I want to apply an offset larger than a 32-bit signed int can hold, which I do need to do elsewhere, even if I can get around it here by using fread. But the points in this thread are well taken; I will implement and test all of them tomorrow.
    C is fun

  10. #10
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    According to the documentation I need the 64 versions if I want to apply an offset larger than a 32-bit signed int can hold.
    O_o

    That may be true, but the `fopen64' and friends version is not standard.

    As has been said, consider using the `_FILE_OFFSET_BITS' macro instead if your system supports it. The macro effectively aliases your client code to interfaces compatible with both the common 32-bit and 64-bit versions. So, you can rather easily support "Mac OS X", "Windows", and "Linux", built as 32-bit or 64-bit code, so long as the builder has a replacement layer.

    Writing your code with that in mind allows you to more transparently support platforms with `fopen64' and friends while also supporting systems where `fopen' and friends have native 64 bit support. For most intents and purposes, you'll just use the versions you already know letting the layer--such as `_FILE_OFFSET_BITS'--do whatever it does to provide 64bit support.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  11. #11
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    I'm not sure I understand. I had a quick read through some documentation for that macro, but I'm not clear. Can I #define _FILE_OFFSET_BITS 64 in main.c and then simply use the usual file functions, or do I have to compile with -D_FILE_OFFSET_BITS=64? Do they accomplish the same thing?
    C is fun

  12. #12
    Registered User
    Join Date
    Jun 2011
    Posts
    4,513
    This is outside of my expertise, but according to this, either should work, with the former being recommended.

    You should define these macros by using ‘#define’ preprocessor directives at the top of your source code files. These directives must come before any #include of a system header file. It is best to make them the very first thing in the file, preceded only by comments. You could also use the ‘-D’ option to GCC, but it's better if you make the source files indicate their own meaning in a self-contained way.
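As a hedged sketch of what the quoted advice looks like in practice: the macro goes before every system header, and a compile-time check confirms it took effect. `seek_to_record` is a hypothetical helper and the 16-byte record size is taken from the pairs of doubles discussed in this thread. The `_POSIX_C_SOURCE` line is an added assumption so that `fseeko` and `off_t` are declared even under a strict `-std=c11` build.

```c
/* Feature test macros must come before any system header. */
#define _FILE_OFFSET_BITS 64
#define _POSIX_C_SOURCE 200809L  /* assumption: requests the POSIX fseeko declaration */

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* The build fails here if off_t is still 32 bits wide, i.e. if the macro
   had no effect on this platform or was defined too late. */
_Static_assert(sizeof(off_t) >= 8, "off_t is not 64-bit");

/* Hypothetical helper: positions the stream at the start of the given
   16-byte record using plain fseeko. Returns 0 on success, -1 on failure. */
int seek_to_record(FILE *input, off_t record_index)
{
    assert(input != NULL);
    return fseeko(input, record_index * (off_t)16, SEEK_SET);
}
```

If the static assertion fires, the platform's headers are probably not honoring the macro, and the explicit 64-bit interfaces may be the only option there.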

  13. #13
    Registered User
    Join Date
    Jun 2009
    Posts
    486
    So I added #define _FILE_OFFSET_BITS 64 to the top of all my .h and .c files (probably overkill, but where's the harm?). Is there any way to tell if it's actually doing anything?

    The code itself was modified as follows:

    Code:
    int64_t read_current(FILE *input, double *current, uint64_t position, uint64_t length)
    {
        uint64_t test;
        double iv[2];
    
        uint64_t i;
        int64_t read = 0;
    
        if (fseek(input,position*2*sizeof(double),SEEK_SET))
        {
            return 0;
        }
    
    
        for (i = 0; i < length; i++)
        {
            test = fread(iv, sizeof(double), 2, input);
            if (test == 2)
            {
                read++;
                swapByteOrder((uint64_t *) &iv[0]);
                current[i] = iv[0];
            }
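            else
            {
                /* fread does not set errno at end of file, so only call perror on a real error */
                if (ferror(input))
                    perror("fread");
                else
                    fprintf(stderr, "End of file reached\n");
                break;
            }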
    
        }
        printf("read %" PRId64 "\n",read);
        return read;
    }
    I have to say thanks to anduril for the suggestion to use fread instead of fscanf - the program is literally an order of magnitude faster now. Crazy.

    I realized that the way things are set up I can't actually get around the use of fseek() - the chunks of data I read might overlap slightly and so I do need to reposition the file pointer on each call. I was wondering about the use of off_t vs uint64_t - is it enough to take my uint64_t and cast it to an off_t, or do I need to do something else? eg

    Code:
    if (fseek(input,(off_t) (position*2*sizeof(double)),SEEK_SET))
    EDIT: I did a
    Code:
    printf("%zu\n",sizeof(off_t));
    which printed 4, whereas off64_t printed 8, so I think the feature test macro is not doing its job as implemented... So for now I will stick with the 64 versions explicitly, since portability is not something I will be needing for the time being.
    Last edited by KBriggs; 05-07-2014 at 08:15 AM.
    C is fun
