Thread: fopening and fseeking: text or binary file?

  1. #1
    Registered User
    Join Date
    Mar 2008
    Posts
    82

    fopening and fseeking: text or binary file?

    Now I just use "r" when I fopen a file, 'cos I have most of my datafiles in text.

    When I started to use fseek, I was reading that it is better to deal with files in binary, the reason being that the terminal encoding can mess up your text file as it gets read into the C program (I'm a bit hazy on it, but the whole UTF-8 and Unicode thing, which I've never been able to grasp, may have something to do with it).

    Anyhow, if it's true about fseek's preference for binary files, well should I start converting all my many datafiles over to binary? Or maybe I'm not understanding the whole thing properly.

    Any guidance appreciated. Many thanks.

  2. #2
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    It depends on what function you use to write each data item or struct. If you use fprintf(), then text mode should be fine. If you use fwrite(), then you need to use binary mode.
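
    A minimal sketch of that distinction, assuming a hypothetical `struct reading` record and helper names that are not from this thread: formatted output via fprintf() pairs with text mode, raw struct output via fwrite() pairs with binary mode.

```c
#include <stdio.h>

/* Hypothetical record type, used only for this illustration. */
struct reading {
    int id;
    double value;
};

/* Formatted text output: open with "w" (text mode). */
int save_text(const char *path, const struct reading *r)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    int ok = fprintf(fp, "%d %.17g\n", r->id, r->value) > 0;
    return (fclose(fp) == 0 && ok) ? 0 : -1;
}

/* Read back what save_text() wrote. */
int load_text(const char *path, struct reading *r)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    int ok = fscanf(fp, "%d %lf", &r->id, &r->value) == 2;
    fclose(fp);
    return ok ? 0 : -1;
}

/* Raw bytes of the struct: open with "wb" so nothing is translated. */
int save_binary(const char *path, const struct reading *r)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    int ok = fwrite(r, sizeof *r, 1, fp) == 1;
    return (fclose(fp) == 0 && ok) ? 0 : -1;
}

/* Read back what save_binary() wrote. */
int load_binary(const char *path, struct reading *r)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    int ok = fread(r, sizeof *r, 1, fp) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

    The text file stays readable in an editor and portable between machines; the binary file is only meaningful to a program built with the same struct layout.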

  3. #3
    int x = *((int *) NULL); Cactus_Hugger
    Join Date
    Jul 2003
    Location
    Banks of the River Styx
    Posts
    902
    Just be careful what you do with text files. On Windows, a newline is stored as two bytes ("\r\n"), but a file opened in text mode on Windows will only see a "\n". This can make reading Unix files on Windows, or Windows files on Unix, a bit interesting. (You should convert them. Things like FTP do this mostly automatically.)
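
    One common way to cope with Windows files read on Unix, sketched here under the assumption that lines come from fgets(); the helper name `strip_line_ending` is made up for this example.

```c
#include <string.h>

/* Remove a trailing "\n" or "\r\n" in place and return the new length.
 * A lone trailing '\r' is dropped as well. */
size_t strip_line_ending(char *line)
{
    size_t len = strlen(line);

    if (len > 0 && line[len - 1] == '\n')
        line[--len] = '\0';
    if (len > 0 && line[len - 1] == '\r')
        line[--len] = '\0';
    return len;
}
```

    Calling this on every line after fgets() gives the same result whichever newline convention the file was written with.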

    EDIT: (Had an example of something here with fseek(), so if you saw it...) Oops, my example, and apparently my knowledge, was flawed. Seems to work as expected. Might have to look into this.

    UTF-8 should probably be read in text mode. I don't think UTF-8 enforces whether a newline is "\n" or "\r\n".

    Generally, I don't find myself fseeking in text that much.
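
    For anyone who does need to seek in a text file: the C standard only guarantees fseek() on a text stream when the offset is 0 or a value previously returned by ftell(), used with SEEK_SET. A rough sketch of that pattern follows; the function names are invented for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Record the ftell() position of the start of each line, up to max lines.
 * Returns the number of line starts stored in offsets. */
int index_lines(FILE *fp, long *offsets, int max)
{
    char chunk[256];
    int count = 0;
    int at_line_start = 1;

    rewind(fp);
    for (;;) {
        long pos = ftell(fp);
        if (pos < 0 || fgets(chunk, sizeof chunk, fp) == NULL)
            break;
        if (at_line_start && count < max)
            offsets[count++] = pos;
        /* A chunk without '\n' means the line continues in the next chunk. */
        at_line_start = strchr(chunk, '\n') != NULL;
    }
    return count;
}

/* Jump back to a recorded offset and read that line into buf.
 * Returns 0 on success, -1 on failure. */
int read_indexed_line(FILE *fp, long offset, char *buf, int size)
{
    /* Only SEEK_SET with an ftell() value (or 0) is portable in text mode. */
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    return fgets(buf, size, fp) != NULL ? 0 : -1;
}
```

    Computing offsets by hand (say, line length times line number) is what tends to break in text mode, because a "\n" may occupy two bytes on disk.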
    Last edited by Cactus_Hugger; 07-18-2008 at 07:11 PM.

  4. #4
    Registered User
    Join Date
    Mar 2008
    Posts
    82
    Generally, I don't find myself fseeking in text that much.
    Well, for operating on only parts of a file (which match a certain pattern) how far can you go with C? That's kind of Perl territory there (yeh, ok, Python and Ruby too).

    I thought the fundamental problem is that if you're dealing with a large number of sometimes dubious datafiles which might have a foreign language header, you could get certain characters being represented by two or more bytes, though they may appear to be a single character in the text.

    I was surmising that binary mode ensures that every character is represented by one byte.

    Any comments about how in the dark / in the light I am, welcome.

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote: Originally posted by stabu
    Well, for operating on only parts of a file (which match a certain pattern) how far can you go with C? That's kind of Perl territory there (yeh, ok, Python and Ruby too).
    All of which are written in C, so I would say that anything you can do in Perl, Python, Ruby, PHP, etc, can be done in C too - you just have to write a bit more code to do it.
    I thought the fundamental problem is that if you're dealing with a large number of sometimes dubious datafiles which might have a foreign language header, you could get certain characters being represented by two or more bytes, though they may appear to be a single character in the text.

    I was surmising that binary mode ensures that every character is represented by one byte.

    Any comments about how in the dark / in the light I am, welcome.
    Multi-byte characters [for want of a better word for it] still take up multiple "character positions", however, so the effect is exactly the same in binary and in text mode: they are a single character only when displayed on the screen.
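
    To make the byte-versus-character difference concrete, here is a small sketch that counts UTF-8 code points by skipping continuation bytes. It assumes the input is valid UTF-8, and the function name is made up for this example.

```c
#include <stddef.h>

/* Count code points in a UTF-8 string. Continuation bytes have the bit
 * pattern 10xxxxxx, so every byte that does not match starts a new
 * code point. Assumes valid UTF-8 input. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

    For the word "naïve" encoded as UTF-8 ("na\xC3\xAF" "ve" in C source), strlen() reports 6 bytes while this function reports 5 code points, and that holds whether the file was opened with "r" or "rb".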

    The only places where text versus binary mode makes a difference are newlines and end-of-file markers, and not all OS's have a specific end-of-file marker for text files. [The latter goes back to CP/M and other ancient OS's, where the file size was stored in blocks rather than bytes (to save space), so the only way to know where a text file of arbitrary length ended was to put a marker at that position, typically CTRL-Z. All modern OS's that I know of store the number of bytes actually used in the file, so a special end-of-file character no longer serves a purpose.]

    --
    Mats

  6. #6
    Registered User
    Join Date
    Mar 2008
    Posts
    82
    thanks for the history Mats!

    all modern OS's that I know of store the number of bytes actually used in the file
    Is that not the filesystem? What do you mean by "in the file"? I have been fseeking to the end of the file to find its size, but is there another way?
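
    For reference, the seek-to-end approach mentioned here might look like the sketch below (the function name is illustrative). Opening with "rb" matters: in text mode the value from ftell() is not guaranteed to be a byte count. Note also that the C standard does not strictly require binary streams to support SEEK_END, although mainstream platforms do, and that ftell() returns a long, so very large files may not fit.

```c
#include <stdio.h>

/* Return the size of a file in bytes, or -1 on failure. */
long file_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long size = -1;

    if (fp == NULL)
        return -1;
    if (fseek(fp, 0L, SEEK_END) == 0)
        size = ftell(fp);   /* byte offset of the end of the file */
    fclose(fp);
    return size;
}
```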

  7. #7
    Registered User
    Join Date
    Mar 2006
    Posts
    725
    Yes, but you will have to use APIs outside standard C, like POSIX, the Windows API, glib, etc. In any case, if you're dealing with UTF-8, you've probably already decided which platforms you are going to code for.
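
    As one illustration of the POSIX route, a sketch using stat(), which asks the filesystem for the size without opening the file. This is POSIX-only and will not build as-is with a plain Windows toolchain; the wrapper name is made up.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Return the size in bytes recorded by the filesystem, or -1 on failure.
 * POSIX-only: stat() is not part of standard C. */
long long file_size_posix(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}
```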
