Determining if a file is binary or text

This is a discussion on Determining if a file is binary or text within the C++ Programming forums, part of the General Programming Boards category.

  1. #1
    Banned nickname_changed's Avatar
    Join Date
    Feb 2003
    Location
    Australia
    Posts
    986

    Determining if a file is binary or text

    Hey,

    Are there any functions I could use to determine if a file is text or binary? At the moment in my program, if I don't recognise the extension I automatically open the file as text, but is there a routine I can call to tell me if a file is binary or text? I would prefer a standard call, but anything Windows specific would be good too. Thanks for any information on this topic.

  2. #2
    Comment your source code! Lynux-Penguin's Avatar
    Join Date
    Apr 2002
    Posts
    533
    It's OS specific; basically you would have to scan the header of the file and see what type of file it is.
    The executable information (if it's binary) should be there.

    Or if you don't want the hassle...
    you could read the first, oh let's say 1028 chars into an array and test what proportion of them are alphanumeric versus non-alphanumeric; if the proportion is too far out of bounds, assume that the file is binary. I do not personally know of a function to do this. If you're using Linux there is a stat function that gives specifics on the file and will tell you if it's an executable, but other than that I don't know of any function.
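
    A rough sketch of that proportion test, for illustration only. The helper names ratio_says_binary and read_head, the choice to count whitespace as text, and the 0.70 cutoff are all assumptions made up for this example, not anything standard; the cutoff would need tuning against real files.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: returns true when too few of the sampled bytes are
// alphanumeric or whitespace. The 0.70 default is an arbitrary starting point.
bool ratio_says_binary(const std::string& sample, double min_text_ratio = 0.70)
{
    if (sample.empty())
        return false;  // nothing to judge, so treat an empty file as text
    std::size_t texty = 0;
    for (unsigned char c : sample)
        if (std::isalnum(c) || std::isspace(c))
            ++texty;
    return static_cast<double>(texty) / sample.size() < min_text_ratio;
}

// Hypothetical helper: reads up to `limit` bytes from the start of a file.
// Returns an empty string if the file cannot be opened.
std::string read_head(const std::string& path, std::size_t limit = 1028)
{
    std::ifstream in(path, std::ios::binary);
    std::string buf(limit, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(limit));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}
```

    Something like ratio_says_binary(read_head("somefile")) would then give a guess, and only a guess, about the file.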

    -LC
    Asking the right question is sometimes more important than knowing the answer.
    Please read the FAQ
    C Reference Card (A MUST!)
    Pointers and Memory
    The Essentials
    CString lib

  3. #3
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    A file doesn't have binary/text info attached. Even a UNIX executable (as reported by stat) could either be a compiled app (binary) or a shell script (text).

    The only feasible way is as Lynux suggested: scan the first 100 or more bytes and test the ratio of printable to non-printable characters. If, say, there are more than 5 zero bytes among the first 100 bytes, you've likely got a binary file.

    Of course there are mixed files, even binary files can contain sections of plain text (like the string tables in executables).
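
    The zero-byte rule of thumb could be sketched roughly like this. The function name nul_count_says_binary is made up for the example, and the defaults (100 bytes sampled, more than 5 zero bytes) are just the numbers from the post above, not tested thresholds.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: looks at up to `sample_size` bytes of the stream and
// returns true as soon as more than `max_nuls` of them are zero bytes.
bool nul_count_says_binary(std::istream& in,
                           std::size_t sample_size = 100,
                           std::size_t max_nuls = 5)
{
    std::size_t nuls = 0;
    char c;
    for (std::size_t i = 0; i < sample_size && in.get(c); ++i)
        if (c == '\0' && ++nuls > max_nuls)
            return true;
    return false;
}

// Convenience overload for bytes that are already in memory.
bool nul_count_says_binary(const std::string& bytes)
{
    std::istringstream in(bytes);
    return nul_count_says_binary(in);
}
```

    For a file, pass a std::ifstream opened with std::ios::binary. Keep in mind this is only a heuristic: UTF-16 encoded text, for example, is full of zero bytes and would be flagged as binary.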
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  4. #4
    Banned nickname_changed's Avatar
    Join Date
    Feb 2003
    Location
    Australia
    Posts
    986
    Ok, thanks very much guys. Looks like I'll just have to keep a list of what's binary, and send everything as text by default.

  5. #5
    Registered User
    Join Date
    Jan 2003
    Posts
    311
    On Unix you would probably rely on the file utility. A good resource for file formats is wotsit. There is only so much you can do with a completely unknown file format, but when in doubt I recommend opening in binary. The big difference here is the possibility of embedded ^Z (^D in Unix) EOF chars. \r\n nastiness is also a problem.
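
    A minimal sketch of what "opening in binary" looks like with standard streams. The helper name read_raw is made up for the example. With std::ios::binary the stream does no newline translation, so \r\n pairs and any embedded ^Z byte come back exactly as they are stored on disk.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the file's bytes exactly as stored.
// An unopenable file simply yields an empty string in this sketch.
std::string read_raw(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

    Reading the whole file into memory is only reasonable for small files; for large ones you would read in chunks instead.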

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    I agree. Do it like FTP apps and use binary by default.

  7. #7
    Comment your source code! Lynux-Penguin's Avatar
    Join Date
    Apr 2002
    Posts
    533
    I said stat just because if it is an executable file and the right permissions are set (a whole lot of assumptions here) you could 'cheat' your way through it.

    If you want a PRO program to do it, you would have to learn the header formats for ELF, EXE, A.OUT etc. for the operating systems you care about, build those formats into your program, and then when you open a file, check whether its header matches any of them. If it does, you know it's a binary AND you know what kind of executable it is. HOWEVER, not all binary files are executable, so for everything else you could scan for characters that you would very rarely find in a plain-text file. You can never be completely sure; you can only test and assume based on your tests. I could open a TXT file in a hex editor and start writing random bits and characters, and based on the tests it might come up as a binary file. Or worse (this actually happened to me), when you're working with a multi-byte character language such as Japanese and you scan a file, the ODD-AS-HELL characters would most likely make it come up as a binary file.
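
    The header-matching idea could be sketched as a small lookup table. The function name identify_by_magic and the table layout are made up for this example. Only two signatures are listed, the ELF magic (0x7f 'E' 'L' 'F') and the "MZ" that starts DOS/Windows EXE files; a.out is left out because its magic values differ between systems, and a real tool would need a much longer list.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: compares the start of `head` against a few known
// signatures. Returns a short format name, or an empty string if none match.
std::string identify_by_magic(const std::string& head)
{
    struct Magic { const char* bytes; std::size_t len; const char* name; };
    static const Magic table[] = {
        { "\x7f" "ELF", 4, "ELF" },
        { "MZ",         2, "DOS/Windows EXE" },
    };
    for (const Magic& m : table)
        if (head.size() >= m.len && head.compare(0, m.len, m.bytes, m.len) == 0)
            return m.name;
    return "";
}
```

    An empty result only means "not one of the listed formats"; it says nothing about whether the file is text.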

    So the real question is how far you are going to test, what kind of tests you are going to perform, and how sure you want to be. I still think the best bet is to do an alphanum to non-alphanum proportion test; it's just simpler that way.

    -LC

  8. #8
    Comment your source code! Lynux-Penguin's Avatar
    Join Date
    Apr 2002
    Posts
    533
    Oh yeah, one last IMPORTANT and HELPFUL note: binary executable files have a magic number that can also help to identify them... this is VERY important
    EX:
    A file's first 4 bytes hold a "magic number," identifying the file as an ELF object file: 0x7f followed by the characters 'E', 'L', 'F'.

    EX:
    Code:
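    #include <cstring>
    #include <fstream>
    #include <iostream>
    using namespace std;

    int main()
    {
      // ELF magic number: 0x7f followed by the letters 'E', 'L', 'F'
      const char ELF[4] = {0x7f, 'E', 'L', 'F'};
      char base[4];
      // open in binary mode so the bytes are read exactly as stored
      ifstream in("filename", ios::binary);
      // read the first 4 bytes and compare them to the magic number
      if (in.read(base, 4) && !memcmp(base, ELF, 4))
        cout << "The file is an ELF Executable file" << endl;
    }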
    -LC
    Last edited by Lynux-Penguin; 08-05-2003 at 01:42 PM.
