Thread: Determining whether a file is text or binary

  1. #1
    Registered User
    Join Date
    Mar 2002
    Posts
    9

    Determining whether a file is text or binary

    I'm taking a compiler design course now, and the first project is to write a rudimentary lexer. The next part of the project will involve using lex, so this program is mainly here to build an appreciation for how easy lex makes things.

    Anyway, I am trying to add a feature (it is not required, nor is it extra credit; I just want it there) that will allow the program to distinguish between text files and binary files. I should be able to parse a file that contains C++ code, for example; I should not be able to parse a compiled C++ program. How do I go about doing this?

  2. #2
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Just examine each byte and determine whether or not it's a printable character (isprint() comes to mind).
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  3. #3
    Registered User
    Join Date
    Mar 2002
    Posts
    9
    Thanks ... I figured there might be a built-in way to test whether a file is binary or text, but I probably figured wrong.

  4. #4
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    In general, no. But you can gather a lot by looking at other features such as the size of the file or the file extension, or by using the brute-force method I mentioned of testing each byte (this is not nearly as bad as it sounds; just a few bytes into a binary file you generally turn up values that would not be found in a text file). Also, you can test the file for things like '{', '}', '=', etc. (characters you encounter in a source file).



    Mod edit: removed code tags from text

  5. #5
    Registered User
    Join Date
    Mar 2002
    Posts
    9
    Originally posted by Sebastiani

    In general, no. But you can gather a lot by looking at other features such as the size of the file or the file extension, or by using the brute-force method I mentioned of testing each byte (this is not nearly as bad as it sounds; just a few bytes into a binary file you generally turn up values that would not be found in a text file). Also, you can test the file for things like '{', '}', '=', etc. (characters you encounter in a source file).
    File size ... well, that could be anything, and even executables can be tiny.

    File extension ... it's on a Solaris 8 system, and executables generally don't have extensions on Unix boxen.

    I think brute-forcing it was probably the best path. Thanks again

  6. #6
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Another useful tool is the stat structure/function pair (see sys/stat.h). The nice thing about that is that you can find out a lot about the file without even opening it.
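    A rough sketch of what stat() can report without opening the file, assuming a POSIX system. The helper names is_regular_file and file_size are made up for illustration. Note that stat() describes the file's type, size, and permissions; it says nothing about whether the contents are text.

```cpp
#include <sys/stat.h>
#include <string>

// Hypothetical helper: true if `path` names an existing regular file
// (not a directory, device, socket, ...). POSIX only.
bool is_regular_file(const std::string& path)
{
    struct stat info;
    if (stat(path.c_str(), &info) != 0)
        return false;               // stat failed: missing file, no permission, ...
    return S_ISREG(info.st_mode);
}

// Hypothetical helper: size in bytes, or -1 if stat() fails.
long long file_size(const std::string& path)
{
    struct stat info;
    if (stat(path.c_str(), &info) != 0)
        return -1;
    return static_cast<long long>(info.st_size);
}
```

    A lexer could use something like this to refuse directories and empty files before it ever reads a byte.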

  7. #7
    Hardware Engineer
    Join Date
    Sep 2001
    Posts
    1,398

    The problem is...

    ...That a text file is a binary file with particular characteristics. You could do all of your file I/O without using the language features that make it easier to store and retrieve text-formatted "data".

    All files are "binary". To us C++ programmers, "binary" just means that the I/O library isn't going to do any translation for you. And with text files, the library is mostly just converting between '\n' and the platform's line ending (carriage return and/or line feed).
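    To inspect a file's raw bytes, one option is to open it in binary mode so that no line-ending translation happens. A minimal sketch follows; read_prefix and its limit parameter are hypothetical names, not anything from the posts above.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: read up to `limit` raw bytes from the start of a file.
// std::ios::binary turns off line-ending translation, so the bytes arrive
// exactly as stored. Returns an empty string if the file cannot be opened.
std::string read_prefix(const std::string& path, std::size_t limit)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        return std::string();
    std::string buffer(limit, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(limit));
    // gcount() is the number of bytes actually read, which may be < limit.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

    On Unix the two modes behave the same; the flag matters on systems such as Windows, where text mode rewrites "\r\n" as "\n".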

    As you probably already know, many (most?) file-formats will have a header that can be parsed to determine the format. Or, at least the header will give you a hint. Plain text files don't have a header.

    ...WOW!!! Compiler Design! Advanced Shtuff!

  8. #8
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    If you're so blessed as to be using Unix or Linux, then try the file command:
    Code:
    $ file hello.c
    hello.c: ASCII C program text
    $ file a.out
    a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), not stripped
    $ file hello.o
    hello.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
    Short of reading the whole file and checking every byte, the best you can do is get a probable answer by checking say the first 256 bytes.
    For a text file, you would expect all the characters to be printable, and at least one newline within the first 256 chars.
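    The 256-byte heuristic could be sketched roughly like this. The name probably_text is hypothetical, and treating tabs and other whitespace as acceptable is an assumption on top of what the post says.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: inspect at most the first 256 bytes and call the
// data text if every byte is printable or whitespace and at least one
// newline appears in the sample.
bool probably_text(const std::string& bytes)
{
    const std::size_t sample = std::min<std::size_t>(bytes.size(), 256);
    bool saw_newline = false;
    for (std::size_t i = 0; i < sample; ++i)
    {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c == '\n')
            saw_newline = true;
        else if (!std::isprint(c) && !std::isspace(c))
            return false;
    }
    return saw_newline;
}
```

    One limitation of the newline requirement: a short one-line file with no trailing newline is rejected, so the answer is only ever "probably".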
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  9. #9
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Salem --

    This assumes your text file is ASCII only. I frequently have text files that contain unusual bytes because they are parts of multibyte characters (usually in Shift-JIS).
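    One way to tolerate multibyte text, sketched under the assumption that the encoding keeps control codes out of its multibyte sequences (true of Shift-JIS and UTF-8, not of UTF-16, which is full of NUL bytes). The helper name is hypothetical. Instead of demanding printable ASCII, it rejects only the bytes text almost never contains.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reject only NUL, DEL, and control codes other than
// tab, LF, VT, FF, and CR. Bytes >= 0x80 are accepted because they may be
// part of a multibyte character.
bool has_no_binary_control_bytes(const std::string& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        bool allowed_control = (c >= '\t' && c <= '\r');   // 0x09..0x0D
        if ((c < 0x20 && !allowed_control) || c == 0x7F)
            return false;
    }
    return true;
}
```

    This is looser than an isprint() test, so it will also accept some binary data that happens to avoid control bytes.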
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  10. #10
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    You can scan for any pattern of byte(s). The whole point is that each file type generally uses a certain range of values (though some, like executables, use more of this 'bandwidth'). The first 256 bytes of data would probably provide enough of a sample to identify almost any file. Also, many file formats begin with a short signature (for .exe files [PE], it's the ASCII string 'MZ') that helps you identify the type.
