Thread: How to check file encoding ?

  1. #1
    Registered User
    Join Date
    Dec 2007
    Posts
    30

    How to check file encoding ?

    Hi,

    I made a program that reads a text file and analyses it (line by line with fgets in a loop; the main string functions used are strlen, strtok, strchr, strcpy, strcmp). When the input file is encoded in Unicode, it crashes at some point while reading the file. Interestingly, it doesn't crash if the input file is in UTF-8 encoding.

    How can I check if the file encoding is ANSI as expected ?

    Best wishes,
    Desmond

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    It's strange that it crashes. I could expect it to "behave strangely", but it shouldn't crash. Can you post the code, perhaps?

    It appears that the text starts with the bytes FF FE (the byte order mark) if it's little-endian Unicode UCS-2 [16-bit Unicode].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Registered User
    Join Date
    Dec 2007
    Posts
    30
    Quote Originally Posted by matsp View Post
    It's strange that it crashes. I could expect it to "behave strangely", but it shouldn't crash. Can you post the code, perhaps?
    Of course. It's quite lengthy, so I'll post a link to where I uploaded it instead:

    http://www.andresraieste.pri.ee/ttu/...for2/CPeatus.c

    Beware, though: the code is commented in my native language, Estonian.

    An example of input file it's supposed to eat:
    http://www.andresraieste.pri.ee/ttu/...peatusdemo.txt

    And executable (compiled under MinGW on Vista 64bit):
    http://www.andresraieste.pri.ee/ttu/...r2/CPeatus.exe

    The program is supposed to analyze a certain text file with keys and values (given as times) and do some calculations with those.

    I have no idea why it crashes. What do you think?

    Best wishes, Desmond
    Last edited by desmond5; 03-10-2008 at 10:33 AM.

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Not sure.

    --
    Mats

  5. #5
    Registered User
    Join Date
    Dec 2007
    Posts
    30
    Quote Originally Posted by matsp View Post
    Not sure.

    --
    Mats
    I am trying to reproduce the error by cutting out stuff from the program that aren't related to it. Maybe that will help.

  6. #6
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by desmond5 View Post
    I am trying to reproduce the error by cutting out stuff from the program that aren't related to it. Maybe that will help.
    Always a good plan. I didn't spend many seconds looking at it. I can just about GUESS what the different names mean, but the comments could just as well be written in invisible ink for all the use I have for them. I'm sure I could figure it out, but it's a bit harder when you don't get the language...

    --
    Mats

  7. #7
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by desmond5 View Post
    Hi,

    I made a program that reads a text file and analyses it (line by line with fgets in a loop; the main string functions used are strlen, strtok, strchr, strcpy, strcmp). When the input file is encoded in Unicode, it crashes at some point while reading the file. Interestingly, it doesn't crash if the input file is in UTF-8 encoding.
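    You're probably banging into null bytes. UTF-8 is specifically designed never to use the byte value 0 (except to encode the NUL character itself), because zero bytes would break C string handling. 16-bit Unicode makes no such promise: every ASCII character in it is stored with a zero byte next to it.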

    How can I check if the file encoding is ANSI as expected ?
    Impossible to tell.

  8. #8
    Registered User
    Join Date
    Dec 2007
    Posts
    30
    Quote Originally Posted by matsp View Post
    I'm sure I could figure it out, but it's a bit harder when you don't get the language...
    No question about that. I shrank the program a lot and uploaded the source and the input file that produces the error here: http://www.andresraieste.pri.ee/ttu/IABB/err/

    Quote Originally Posted by brewbuck View Post
    You're probably banging into null bytes. UTF-8 is specifically designed not to use the value 0, because it would break C string handling.
    So... what would the workaround be? Do I need to check whether there are null bytes in the input and fail if there are any? There has to be some workaround.

  9. #9
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by desmond5 View Post
    So... what would the workaround be? Do I need to check whether there are null bytes in the input and fail if there are any? There has to be some workaround.
    The solution is to use actual wide character processing functions.

  10. #10
    Registered User
    Join Date
    Dec 2007
    Posts
    30
    Quote Originally Posted by brewbuck View Post
    Impossible to tell.
    How about checking the byte order mark? Is that doable?

  11. #11
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by desmond5 View Post
    How about checking the byte order mark? Is that doable?
    If you know there is a byte order mark, then you've already decided that it's Unicode, right?

    The file could be in some encoding from Jupiter and it just happens to start with a few bytes that look like the Unicode byte order mark.

    I think that if you see a byte order mark, the best thing is probably to assume it's Unicode, just be aware that it's not proof of anything. The only way to know for sure what the encoding is, is for somebody to tell you what it is.

  12. #12
    Registered User
    Join Date
    Dec 2007
    Posts
    30
    I just wanted to produce a correct error message instead of the program crashing. How should I proceed with that, then ?

  13. #13
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by desmond5 View Post
    I just wanted to produce a correct error message instead of the program crashing. How should I proceed with that, then ?
    Well, the crash is a separate issue. Even if you wrongly guess the encoding, the presence of a zero byte shouldn't cause a crash. But since I haven't seen your code, I can't really guess what's actually wrong.

    I'd go ahead and assume that if you see a Unicode byte order marker, then the encoding is Unicode -- otherwise assume ASCII. But you still need to be safe against the crash in either case. So first debug that issue and don't pollute your thinking with the Unicode vs. ASCII stuff.

  14. #14
    Registered User
    Join Date
    Aug 2006
    Posts
    100
    Also see http://blogs.msdn.com/oldnewthing/ar.../24/95235.aspx
    and http://blogs.msdn.com/oldnewthing/ar...7/2158334.aspx
    for some of the problems with text file encoding markers.
