Thread: Couple of questions about ASCII

  1. #31
    Codeplug (Registered User)
    Join Date
    Mar 2003
    Posts
    4,981
    Actually, it looks like UTF-8 without a BOM:
    ȗ֗җƗנטקס

    gg

  2. #32
    Registered User
    Join Date
    May 2013
    Posts
    228
    Well, the file content is:

    קובץ טקסט

    Which means "text file" in Hebrew... (the first three characters in your string are not Hebrew characters)

  3. #33
    whiteflags (Lurking)
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    One thing I hate doing is playing guess the encoding. If you want to see what Hebrew looks like in UTF-8, use a good text editor and save in that particular encoding on purpose. Many editors support multiple encodings -- if you never paid attention before, God only knows what you are actually using. I can only promise you that the dump pictured does not match the UTF-8 encoding.

  4. #34
    Codeplug (Registered User)
    Join Date
    Mar 2003
    Posts
    4,981
    hexdump messed me up by showing every 16-bit word byte-swapped. The first UTF-8 character is 0xd7 0xa7 - Unicode Character 'HEBREW LETTER QOF' (U+05E7),
    which is the last character shown since the text is right-to-left.
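
    For anyone following along, here is a rough sketch of the byte-swapping effect. It assumes the file holds exactly the text quoted earlier with no trailing newline, and it only imitates the default 16-bit word display of hexdump on a little-endian machine; it does not call hexdump itself.

```python
import struct
import unicodedata

text = "קובץ טקסט"            # file content quoted earlier in the thread (assumed, no newline)
data = text.encode("utf-8")

# Bytes in file order, the way a byte-wise dump lists them.
in_order = " ".join(f"{b:02x}" for b in data)

# Imitate a 16-bit word dump on a little-endian host: each pair of bytes is
# read as one number, so the two bytes of every word appear swapped.
padded = data + b"\x00" * (len(data) % 2)      # pad an odd-length input
words = struct.unpack(f"<{len(padded) // 2}H", padded)
swapped = " ".join(f"{w:04x}" for w in words)

first_char = data[:2].decode("utf-8")          # the first two bytes form one character

print(in_order)
print(swapped)
print(f"U+{ord(first_char):04X}", unicodedata.name(first_char))
```

    The first line of output starts with d7 a7, while the word-wise view starts with a7d7, which is easy to misread as a different byte sequence.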

    gg

  5. #35
    Registered User
    Join Date
    May 2013
    Posts
    228
    Quote Originally Posted by whiteflags
    One thing I hate doing is playing guess the encoding. If you want to see what Hebrew looks like in UTF-8, use a good text editor and save in that particular encoding on purpose. Many editors support multiple encodings -- if you never paid attention before, God only knows what you are actually using. I can only promise you that the dump pictured does not match the UTF-8 encoding.
    I'm not trying to play games.
    I was just curious about why the BOM is missing, and thought it might have something to do with what you wrote.

    No BOM - no Unicode. Or they just assume that a random text stream without a BOM is actually UTF-8 (which is true on Linux systems).
    That's it.

  6. #36
    Registered User
    Join Date
    May 2013
    Posts
    228
    Quote Originally Posted by Codeplug
    hexdump messed me up by showing every 16-bit word byte-swapped. The first UTF-8 character is 0xd7 0xa7 - Unicode Character 'HEBREW LETTER QOF' (U+05E7),
    which is the last character shown since the text is right-to-left.

    gg
    Great, thanks.

  7. #37
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    956
    Text encoded in UTF-8 does not need a Byte Order Mark (BOM) because it is read 8 bits at a time, so there is no issue of little endian vs big endian.
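
    A small sketch of that point, using the QOF character from earlier in the thread: UTF-16 has two possible byte orders for the same code unit, while UTF-8 has exactly one byte sequence.

```python
ch = "\u05e7"  # HEBREW LETTER QOF

utf8 = ch.encode("utf-8")          # d7 a7, the same on every machine
utf16_le = ch.encode("utf-16-le")  # e7 05, low byte first
utf16_be = ch.encode("utf-16-be")  # 05 e7, high byte first

# Reading little-endian bytes with the wrong byte order yields a different
# code point, which is the ambiguity a UTF-16 BOM resolves.
misread = utf16_le.decode("utf-16-be")

print(utf8.hex(" "), utf16_le.hex(" "), utf16_be.hex(" "))
print(f"misread as U+{ord(misread):04X}")
```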

  8. #38
    Codeplug (Registered User)
    Join Date
    Mar 2003
    Posts
    4,981
    BOMs are helpful to other operating systems that may need to process the file (as they may not default to UTF-8; Windows, for example, does not) - and to other *nix systems that don't have UTF-8 as their default locale (for whatever reason).
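
    As an illustration of what a reader has to do with such a file, here is a minimal sketch in Python. The sample text is the one from this thread; the variable names are only for illustration.

```python
import codecs

text = "קובץ טקסט"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")   # file bytes with a UTF-8 signature

plain = with_bom.decode("utf-8")          # plain "utf-8" keeps U+FEFF as the first character
stripped = with_bom.decode("utf-8-sig")   # "utf-8-sig" drops a leading signature if present
no_bom = text.encode("utf-8").decode("utf-8-sig")   # and also accepts input without one

print(repr(plain[0]), stripped == text, no_bom == text)
```

    A program that decodes with plain "utf-8" and does not expect the signature ends up with an invisible U+FEFF at the start of its data.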

    gg

  9. #39
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    956
    Quote Originally Posted by Codeplug
    BOMs are helpful to other operating systems that may need to process the file (as they may not default to UTF-8; Windows, for example, does not) - and to other *nix systems that don't have UTF-8 as their default locale (for whatever reason).
    True, but that is using the BOM in a way that was not intended (which seems to happen with a lot of things in the Windows world!).

  10. #40
    Codeplug (Registered User)
    Join Date
    Mar 2003
    Posts
    4,981
    >> but that is using the BOM in a way that it was not intended
    ? I only know of a single intended use for a UTF-8 BOM, and that is to mark the file as being UTF-8 encoded.

    gg

  11. #41
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    956
    Quote Originally Posted by Codeplug
    >> but that is using the BOM in a way that it was not intended
    ? I only know of a single intended use for a UTF-8 BOM, and that is to mark the file as being UTF-8 encoded.
    I think I'm in the wrong here. The Unicode site itself even states that the BOM is used as a signature:
    Q: When a BOM is used, is it only in 16-bit Unicode text?

    A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.
    (Unicode FAQ)

    So using a BOM to indicate that a file is encoded in UTF-8 is very much its "intended" purpose.
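
    To make the FAQ answer concrete, here is a hedged sketch that encodes U+FEFF in each form and uses the result as a signature. The helper name sniff_bom is made up for this example, and the check is only a heuristic: UTF-16-LE text that begins with a BOM followed by U+0000 would be reported as UTF-32-LE.

```python
BOM = "\ufeff"

# The signature for each encoding form is simply U+FEFF encoded in that form.
signatures = {
    name: BOM.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    # Try longer signatures first: the UTF-32-LE one (ff fe 00 00) begins
    # with the UTF-16-LE one (ff fe).
    for name, sig in sorted(signatures.items(), key=lambda kv: len(kv[1]), reverse=True):
        if data.startswith(sig):
            return name
    return None

for name, sig in signatures.items():
    print(f"{name:10} {sig.hex(' ')}")
```

    For example, sniff_bom(b"\xef\xbb\xbfabc") gives "utf-8", and a file with no signature gives None, which leaves the reader guessing as discussed above.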

  12. #42
    Registered User
    Join Date
    May 2013
    Posts
    228
    Quote Originally Posted by christop
    Text encoded in UTF-8 does not need a Byte Order Mark (BOM) because it is read 8 bits at a time, so there is no issue of little endian vs big endian.
    Wait, I know that endianness is not relevant for 1-byte data, but I thought that UTF-8 is a variable-length encoding, that's what it says here...

  13. #43
    Codeplug (Registered User)
    Join Date
    Mar 2003
    Posts
    4,981
    It is still just a series of ordered bytes, so no endianness issues.
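
    A minimal decoder sketch shows why: the lead byte says how many bytes the character takes, and the following bytes are consumed strictly in file order. The function name is made up for this example, and it skips checks a real decoder needs (overlong forms, surrogates, values above U+10FFFF).

```python
def decode_utf8_char(data, pos=0):
    """Decode one code point starting at data[pos]; return (code_point, next_pos)."""
    lead = data[pos]
    if lead < 0x80:                      # 0xxxxxxx: single byte (ASCII)
        return lead, pos + 1
    if lead >> 5 == 0b110:               # 110xxxxx: two-byte sequence
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:            # 1110xxxx: three-byte sequence
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:           # 11110xxx: four-byte sequence
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[pos + 1:pos + length]:
        if b >> 6 != 0b10:               # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)      # append six payload bits, in order
    return cp, pos + length

cp, nxt = decode_utf8_char(b"\xd7\xa7")
print(f"U+{cp:04X}", nxt)
```

    Nothing here depends on how the machine lays out multi-byte integers, only on the order of bytes in the stream.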

    gg
