Couple of questions about ASCII

**Codeplug** · 08-31-2015

Actually, it looks like UTF8 without BOM:
ȗ֗җƗנטקס

gg

**Absurd** · 08-31-2015

Well, the file content is:

קובץ טקסט

Which means "text file" in Hebrew... (the first three characters in your string are not Hebrew characters)

**whiteflags** · 08-31-2015

One thing I hate doing is playing guess-the-encoding. If you want to see what Hebrew looks like in UTF-8, use a good text editor and save in that particular encoding on purpose. Many editors support multiple encodings -- if you never paid attention before, God only knows what you are actually using. I can only promise you that the dump pictured does not match the UTF-8 encoding.

**Codeplug** · 08-31-2015

hexdump messed me up with every word being byte-swapped. The first UTF-8 character is 0xd7 0xa7 - Unicode Character 'HEBREW LETTER QOF' (U+05E7)
Which is the last character shown since it is right-to-left.

gg
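
For anyone following along, here is a minimal Python sketch of what seems to have happened. It assumes the file holds the UTF-8 bytes of the text quoted earlier and that the dump displayed the data as 16-bit little-endian words; `swap_pairs` is a hypothetical helper written for this illustration, not part of hexdump.

```python
# The file content from earlier in the thread, written with escapes:
# qof, vav, bet, final tsadi, space, tet, qof, samekh, tet.
text = "\u05e7\u05d5\u05d1\u05e5 \u05d8\u05e7\u05e1\u05d8"
data = text.encode("utf-8")


def swap_pairs(b: bytes) -> bytes:
    """Swap each adjacent pair of bytes, mimicking a 16-bit word dump
    on a little-endian machine. A trailing odd byte is left alone."""
    out = bytearray(b)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return bytes(out)


first_char = data[:2]       # bytes as stored in the file: d7 a7
swapped = swap_pairs(data)  # bytes in the order a word dump shows them

print(first_char.hex())     # d7a7
print(swapped[:2].hex())    # a7d7
print(hex(ord(first_char.decode("utf-8"))))  # 0x5e7
```

Reading the swapped order as if it were the file order is what produces the non-Hebrew characters; a byte-wise dump (for example `hexdump -C`) avoids the problem.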

**Absurd** · 08-31-2015

Originally Posted by whiteflags

One thing I hate doing is playing guess-the-encoding. If you want to see what Hebrew looks like in UTF-8, use a good text editor and save in that particular encoding on purpose. Many editors support multiple encodings -- if you never paid attention before, God only knows what you are actually using. I can only promise you that the dump pictured does not match the UTF-8 encoding.

I'm not trying to play games.
I was just curious about why the BOM is missing, and thought it might have something to do with what you wrote.

No BOM - no Unicode. Or they just assume that a random text stream without a BOM is actually UTF-8 (which is true on Linux).

That's it.

**Absurd** · 08-31-2015

Originally Posted by Codeplug

hexdump messed me up with every word being byte-swapped. The first UTF-8 character is 0xd7 0xa7 - Unicode Character 'HEBREW LETTER QOF' (U+05E7)
Which is the last character shown since it is right-to-left.

gg

Great, thanks.

**christop** · 08-31-2015

Text encoded in UTF-8 does not need a Byte Order Mark (BOM) because it is read 8 bits at a time, so there is no issue of little endian vs big endian.
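
A small Python sketch of the difference, using HEBREW LETTER QOF (U+05E7) as the sample character. The variable names are only for illustration.

```python
qof = "\u05e7"  # HEBREW LETTER QOF

# UTF-16 stores one 16-bit code unit here, so the two bytes can come
# in either order and the reader has to know which one was used.
utf16_le = qof.encode("utf-16-le")  # b'\xe7\x05'
utf16_be = qof.encode("utf-16-be")  # b'\x05\xe7'

# UTF-8 defines the byte sequence itself, so there is only one form.
utf8 = qof.encode("utf-8")          # b'\xd7\xa7'

print(utf16_le.hex(), utf16_be.hex(), utf8.hex())  # e705 05e7 d7a7
```

The two UTF-16 results are byte-reversed copies of each other, which is exactly the ambiguity a byte order mark resolves. UTF-8 has no such second form.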

**Codeplug** · 08-31-2015

BOMs are helpful to other operating systems that may need to process the file (as they may not default to UTF-8, like Windows) - and to other *nixes that don't have UTF-8 as their default locale (for whatever reason).

gg
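
As a rough illustration of the signature use, Python's standard `utf-8-sig` codec writes the three-byte UTF-8 BOM on encode and strips it on decode if it is present. This is only a sketch of how one tool handles it; other tools and platforms may behave differently.

```python
import codecs

text = "\u05e7\u05d5\u05d1\u05e5"  # sample Hebrew text

with_bom = text.encode("utf-8-sig")  # BOM followed by the UTF-8 bytes
without_bom = text.encode("utf-8")   # plain UTF-8, no signature

print(with_bom[:3].hex())  # efbbbf
print(with_bom == codecs.BOM_UTF8 + without_bom)  # True

# Decoding with "utf-8-sig" accepts both forms and drops the BOM.
print(with_bom.decode("utf-8-sig") == text)     # True
print(without_bom.decode("utf-8-sig") == text)  # True

# Decoding with plain "utf-8" keeps the BOM as a leading U+FEFF.
print(with_bom.decode("utf-8") == "\ufeff" + text)  # True
```

The last line shows why a BOM can surprise programs that do not expect one: it survives as an invisible first character.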

**christop** · 08-31-2015

Originally Posted by Codeplug

BOMs are helpful to other operating systems that may need to process the file (as they may not default to UTF-8, like Windows) - and to other *nixes that don't have UTF-8 as their default locale (for whatever reason).

True, but that is using the BOM in a way that it was not intended (which seems to happen with a lot of things in the Windows world!).

**Codeplug** · 08-31-2015

>> but that is using the BOM in a way that it was not intended
? I only know of a single intended use for a UTF8 BOM, and that is to mark the file as being UTF8 encoded.

gg

**christop** · 08-31-2015

Originally Posted by Codeplug

>> but that is using the BOM in a way that it was not intended
? I only know of a single intended use for a UTF8 BOM, and that is to mark the file as being UTF8 encoded.

I think I'm in the wrong here. The Unicode site itself even states that the BOM is used as a signature:

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.

(Unicode FAQ)

So using a BOM to indicate that a file is encoded in UTF-8 is very much its "intended" purpose.
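
To make the FAQ answer concrete, here is a hedged Python sketch of BOM sniffing. `sniff_bom` and `_BOMS` are hypothetical names made up for this example; the `codecs.BOM_*` constants are from the standard library and are just U+FEFF in each transformation format.

```python
import codecs

# Longer BOMs are checked first because the UTF-32 LE BOM (ff fe 00 00)
# begins with the UTF-16 LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# The BOM really is U+FEFF run through each encoding.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)         # True
print("\ufeff".encode("utf-16-le") == codecs.BOM_UTF16_LE)  # True

print(sniff_bom("\ufeffhello".encode("utf-8")))  # utf-8
print(sniff_bom(b"plain ASCII"))                 # None
```

A missing BOM proves nothing either way, and UTF-16 LE text that starts with a BOM followed by U+0000 would be misread as UTF-32 LE, so this is a heuristic rather than a guarantee.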

**Absurd** · 09-01-2015

Originally Posted by christop

Text encoded in UTF-8 does not need a Byte Order Mark (BOM) because it is read 8 bits at a time, so there is no issue of little endian vs big endian.

Wait, I know that endianness is not relevant for 1-byte data, but I thought that UTF-8 is a variable-length encoding, that's what it says here...

**Codeplug** · 09-01-2015

It is still just a series of ordered bytes, so no endianness issues.

gg
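
A minimal Python sketch of why variable length does not bring endianness back: each byte announces its own role (lead or continuation), so the order is fixed by the encoding rather than by the machine. `decode_two_byte` is a hypothetical helper that handles only the two-byte form and does not reject overlong encodings.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) to a code point."""
    if lead & 0xE0 != 0xC0:
        raise ValueError("not a two-byte lead byte")
    if cont & 0xC0 != 0x80:
        raise ValueError("not a continuation byte")
    # Five payload bits from the lead byte, six from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


print(hex(decode_two_byte(0xD7, 0xA7)))  # 0x5e7

# With the bytes swapped, 0xA7 is a continuation byte and cannot start
# a sequence, so the swapped order is simply invalid UTF-8.
try:
    decode_two_byte(0xA7, 0xD7)
except ValueError as err:
    print("invalid:", err)
```

The result matches the built-in decoder: `bytes([0xD7, 0xA7]).decode("utf-8")` gives U+05E7.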
