Actually, it looks like UTF8 without BOM:
ȗ֗җƗנטקס
Well, the file content is:
קובץ טקסט
Which means "text file" in Hebrew... (the first three characters in your string are not Hebrew characters)
One thing I hate doing is playing guess-the-encoding. If you want to see what Hebrew looks like in UTF-8, use a good text editor and save in that encoding on purpose. Many editors support multiple encodings; if you never paid attention before, God only knows what you are actually using. I can only promise you that the dump pictured does not match the UTF-8 encoding.
hexdump messed me up by showing every 16-bit word byte-swapped. The first UTF-8 character is 0xd7 0xa7: Unicode Character 'HEBREW LETTER QOF' (U+05E7),
which is the last character shown, since the text is right-to-left.
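To check that byte pair without relying on hexdump, here is a minimal Python sketch. It assumes the file holds exactly the string quoted above (written with escapes so the source does not depend on your editor's encoding) and that hexdump was grouping bytes into little-endian 16-bit words, which is my guess at what happened.

```python
import unicodedata

# "text file" in Hebrew, spelled with escapes in logical (typing) order.
text = "\u05e7\u05d5\u05d1\u05e5 \u05d8\u05e7\u05e1\u05d8"
data = text.encode("utf-8")

# The first character occupies the first two bytes of the UTF-8 stream.
first_bytes = data[:2]
first_char = first_bytes.decode("utf-8")
print(" ".join(f"{b:02x}" for b in first_bytes))
print(f"U+{ord(first_char):04X}", unicodedata.name(first_char))

# Reading the same two bytes as one little-endian 16-bit word
# swaps their apparent order in the printed hex.
swapped = f"{int.from_bytes(first_bytes, 'little'):04x}"
print(swapped)
```

The bytes on disk are always d7 then a7; only the word-oriented display swaps them. A byte-oriented dump (for example `hexdump -C`) avoids the confusion.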
I'm not trying to play games.
I was just curious about why the BOM is missing, and thought it might have something to do with what you wrote.
That's it. No BOM, no Unicode. Or they just assume that a random text stream without a BOM is actually UTF-8 (which is true on Linux).
Text encoded in UTF-8 does not need a Byte Order Mark (BOM) because it is read 8 bits at a time, so there is no issue of little endian vs big endian.
BOMs are helpful to other operating systems that may need to process the file (as they may not default to UTF-8, like Windows), and to other *nix systems that don't have UTF-8 as their default locale (for whatever reason).
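If you do want a BOM for the benefit of such systems, here is a rough sketch of how it could be added and tolerated in Python, using the standard `utf-8-sig` codec. The sample string is just the file content quoted earlier in the thread.

```python
import codecs

text = "\u05e7\u05d5\u05d1\u05e5 \u05d8\u05e7\u05e1\u05d8"

# "utf-8-sig" prepends the three-byte signature when encoding.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")
print(with_bom[:3] == codecs.BOM_UTF8)

# When decoding, "utf-8-sig" strips the signature if present
# and behaves like plain UTF-8 if it is absent.
decoded_with = with_bom.decode("utf-8-sig")
decoded_without = without_bom.decode("utf-8-sig")

# Plain "utf-8" does not strip it: the BOM survives as U+FEFF.
leaked = with_bom.decode("utf-8")
print(repr(leaked[0]))
```

Reading with `utf-8-sig` is a reasonable default when you cannot control whether the producer wrote a BOM.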
>> but that is using the BOM in a way that it was not intended
? I only know of a single intended use for a UTF8 BOM, and that is to mark the file as being UTF8 encoded.
I think I'm in the wrong here. The Unicode site itself even states that the BOM is used as a signature:
(Unicode FAQ) Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.
So using a BOM to indicate that a file is encoded in UTF-8 is very much its "intended" purpose.
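To make the FAQ answer concrete, here is a small Python sketch that encodes U+FEFF in each format and uses the resulting bytes as a signature. The function name `sniff_bom` and the `SIGNATURES` table are my own, purely for illustration.

```python
import codecs

BOM_CHAR = "\ufeff"

# The BOM bytes are simply U+FEFF run through each encoding.
for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(name, " ".join(f"{b:02x}" for b in BOM_CHAR.encode(name)))

# UTF-32 is checked first because its little-endian BOM (ff fe 00 00)
# begins with the UTF-16 little-endian BOM (ff fe).
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None
```

A result of `None` only means there is no signature; the data may still be UTF-8. The check is also a heuristic: UTF-16 little-endian text whose first character after the BOM is U+0000 would be reported as UTF-32.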
Wait, I know that endianness is not relevant for 1-byte data, but I thought that UTF-8 is a variable-length encoding, that's what it says here...
It is still just a series of ordered bytes, so no endianness issues.
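A quick way to see the difference, sketched in Python with the letter qof from earlier in the thread: UTF-8 defines one fixed byte order for a multi-byte character, while UTF-16 has two possible orders for the same 16-bit code unit.

```python
ch = "\u05e7"  # HEBREW LETTER QOF

# UTF-8: a lead byte followed by a continuation byte, always in this order.
utf8 = ch.encode("utf-8")

# UTF-16: one 16-bit unit, whose two bytes can be stored either way round.
utf16_le = ch.encode("utf-16-le")
utf16_be = ch.encode("utf-16-be")

for label, raw in (("utf-8", utf8), ("utf-16-le", utf16_le), ("utf-16-be", utf16_be)):
    print(label, " ".join(f"{b:02x}" for b in raw))
```

The character is variable-length in UTF-8 (two bytes here), but the encoding itself specifies which byte comes first, so there is nothing for a byte order mark to disambiguate.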