non-ascii character in UTF-8 string

**Malcolm McLean** · 11-08-2013

UTF-8 is pretty clever. ASCII proper is only defined for 0-127. These bytes are passed through as is, so any ASCIIZ string is also a UTF-8 string. To support larger codes, we obviously need more than one byte per character. So the rule is that a lead byte starting 110 has one byte following it, 1110 two bytes, and 11110 three bytes. 11111 and higher is illegal. The following bytes always start 10. Since every byte of a multi-byte sequence has its top bit set, an ASCII byte can never appear in a UTF-8 string unless it actually represents that character.
So you have to write a little routine, probably using ANDs to mask off bits, to skip the non-ASCII bytes. If you need the non-ASCII values, you need to check the coding conventions, which are pretty straightforward. Again you'll need logical operators and shifts to extract the value from the codes.
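As a rough illustration of the masking and shifting described above, here is a minimal sketch of a decoder for one code point. The function name `utf8_decode` and its return convention are made up for this example, and it assumes the input is NUL-terminated. It only checks the bit patterns; it makes no attempt to reject overlong encodings or out-of-range values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): decode one code point
 * starting at s, which must be NUL-terminated. Stores the value in *cp
 * and returns the number of bytes consumed, or 0 if the lead byte or a
 * following byte does not have the expected bit pattern. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;
    uint32_t value;

    assert(s != NULL && cp != NULL);

    if ((s[0] & 0x80) == 0x00) {          /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: one byte follows */
        len = 2;
        value = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: two bytes follow */
        len = 3;
        value = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: three bytes follow */
        len = 4;
        value = s[0] & 0x07;
    } else {
        return 0;                         /* stray 10xxxxxx or 11111xxx */
    }

    for (i = 1; i < len; i++) {
        /* Each following byte must be 10xxxxxx. A NUL terminator fails
         * this test, so we never read past the end of the string. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (s[i] & 0x3F);   /* append 6 payload bits */
    }

    *cp = value;
    return len;
}
```

To skip a character without needing its value, the return value alone is enough: advance the pointer by that many bytes and treat 0 as an error.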
Note that a lot of sequences are illegal in UTF-8, e.g. 110xxxxx not followed by a 10xxxxxx byte. So you can pretty easily tell whether a byte sequence of any length is UTF-8 or not, with a very low chance of a false positive.
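A structural check along those lines could look something like the sketch below. The name `looks_like_utf8` is hypothetical, and the check is deliberately loose: it only verifies that each lead byte is followed by the right number of 10xxxxxx bytes. It does not reject overlong encodings, surrogates, or values above U+10FFFF, so treat it as a heuristic rather than a full validator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..n) is structurally
 * well-formed UTF-8 (lead bytes followed by the right number of
 * continuation bytes), 0 otherwise. */
int looks_like_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0;

    assert(buf != NULL || n == 0);

    while (i < n) {
        size_t follow;                     /* continuation bytes expected */
        unsigned char b = buf[i++];

        if ((b & 0x80) == 0x00)      follow = 0;   /* 0xxxxxxx */
        else if ((b & 0xE0) == 0xC0) follow = 1;   /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) follow = 2;   /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) follow = 3;   /* 11110xxx */
        else return 0;               /* stray 10xxxxxx or 11111xxx */

        for (; follow > 0; follow--) {
            /* Sequence cut short, or next byte is not 10xxxxxx. */
            if (i >= n || (buf[i] & 0xC0) != 0x80)
                return 0;
            i++;
        }
    }
    return 1;
}
```

Pure ASCII input passes trivially, which is correct since ASCII is a subset of UTF-8. The check only becomes informative once the buffer contains bytes of 128 and above.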

**Codeplug** · 11-08-2013

>> unsigned char arr[] = "x√ab c";
The in-memory encoding of that string is implementation-defined. It could even depend on how your source code editor saved the file. See the link in my prior post for further details.

>> Or is there any way i check how many bytes(s) does a character occupy ?
You can't count the glyphs unless you know what the encoding is. If you know the encoding will always be UTF-8, then you have plenty of code to work with in this thread. If you don't know the encoding, then you can only count the ASCII bytes (<128).
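To make the two cases concrete, here is a small sketch with two hypothetical helpers. `utf8_count_code_points` assumes the string is already known to be valid, NUL-terminated UTF-8 and counts every byte that is not a 10xxxxxx continuation byte. Note that this gives code points, which is not always the same as visible glyphs (combining marks, for instance). `count_ascii_bytes` simply counts bytes below 128 and makes no assumption about the encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of code points in a NUL-terminated
 * string that is ASSUMED to be valid UTF-8. Each code point has
 * exactly one byte that is not of the form 10xxxxxx. */
size_t utf8_count_code_points(const unsigned char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((*s & 0xC0) != 0x80)    /* not a continuation byte */
            count++;
    }
    return count;
}

/* Hypothetical helper: number of bytes with a value below 128 in a
 * NUL-terminated string, whatever the encoding happens to be. */
size_t count_ascii_bytes(const unsigned char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s < 128)
            count++;
    }
    return count;
}
```

If the example string from the question is stored as UTF-8, the square root sign takes three bytes (E2 88 9A), so the array holds eight bytes plus the terminator but only six code points.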

