Thread: non-ascii character in UTF-8 string

  1. #16
    Registered User
    Join Date
    May 2012
    Posts
    505
    UTF-8 is pretty clever. ASCII proper is only defined for 0-127. These bytes are passed through as is,
    so any ASCIIZ string is also a UTF-8 string. To support larger codes, we obviously need more than one byte per character. So the rule is that if a byte starts 110 it has one byte following it, 1110 two bytes, 11110 three bytes. A byte starting 11111 is illegal. The following bytes always start 10. So an ASCII byte can never appear in a UTF-8 string unless it actually represents that character.
    So you have to write a little routine, probably using ANDs to mask off bits, to skip the non-ASCII bytes. If you need the non-ASCII values, you need to check the coding conventions, which are pretty straightforward. Again you'll need logical operators and shifts to extract the value from the codes.
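    As a rough sketch of the masking and shifting described above, something like the following would decode one code point. The name utf8_decode and its return convention are made up for illustration, and it only checks the bit patterns: it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the n bytes at s.
   Returns the number of bytes consumed, or 0 if the bit patterns are wrong.
   Structural check only: overlong forms and surrogates are NOT rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t len, i;
    uint32_t cp;

    if (n == 0)
        return 0;

    /* The lead byte tells us the sequence length; mask off the marker bits. */
    if ((s[0] & 0x80) == 0x00) {        /* 0xxxxxxx: plain ASCII */
        len = 1; cp = s[0];
    } else if ((s[0] & 0xE0) == 0xC0) { /* 110xxxxx: one byte follows */
        len = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 1110xxxx: two bytes follow */
        len = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 11110xxx: three bytes follow */
        len = 4; cp = s[0] & 0x07;
    } else {
        return 0;                       /* stray 10xxxxxx or 11111xxx */
    }

    if (len > n)
        return 0;                       /* sequence is cut short */

    /* Each following byte must be 10xxxxxx and contributes six bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

    To skip a character rather than read it, call the function and advance the pointer by the return value, ignoring *out.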
    Note that a lot of sequences are illegal in UTF-8, e.g. 110xxxxx not followed by a 10xxxxxx byte. So you can pretty easily tell if a byte sequence of any length is UTF-8 or not, with a very low chance of a false positive.
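    The "is it UTF-8 or not" test could be sketched roughly as below. The name looks_like_utf8 is invented for this example, and it applies only the lead-byte and continuation-byte rules given in this post; a strict validator would also reject overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s follow the UTF-8
   lead/continuation byte rules, 0 otherwise. Structural check only. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i++];
        size_t follow;

        if (b < 0x80)
            follow = 0;                 /* ASCII */
        else if ((b & 0xE0) == 0xC0)
            follow = 1;                 /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0)
            follow = 2;                 /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0)
            follow = 3;                 /* 11110xxx */
        else
            return 0;                   /* stray 10xxxxxx or 11111xxx */

        /* Every promised following byte must exist and start with 10. */
        for (; follow > 0; follow--) {
            if (i >= n || (s[i] & 0xC0) != 0x80)
                return 0;
            i++;
        }
    }
    return 1;
}
```

    Text in a single-byte encoding such as Latin-1 usually fails this quickly, because an accented letter looks like a lead byte with no 10xxxxxx byte after it. Pure ASCII input passes, which is correct since ASCII is a subset of UTF-8.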
    I'm the author of MiniBasic: How to write a script interpreter and Basic Algorithms
    Visit my website for lots of associated C programming resources.
    https://github.com/MalcolmMcLean


  2. #17
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> unsigned char arr[] = "x√ab c";
    The in-memory encoding of that string is implementation defined. It could even depend on how your source code editor saved the file. See the link in my prior post for further details.

    >> Or is there any way i check how many bytes(s) does a character occupy ?
    You can't count the glyphs unless you know what the encoding is. If you know the encoding will always be UTF-8, then you have plenty of code to work with in this thread. If you don't know the encoding then you can only count the ASCII bytes (<128).
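    For what it's worth, here is a minimal sketch of both counts. The function names are made up for this post, and the first one assumes the string really is well-formed UTF-8. It counts code points, which is not always the same as glyphs (combining marks, for example, are separate code points).

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string that is
   ASSUMED to be well-formed UTF-8. Every character contributes exactly one
   byte that is not a 10xxxxxx continuation byte, so count those. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: count bytes below 128. This needs no knowledge of
   the encoding, which is all you can do when the encoding is unknown. */
size_t count_ascii_bytes(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if ((unsigned char)*s < 0x80)
            count++;
    }
    return count;
}
```

    If the array from the question holds the square root sign as the three UTF-8 bytes E2 88 9A, the first function gives 6 and the second gives 5, while strlen gives 8.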

    gg
