Thread: utf8

  1. #1
    Registered User
    Join Date
    Jun 2008


    Hi CBoard members.
    I want to write a function that returns true or false depending on whether the given string is UTF-8 encoded.

    For example:
    bool isUtf8(const string & str) {}
    Since this is a multi-byte encoding, I am currently just checking for null chars: if there is more than one null char, I treat the string as UTF-8 encoded; otherwise it is just a plain ASCII string.

    What else can I check for in my function?

  2. #2
    Cat without Hat CornedBee
    Join Date
    Apr 2003
    The null char check won't work. UTF-8 never uses the byte 0x00 inside a multi-byte sequence; a NUL byte appears only if the text itself contains the character U+0000.

    There is no definitive way of checking for UTF-8, but there are some leads.
    1) If the first three bytes are 0xEF 0xBB 0xBF, you've probably found a UTF-8-encoded byte order mark (BOM), and the probability of UTF-8 is very high. But UTF-8 doesn't require the BOM, some editors don't understand it, and to the best of my knowledge, Notepad is the only editor that inserts it.
    2) If you don't have a BOM, parse the whole file, or at least a part of it (the more you read, the better the results, but of course it takes longer). If you find only bytes in the range 0x00 to 0x7F, the text is probably ASCII, so you can treat it as UTF-8. If you find bytes with the high bit set, check them for UTF-8 validity. If any sequence is invalid, you don't have UTF-8. If all sequences are valid, you probably have UTF-8.
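
The two leads above could be sketched roughly as follows. This is only a minimal illustration, not a definitive detector: the names hasUtf8Bom and isValidUtf8 are made up for this example, and the validity rules applied (no overlong forms, no surrogates, nothing above U+10FFFF) are the usual well-formedness rules, assumed here rather than spelled out in the post.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` starts with the UTF-8 encoded BOM (EF BB BF).
bool hasUtf8Bom(const std::string& s) {
    return s.size() >= 3 &&
           static_cast<unsigned char>(s[0]) == 0xEF &&
           static_cast<unsigned char>(s[1]) == 0xBB &&
           static_cast<unsigned char>(s[2]) == 0xBF;
}

// Hypothetical helper: true if every byte sequence in `s` is well-formed UTF-8.
// Pure ASCII input also returns true, because ASCII is a subset of UTF-8.
bool isValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80) {  // single ASCII byte
            ++i;
            continue;
        }

        std::size_t len = 0;      // total bytes in this sequence
        unsigned long cp = 0;     // code point decoded so far
        unsigned long minCp = 0;  // smallest code point allowed for this length
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minCp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence is cut off at the end

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a 10xxxxxx continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minCp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Something like isValidUtf8(str) could serve as the body of the isUtf8 function from the first post, with hasUtf8Bom(str) as an extra hint. Keep in mind that a true result only means "probably UTF-8": short texts in other encodings can pass by accident.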
