Thread: Unicode - UTF-8

  1. #1
    Registered User
    Join Date
    Oct 2007
    Posts
    22

    Unicode - UTF-8

    Hi all

    This post is related to programming, but I don't think it's specific to either C or C++. I read a lot about Unicode and the UTF-8 encoding today, but there are still some things I don't really understand when it comes to programming.

    Assume my operating system is configured to use only UTF-8 encoding, for the filesystem and everything else. Now I have a text file (UTF-8 encoded) and read it into a char array using 'fgets' (or a C++-specific way to read the file; it doesn't matter).

    When I print the char array (or string) that was read, using 'printf' (or 'cout'), will it be printed correctly (assuming the text file contains some 2-byte UTF-8 encoded characters and not only ASCII-compatible ones)?
    Is there some kind of transformation done between the file and whatever is read into the char array from my program? Or would I have to write the program in such a way that it can handle UTF-8 encoded text?
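
    Concretely, the kind of program I have in mind is just something like this (a minimal sketch; the filename is only a placeholder):

        #include <cstdio>

        int main()
        {
            // "input.txt" stands in for some UTF-8 encoded text file
            FILE *f = std::fopen("input.txt", "r");
            if (!f)
                return 1;

            char line[1024];
            while (std::fgets(line, sizeof line, f))
                std::printf("%s", line);   // print the raw bytes exactly as read

            std::fclose(f);
            return 0;
        }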

    Regards
    Rafael Aggeler

  2. #2
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Nope, you need to set the code page. The C/C++ library has no built-in knowledge of Unicode encodings; YOU have to tell it that it's dealing with UTF-8.
    Don't ask me how, though, because I don't really know...
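
    (For what it's worth, on Windows "setting the code page" for console output would look roughly like this. This is an untested sketch using the Win32 call SetConsoleOutputCP; whether the glyphs actually show up also depends on the console font.)

        #include <windows.h>
        #include <cstdio>

        int main()
        {
            // Tell the console to interpret output bytes as UTF-8 (code page 65001)
            SetConsoleOutputCP(CP_UTF8);

            // "\xC3\xA9" is the UTF-8 encoding of 'é' (U+00E9)
            std::printf("caf\xC3\xA9\n");
            return 0;
        }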
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  3. #3
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Quote Originally Posted by plan7 View Post
    ...(assuming the text file contains some 2-byte UTF-8 encoded characters
    There is no such critter. UTF-8 has a range of 0x00 to 0xff, therefore they are all single-byte characters.

    UTF-16 is a two-byte-per-character encoding.

    Todd
    Mainframe assembler programmer by trade. C coder when I can.

  4. #4
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    But there is sometimes a short marker (a byte order mark, two or three bytes) at the beginning of a file that tells the encoding of the file, i.e. UTF-8, UTF-16, etc.
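
    A rough sketch of sniffing that byte order mark when opening a file, using the standard BOM byte patterns (the filename is a placeholder):

        #include <cstdio>
        #include <cstddef>

        // Peek at the first bytes of a file and guess the encoding from a BOM, if any.
        const char *detect_bom(const unsigned char *buf, std::size_t n)
        {
            if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
                return "UTF-8 (with BOM)";
            if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
                return "UTF-16 little-endian";
            if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
                return "UTF-16 big-endian";
            return "no BOM (plain ASCII, UTF-8 without BOM, etc.)";
        }

        int main()
        {
            FILE *f = std::fopen("input.txt", "rb");
            if (!f)
                return 1;
            unsigned char buf[4];
            std::size_t n = std::fread(buf, 1, sizeof buf, f);
            std::printf("%s\n", detect_bom(buf, n));
            std::fclose(f);
            return 0;
        }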

  5. #5
    Registered User
    Join Date
    Oct 2007
    Posts
    22
    There is no such critter. UTF-8 has a range of 0x00 to 0xff, therefore they are all single-byte characters.
    Sure there is. There are multi-byte sequences, and these will not make sense when you look at them byte by byte.
    It would not be possible to support the whole range of Unicode with just one byte per character...

    From: http://weblogs.java.net/blog/satyak/...g_with_se.html
    Although it says '8' a character in utf-8 can take multiple bytes and hence can represent all the variations in the world's alphabet.
    Actually, the 8 in UTF-8 says that the units used are 8-bit bytes. The 16 in UTF-16 means that the units are 16-bit words, not 8-bit bytes. But it doesn't mean that one character gets mapped to one byte. Otherwise extended ASCII with its 8 bits would have been sufficient, and Unicode and its encodings would not exist...
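
    To make the multi-byte point concrete, here is a small sketch (my own example, not from the article) that counts code points in a UTF-8 string by skipping continuation bytes, which always have the bit pattern 10xxxxxx:

        #include <cstdio>
        #include <cstddef>

        // Count Unicode code points in a UTF-8 string by counting the bytes
        // that are NOT continuation bytes (continuation bytes look like 10xxxxxx).
        std::size_t utf8_length(const char *s)
        {
            std::size_t count = 0;
            for (; *s; ++s)
                if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
                    ++count;
            return count;
        }

        int main()
        {
            const char *text = "a\xC3\xA9";   // 'a' followed by 'é' (2 bytes in UTF-8)
            std::printf("bytes: 3, code points: %zu\n", utf8_length(text));
            return 0;
        }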
    Last edited by plan7; 04-05-2008 at 12:09 PM.

  6. #6
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by plan7 View Post
    Is there some kind of transformation done between the file and whatever is read into the char array from my program? Or would I have to write the program in such a way that it can handle UTF-8 encoded text?
    Almost certainly the latter. I'm not too familiar with this subject, but I find the Standard Library to have very poor and confusing support for Unicode. The way it's currently done is almost screaming for everyone not to use it, except for those really wanting to get into a whole lot of pain.

    Anyway, the reason I'm answering is to tell you the solution is in the Standard Library. You may want to get your hands on Josuttis' The C++ Standard Library. It's the only book I know of that tries to explain it in somewhat layman terms. There are replacement objects (locales and facets) that will handle your Unicode needs.
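
    For what it's worth, the kind of thing Josuttis covers looks roughly like this: imbue a wide stream with a UTF-8 locale so the library converts the bytes to wide characters as it reads. Only a sketch; it assumes the locale name "en_US.UTF-8" actually exists on the system, and the filename is a placeholder:

        #include <fstream>
        #include <iostream>
        #include <locale>
        #include <stdexcept>
        #include <string>

        int main()
        {
            std::wifstream in("input.txt");
            try {
                // Locale names are platform-specific; this one is common on Linux
                in.imbue(std::locale("en_US.UTF-8"));
            } catch (const std::runtime_error &) {
                std::cerr << "UTF-8 locale not available\n";
                return 1;
            }

            std::wstring line;
            while (std::getline(in, line))
                std::wcout << line << L'\n';   // wcout's own locale still matters here
            return 0;
        }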
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  7. #7
    Registered User
    Join Date
    Oct 2007
    Posts
    22
    Ok,...thanks
    Well, it seems that it does not happen often that multibyte-encoded characters are used, otherwise we would have a mess today...
    At least, text stored in UTF-8 and read by software that doesn't know about UTF-8 would result in a mess...

  8. #8
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    >When I print the char array (or string) that was read, using 'printf' (or 'cout'), will it be printed correctly
    Unlikely, unless the file happens to contain only single-byte characters. As soon as you hit a multi-byte character, you'll get garbage output. Even then, if the output medium doesn't properly support UTF-8, you might still get garbage (or nothing).

    >Is there some kind of transformation done between the file
    >and whatever is read into the char array from my program?
    Only if you ask for it, such as using wide character I/O. But if you ask for char, you're going to get single-byte input and any transformation is up to you.

    >UTF-8 has a range of 0x00 to 0xff, therefore they are all single-byte characters.
    UTF-8 has a range of 0x00 to 0x10FFFF. Up to four bytes can be used for a single code point.

    >UTF-16 is a two-byte-per-character encoding.
    UTF-16 is also a multi-byte encoding. The surrogate system takes up a pair of 16-bit entities, so UTF-16 maxes out at four bytes as well.
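
    To make the ranges concrete, here is a sketch (my own example) that encodes a single code point both ways; U+1D11E, the musical G clef, needs four bytes in UTF-8 (F0 9D 84 9E) and a surrogate pair in UTF-16 (D834 DD1E):

        #include <cstdio>

        // Encode one Unicode code point (<= 0x10FFFF) into UTF-8; returns the byte count.
        int to_utf8(unsigned int cp, unsigned char out[4])
        {
            if (cp < 0x80)    { out[0] = cp; return 1; }
            if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                                out[1] = 0x80 | (cp & 0x3F); return 2; }
            if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                                out[1] = 0x80 | ((cp >> 6) & 0x3F);
                                out[2] = 0x80 | (cp & 0x3F); return 3; }
            out[0] = 0xF0 | (cp >> 18);
            out[1] = 0x80 | ((cp >> 12) & 0x3F);
            out[2] = 0x80 | ((cp >> 6) & 0x3F);
            out[3] = 0x80 | (cp & 0x3F);
            return 4;
        }

        // Encode one code point into UTF-16; returns the number of 16-bit units (1 or 2).
        int to_utf16(unsigned int cp, unsigned short out[2])
        {
            if (cp < 0x10000) { out[0] = cp; return 1; }
            cp -= 0x10000;
            out[0] = 0xD800 + (cp >> 10);     // high surrogate
            out[1] = 0xDC00 + (cp & 0x3FF);   // low surrogate
            return 2;
        }

        int main()
        {
            unsigned char  u8[4];
            unsigned short u16[2];
            unsigned int   cp = 0x1D11E;      // musical G clef
            std::printf("UTF-8 uses %d bytes, UTF-16 uses %d units\n",
                        to_utf8(cp, u8), to_utf16(cp, u16));
            return 0;
        }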
    My best code is written with the delete key.

  9. #9
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by plan7 View Post
    Assume my operating system is configured to use only UTF-8 encoding, for the filesystem and everything else. Now I have a text file (UTF-8 encoded) and read it into a char array using 'fgets' (or a C++-specific way to read the file; it doesn't matter).

    When I print the char array (or string) that was read, using 'printf' (or 'cout'), will it be printed correctly (assuming the text file contains some 2-byte UTF-8 encoded characters and not only ASCII-compatible ones)?
    Yes, if your system is really set up for UTF-8 everywhere, a simple program that just reads data and then prints it out should display the correct result. C and C++ should be agnostic to character encoding (except for the single restriction that in C a 0 byte indicates "end of string", and this was addressed in UTF-8 by designing the encoding so that multi-byte sequences never contain a zero byte).

    If your program actually needs to PROCESS the data in some way, then it will of course have to be knowledgeable about the UTF-8 encoding. In practice it is easier to transcode to a flat code like UTF-16 (which isn't truly flat either, as Prelude stated), do the work there, then transcode back to UTF-8 for output.

    But as far as handling the raw bytes, you don't have to do anything special. UTF-8 was designed that way on purpose.
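
    A rough sketch of that transcode-process-transcode-back idea using only the standard locale machinery (it assumes the program runs under a UTF-8 locale picked up from the environment, and that one wchar_t holds one code point, as on most Unix systems):

        #include <clocale>
        #include <cstdio>
        #include <cstdlib>

        int main()
        {
            // Adopt the environment's locale (assumed to be a UTF-8 one here)
            std::setlocale(LC_ALL, "");

            const char *utf8 = "gr\xC3\xBC\xC3\x9F";   // "grüß" in UTF-8

            // UTF-8 -> wide characters
            wchar_t wide[64];
            std::size_t n = std::mbstowcs(wide, utf8, 64);
            if (n == static_cast<std::size_t>(-1))
                return 1;

            // ... process 'wide' here, one code point per element ...

            // wide characters -> UTF-8 again
            char back[64];
            std::wcstombs(back, wide, sizeof back);
            std::printf("%zu code points, round-tripped to: %s\n", n, back);
            return 0;
        }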

  10. #10
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by plan7 View Post
    Ok,...thanks
    Well, it seems that it does not happen often that multibyte-encoded characters are used, otherwise we would have a mess today...
    At least, text stored in UTF-8 and read by software that doesn't know about UTF-8 would result in a mess...
    The presence of multi-byte characters does not necessarily cause a problem if the data is treated as opaque by client software. The presence of zero bytes, though, can really screw up C code. Therefore UTF-8 was designed not to use zero bytes anywhere.
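
    A small illustration of the "opaque" point (my own sketch): because every byte of a multi-byte UTF-8 sequence is 0x80 or above, byte-oriented functions like strchr still find ASCII delimiters safely:

        #include <cstdio>
        #include <cstring>

        int main()
        {
            // A UTF-8 path containing a multi-byte character ("süd/ost");
            // the literal is split so the 'd' is not swallowed by the hex escape.
            const char *path = "s\xC3\xBC" "d/ost";

            // Searching for the ASCII '/' cannot hit the middle of a multi-byte
            // sequence, because continuation bytes are always 0x80..0xBF.
            const char *slash = std::strchr(path, '/');
            if (slash)
                std::printf("directory part has %d bytes\n",
                            static_cast<int>(slash - path));
            return 0;
        }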

  11. #11
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    And unfortunately, Windows does not support UTF-8 natively, so printing it will only result in garbage no matter what. Linux might be another matter, though.
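
    (If someone does need to get UTF-8 text onto a Windows console, the usual detour is to convert it to UTF-16 and use the wide console API. An untested sketch with MultiByteToWideChar and WriteConsoleW; it assumes output goes to a real console, not a redirected file:)

        #include <windows.h>

        int main()
        {
            const char *utf8 = "caf\xC3\xA9\n";   // "café" in UTF-8

            // First call gets the required length (including the terminating null)
            int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
            wchar_t wide[64];
            if (len <= 0 || len > 64)
                return 1;
            MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);

            // Write the UTF-16 text directly to the console (len - 1 skips the L'\0')
            DWORD written = 0;
            WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wide, len - 1, &written, NULL);
            return 0;
        }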

  12. #12
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Most modern Linux distros are set up to use UTF-8 everywhere. If you read in a UTF-8 file and print it to the console, this should be handled correctly. (And the fact that you were able to configure the OS strongly suggests you're using some sort of Unix.)

    Generally speaking, though, C++'s character handling is a mess.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  13. #13
    Registered User Jaqui's Avatar
    Join Date
    Feb 2005
    Posts
    416
    Quote Originally Posted by Prelude View Post
    >UTF-8 has a range of 0x00 to 0xff, therefore they are all single-byte characters.
    UTF-8 has a range of 0x00 to 0x10FFFF. Up to four bytes can be used for a single code point.

    >UTF-16 is a two-byte-per-character encoding.
    UTF-16 is also a multi-byte encoding. The surrogate system takes up a pair of 16-bit entities, so UTF-16 maxes out at four bytes as well.
    The END USER difference is in the language-specific characters supported; UTF-16 supports a few more Asian languages than UTF-8 does.
    [ I know, slightly OT ]

    The benefit of fighting C++ for UTF-8 or UTF-16 functionality is in supporting more language-specific characters with only one encoding. If you do not need to support multiple character sets with your application, then you may not want to use the UTF-8 or UTF-16 encoding.
    I would use it myself, for the simplified support of multiple languages later on when/if needed.

    The *nix systems use it simply because it is far easier to use UTF-8 than to make end users install support for multiple languages when they do not know what language support they may need.
    A benefit of this support is visible to me regularly: I have two email addresses receiving mail from the same list. The Yahoo address cannot display the sender's name for one list member; his Japanese characters come up as unknown-character blocks in Yahoo, while my Linux-based SeaMonkey email client displays his name in the character set he uses.
    Quote Originally Posted by Jeff Henager
    If the average user can put a CD in and boot the system and follow the prompts, he can install and use Linux. If he can't do that simple task, he doesn't need to be around technology.

  14. #14
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    The END USER difference is in the language-specific characters supported; UTF-16 supports a few more Asian languages than UTF-8 does.
    No, it doesn't. Both UTF-16 and UTF-8 are complete encodings of the Unicode character set. They support exactly the same range of characters.

    You may be confronted by old decoders that do not support 4-byte UTF-8 characters and thus can't decode characters outside the BMP. But the same goes for UCS-2 decoders (UTF-16 without surrogates).
