
View Full Version : Unicode - UTF-8



plan7
04-05-2008, 10:11 AM
Hi all

This post is actually related to programming, but I think it's neither C- nor C++-specific. I read a lot about Unicode and the UTF-8 encoding today, but there are still some things I don't really understand when programming.

Assuming my operating system is configured only to use UTF-8 encoding, also for filesystem and everything. Now I got a text-file (UTF-8 encoded) and read it into a char-array using 'fgets' (or a C++ specific way to read the file, doesn't matter).

When I print the read char-array (or string) using 'printf' (or 'cout'), will it be printed correctly (assuming the text-file contains some 2-byte UTF-8 encoded characters and not only ASCII compatible ones)?
Is there some kind of transformation done between the file and whatever is read into the char-array from my program? Or would I have to write the program in such a way it can handle UTF-8 encoded text?

Regards
Rafael Aggeler

Elysia
04-05-2008, 11:02 AM
Nope, you need to set the code page. The C/C++ library has no knowledge of Unicode format specifiers. YOU have to tell it that it's dealing with utf-8.
Don't ask me how, though, because I don't really know...

Dino
04-05-2008, 11:30 AM
...(assuming the text-file contains some 2-byte UTF-8 encoded characters

There is no such critter. UTF-8 has a range of 0x00 to 0xFF, therefore they are all single-byte characters.

UTF-16 is a two-byte-per-character encoding.

Todd

Elysia
04-05-2008, 11:43 AM
But there is sometimes a byte order mark (BOM) at the beginning of a file that indicates the encoding of the file, i.e. UTF-8, UTF-16, etc.

plan7
04-05-2008, 11:46 AM
There is no such critter. UTF-8 has a range of 0x00 to 0xFF, therefore they are all single-byte characters.

Sure there is. There are multi-byte sequences, and these will not make sense when looked at byte by byte.
It would not be possible to support the whole range of Unicode with just one byte per character...

From: http://weblogs.java.net/blog/satyak/archive/2004/05/working_with_se.html

Although it says '8', a character in UTF-8 can take multiple bytes and hence can represent all the variations in the world's alphabets.

Actually, the 8 in UTF-8 says that the code units are 8-bit bytes. The 16 in UTF-16 means that the units are 16-bit words, not 8-bit bytes. It doesn't mean that one character maps to one byte; otherwise extended ASCII with its 8 bits would have been sufficient, and Unicode and its encoding systems would not exist...

Mario F.
04-05-2008, 12:19 PM
Is there some kind of transformation done between the file and whatever is read into the char-array from my program? Or would I have to write the program in such a way it can handle UTF-8 encoded text?

Almost certainly the latter, though I'm not too familiar with this subject. I find the Standard Library's support for Unicode very poor and confusing. The way it's currently done is almost screaming for everyone not to use it, except for those really wanting to get into a whole lot of pain.

Anyway, the reason I'm answering is to tell you the solution is in the Standard Library. You may want to get hold of Josuttis' The C++ Standard Library. It's the only book I know of that tries to explain it in somewhat layman's terms. There are replacement objects that will handle your Unicode needs.

plan7
04-05-2008, 01:02 PM
Ok,...thanks :)
Well, it seems that multibyte-encoded characters are not used very often, otherwise we would have a mess today...
At least, text stored in UTF-8 and read by software that doesn't know about UTF-8 would result in a mess...

Prelude
04-07-2008, 08:32 AM
>When I print the read char-array (or string) using 'printf' (or 'cout'), will it be printed correctly
Unlikely, unless the file happens to contain only single-byte characters. As soon as you hit a multi-byte character, you'll get garbage output. Even then, if the output medium doesn't properly support UTF-8, you might still get garbage (or nothing).

>Is there some kind of transformation done between the file
>and whatever is read into the char-array from my program?
Only if you ask for it, such as using wide character I/O. But if you ask for char, you're going to get single-byte input and any transformation is up to you.

>UTF-8 has a range of 0x00 to 0xFF, therefore they are all single-byte characters.
UTF-8 covers code points 0x00 to 0x10FFFF. Up to four bytes can be used for a single code point.

>UTF-16 is a two-byte-per-character encoding.
UTF-16 is also a multi-byte encoding. The surrogate system takes up a pair of 16-bit entities, so UTF-16 maxes out at four bytes as well.

brewbuck
04-07-2008, 09:01 AM
Assuming my operating system is configured only to use UTF-8 encoding, also for filesystem and everything. Now I got a text-file (UTF-8 encoded) and read it into a char-array using 'fgets' (or a C++ specific way to read the file, doesn't matter).

When I print the read char-array (or string) using 'printf' (or 'cout'), will it be printed correctly (assuming the text-file contains some 2-byte UTF-8 encoded characters and not only ASCII compatible ones)?

Yes, if your system is really set up for UTF-8 everywhere, a simple program which just reads data and then prints it out should display the correct result. C and C++ are agnostic to character encoding (except for the single restriction that in C, a 0 byte indicates "end of string", and this was addressed in UTF-8 by designing the encoding to be free of zero bytes).

If your program actually needs to PROCESS the data in some way, then it will of course have to be knowledgeable about the UTF-8 encoding. In practice it is easier to transcode to a flat encoding like UTF-16 (which isn't truly flat either, as Prelude stated), do the work there, then transcode back to UTF-8 for output.

But as far as handling the raw bytes, you don't have to do anything special. UTF-8 was designed that way on purpose.

brewbuck
04-07-2008, 09:03 AM
Ok,...thanks :)
Well, it seems that multibyte-encoded characters are not used very often, otherwise we would have a mess today...
At least, text stored in UTF-8 and read by software that doesn't know about UTF-8 would result in a mess...

The presence of multi-byte characters does not necessarily cause a problem if the data is treated as opaque by client software. The presence of zero bytes, though, can really screw up C code. Therefore UTF-8 was designed not to use zero bytes anywhere.

Elysia
04-07-2008, 09:34 AM
And unfortunately, Windows does not support UTF-8 natively, so printing it will only result in garbage no matter what. Linux might be another matter, though.

CornedBee
04-07-2008, 10:38 AM
Most modern Linux distros are set up to use UTF-8 everywhere. If you read in a UTF-8 file and print it to the console, this should be handled correctly. (And the fact that you were able to configure the OS strongly suggests you're using some sort of Unix.)

Generally speaking, though, C++'s character handling is a mess.

Jaqui
04-15-2008, 06:52 PM
>UTF-8 has a range of 0x00 to 0xFF, therefore they are all single-byte characters.
UTF-8 covers code points 0x00 to 0x10FFFF. Up to four bytes can be used for a single code point.

>UTF-16 is a two-byte-per-character encoding.
UTF-16 is also a multi-byte encoding. The surrogate system takes up a pair of 16-bit entities, so UTF-16 maxes out at four bytes as well.

The END USER difference is in the language-specific characters supported; UTF-16 supports a few more Asian languages than UTF-8 does.
[I know, slightly off-topic]

The benefit of fighting C++ for UTF-8 or UTF-16 functionality is supporting more language-specific characters with only one encoding. If you do not need to support multiple character sets in your application, then you may not want to use UTF-8 or UTF-16.
I would use it myself, for the simplified support of multiple languages later on, when/if needed.

The *nix systems use it simply because it is far easier to use UTF-8 than to make end users install support for multiple languages when they do not know what language support they may need.
A benefit of this support is visible to me regularly: I have two email addresses getting mail from the same list. The Yahoo address cannot display the sender's name for one list member; his Japanese characters come up as unknown-character blocks in Yahoo, while my Linux-based SeaMonkey email client displays his name in the character set he uses.

CornedBee
04-16-2008, 03:16 AM
The END USER difference is in the language-specific characters supported; UTF-16 supports a few more Asian languages than UTF-8 does.
No, it doesn't. Both UTF-16 and UTF-8 are complete encodings of the Unicode character set. They support exactly the same range of characters.

You may be confronted by old decoders that do not support 4-byte UTF-8 characters and thus can't decode characters outside the BMP. But the same goes for UCS-2 decoders (UTF-16 without surrogates).