wchar_t, i18n, l10n and other oddities

05-09-2008
samblack

Greetings all.

I'm writing myself a little GTK app and I'd like to have proper internationalization/localization support. I've done a lot of reading on the subject but a couple things still baffle me:

I've decided to default to UTF-8, which seems to be the norm for most programs now. I understand that UTF-8 is encoded and works with standard char; however, I've also read that I should be using wchar_t throughout my program instead in case I need to change to UTF-16 or other encodings. Is this true, or should I simply use char?

It's my understanding that wchar_t both increases the memory usage of a program rather dramatically and is also platform dependent (2 bytes in win32, 4 in linux, etc) so that's my most biting question...

Should I be using wchar_t or std::wstring?
What really is the difference between char/std::string and wchar_t/std::wstring besides basic encapsulation? Does it merely exist to make die-hard C++ people warm and fuzzy, or is there a real technical reason why the [w]string class is superior to simple [w]char arrays? (Sorry, I come from a C background)

How does encoded (UTF-8,UTF-16,etc) data work across a network? Is it basically the same as using char or are there "weird" things I need to know/look out for?

Along those lines, how does data look to a human if saved to a file? I'd like people in the US to be able to see the data in a regular text editor, but I'm afraid if everything is UTF-8 encoded, it'll look like garbly-gook....

Sorry for all the questions and I really appreciate any insight you can give me.

j
05-09-2008
King Mir

Quote:

Originally Posted by samblack

Greetings all.

I'm writing myself a little GTK app and I'd like to have proper internationalization/localization support. I've done a lot of reading on the subject but a couple things still baffle me:

I've decided to default to UTF-8, which seems to be the norm for most programs now. I understand that UTF-8 is encoded and works with standard char; however, I've also read that I should be using wchar_t throughout my program instead in case I need to change to UTF-16 or other encodings. Is this true, or should I simply use char?

UTF-8 is hard to work with in a program, because it is a variable width encoding.

wchar_t is meant to provide a wide character type, so that each character can take more than one byte. This makes fixed width international encoding possible.

Quote:

It's my understanding that wchar_t both increases the memory usage of a program rather dramatically and is also platform dependent (2 bytes in win32, 4 in linux, etc) so that's my most biting question...

wchar_t is 16 bits on Windows, because Microsoft decided that UTF-16 behaves like a fixed width encoding in most everyday uses. But the platform differences do not matter, because wchar_t is not an encoding specification; it's just a character type. What matters is the implementation of the stream library, which is used to read and write characters from and to files. That will depend on the compiler, rather than the platform.
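
If you want to see what your own compiler does, here is a minimal sketch. The helper name wchar_bits is made up for illustration; the only thing it relies on is sizeof and CHAR_BIT.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// Hypothetical helper: number of bits in one wchar_t with this compiler.
// Commonly 16 with Windows compilers and 32 with GCC on Linux, but the
// standard does not fix the value, so treat those numbers as typical only.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Printing wchar_bits() on each target you care about is a quick way to confirm the 2-byte versus 4-byte difference mentioned in the question.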

Quote:

Should I be using wchar_t or std::wstring?
What really is the difference between char/std::string and wchar_t/std::wstring besides basic encapsulation? Does it merely exist to make die-hard C++ people warm and fuzzy, or is there a real technical reason why the [w]string class is superior to simple [w]char arrays? (Sorry, I come from a C background)

std::wstring and std::string enable easy string modification without worrying about buffer sizes; they will expand to fit any amount of data, unlike a fixed size character array. They provide comparison operators. They have a few useful string manipulation methods. They also have methods common to STL containers, which allow them to act like vector<char>.
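
A small sketch of those points, for someone coming from C. The function names make_greeting and count_char are invented for this example; everything they call is standard library.

```cpp
#include <algorithm>  // std::count
#include <cstddef>    // std::size_t
#include <string>

// Hypothetical example: build a string of any length without sizing a buffer.
std::string make_greeting(const std::string& name) {
    std::string s = "Hello";
    s += ", ";         // the string grows as needed
    s += name;         // no strcat, no length bookkeeping
    s.push_back('!');  // container-style method, as on vector<char>
    return s;
}

// Hypothetical example: iterators let STL algorithms work on a string.
std::size_t count_char(const std::string& s, char c) {
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), c));
}
```

Comparison works with the usual operators, so make_greeting("a") < make_greeting("b") does what strcmp would do in C. The same code works with std::wstring if you switch the literals to L"..." and char to wchar_t.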

Quote:

How does encoded (UTF-8,UTF-16,etc) data work across a network? Is it basically the same as using char or are there "weird" things I need to know/look out for?

That depends on your network libraries. Both sides need to agree on an encoding standard, be it UTF-8, UTF-16, ASCII, or something else.

Quote:

Along those lines, how does data look to a human if saved to a file? I'd like people in the US to be able to see the data in a regular text editor, but I'm afraid if everything is UTF-8 encoded, it'll look like garbly-gook....

Many text file formats have a header that specifies the encoding used (for plain text files, this can be an optional byte order mark, or BOM, at the start of the file). That way a text editor will know how to interpret them.
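
As a rough sketch of what an editor might check, assuming the file contents have already been read into a std::string. Both function names are made up for illustration. The UTF-8 byte order mark is the three bytes EF BB BF, and text that uses only 7-bit ASCII characters is byte-for-byte the same in ASCII and UTF-8, so plain English text will not look garbled.

```cpp
#include <string>

// Hypothetical helper: true if the data starts with the UTF-8 BOM (EF BB BF).
bool has_utf8_bom(const std::string& data) {
    return data.size() >= 3 &&
           static_cast<unsigned char>(data[0]) == 0xEF &&
           static_cast<unsigned char>(data[1]) == 0xBB &&
           static_cast<unsigned char>(data[2]) == 0xBF;
}

// Hypothetical helper: true if every byte is 7-bit ASCII. Such data reads
// identically whether it is interpreted as ASCII or as UTF-8.
bool is_plain_ascii(const std::string& data) {
    for (unsigned char c : data) {
        if (c > 0x7F) return false;
    }
    return true;
}
```

The BOM is optional in UTF-8, so its absence does not prove a file is not UTF-8.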
05-09-2008
samblack

Quote:

UTF-8 is hard to work with in a program, because it is a variable width encoding.

Wouldn't that be taken care of simply by using the String object?

Quote:

wchar_t is meant to provide a multi-byte character type, so that each character is more than one byte. This makes fixed width international encoding possible.

Yes I understand that. I guess my question is, do I need to use wchar_t if I'll be using UTF-8 as a default? My understanding is that if I wish to add the option of using another encoding type, say for our asian friends, I might as well code the whole app using wchar_t as I'll need it for UTF-16 and others... Is this correct or should I just forget about wchar_t altogether?

Quote:

std::wstring and std::string enable easy string modification without worrying about buffer sizes

So basically it's just for convenience? Again, sorry, I come from C and am used to dealing with memory and array size hassles...

Quote:

That depends on your network libraries. Both sides need to agree on an encoding standard, be it UTF-8, UTF-16, ASCII, or something else.

I most certainly understand that both client and server need to be speaking the same encoding but what libraries are you referring to? All you really need to communicate is a buffer and the standard library. Give me a socket, send, recv, and a buffer and I'll give you a client/server app :)
I guess my problem here is, for example, in ASCII I can count on things like "if I find a \r\n then I know the server is done sending information", which I don't know if I can do with UTF-8 encoded data (or even how to input it into a String() buffer)...
05-09-2008
samblack

I'm sorry, I forgot to thank you for your reply. It was very educational!
05-09-2008
Codeplug

What you have to remember is that UTF-8 is an encoding of a sequence of Unicode characters. UTF-16LE is an encoding of Unicode characters.

Unicode is a character set, and UTF-8 and UTF-16LE are ways to encode a given sequence of Unicode characters. So if a spoken language can be represented in Unicode, then it can be encoded using either method.
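
To make that concrete, here is a sketch that encodes the same code point both ways. The names encode_utf8 and encode_utf16le are invented for this example, and the code assumes the input is a valid code point (at most 0x10FFFF and not a surrogate); it does no validation.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as UTF-8 bytes (1 to 4 bytes).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Appends one 16-bit unit, low byte first (little-endian).
static void put16le(std::string& out, std::uint32_t unit) {
    out += static_cast<char>(unit & 0xFF);
    out += static_cast<char>((unit >> 8) & 0xFF);
}

// Hypothetical helper: encode one code point as UTF-16LE bytes (2 or 4 bytes).
std::string encode_utf16le(std::uint32_t cp) {
    std::string out;
    if (cp < 0x10000) {
        put16le(out, cp);
    } else {
        cp -= 0x10000;                     // split into a surrogate pair
        put16le(out, 0xD800 | (cp >> 10));
        put16le(out, 0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

For the euro sign U+20AC, encode_utf8 gives the bytes E2 82 AC and encode_utf16le gives AC 20. Same character, two byte sequences.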

So - if you support UTF-8, then you support Unicode (using a UTF-8 encoding).

Good info:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

http://www.i18nguy.com/unicode/c-unicode.html
MS-targeted, but very informative. Has a table of file BOMs. Main site is nice too.

gg
05-09-2008
King Mir

Quote:

Originally Posted by samblack

Wouldn't that be taken care of simply by using the String object?

No. std::string treats each char as a single element, but does not prevent you from using UTF-8 anyway. For example, length() and size() will return the number of chars, not the number of Unicode characters.
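
A minimal sketch of the difference, assuming the string holds well-formed UTF-8. The name utf8_length is made up for illustration. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so counting the bytes that are not continuation bytes counts code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of Unicode code points in a UTF-8 string.
// Assumes well-formed input; it does not validate anything.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // skip continuation bytes (10xxxxxx)
    }
    return n;
}
```

For "caf\xC3\xA9" (the word café in UTF-8), size() is 5 while utf8_length is 4. Note that a code point is still not always what a user would call one character, since combining marks are separate code points.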

Quote:

Yes I understand that. I guess my question is, do I need to use wchar_t if I'll be using UTF-8 as a default? My understanding is that if I wish to add the option of using another encoding type, say for our asian friends, I might as well code the whole app using wchar_t as I'll need it for UTF-16 and others... Is this correct or should I just forget about wchar_t altogether?

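Using wchar_t will mean that each wchar_t will usually represent a single Unicode character (always, where wchar_t is 32 bits; where it is 16 bits and holds UTF-16, characters outside the Basic Multilingual Plane take two). This is convenient if you are doing any kind of string manipulation. But you can still do string manipulation in UTF-8 if you prefer.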

It is also necessary for printing and reading: standard libraries will assume a specific encoding for each char type, generally the locale's narrow encoding (often extended ASCII) for char and UTF-16 or UTF-32 for wchar_t. So if you're trying to print or read characters from standard input/output that the narrow encoding can't represent, you need wchar_t.

Quote:

So basically it's just for convenience? Again, sorry, I come from C and am used to dealing with memory and array size hassles...

How do you deal with memory and array size hassles in C? There are several ways:
1) Have a fixed max size. std::string will do away with this restriction. In this case it's not "just convenience", because by using std::string, you are adding features to your code.
2) Manually resize the array if it gets too big. The problem with this is that you are making what should be a simple task -- reading a string from wherever -- into a multi-line conglomerate. std::string is a way of hiding that conglomerate so that your code can be easy to read, but will still have the features you want. This means that your code will still say "read a string of any number of characters from the user", but it will do so in one line, not several. This isn't convenience, it's code readability.
3) You could write your own functions that will read a block of data, and return the variable size array. This is a good solution, except that it's basically what std::string does for you. This is convenience, but it also means that you know the code works, without testing.
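
For comparison with the options above, a sketch of the "one line" version using std::getline. The names read_line and first_line_of are made up for this example.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one line of any length from a stream.
// The string grows as needed; no buffer size is chosen anywhere.
// Returns an empty string if nothing could be read.
std::string read_line(std::istream& in) {
    std::string line;
    std::getline(in, line);  // the "read a string of any length" one-liner
    return line;
}

// Hypothetical helper: same thing, reading from text already in memory.
std::string first_line_of(const std::string& text) {
    std::istringstream in(text);
    return read_line(in);
}
```

Calling read_line(std::cin) reads a line from the user. The C equivalent needs a loop around fgets and realloc to handle lines longer than the initial buffer.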

Quote:

I most certainly understand that both client and server need to be speaking the same encoding but what libraries are you referring to? All you really need to communicate is a buffer and the standard library. Give me a socket, send, recv, and a buffer and I'll give you a client/server app :)
I guess my problem here is, for example, in ASCII I can count on things like "if I find a \r\n then I know the server is done sending information", which I don't know if I can do with UTF-8 encoded data (or even how to input it into a String() buffer)...

I don't know much about the details of UTF-8, so I can't tell you how to detect an appropriate terminator. Read up on UTF-8, or find a library that figures this stuff out for you. Using a variable width encoding like UTF-8 is good practice here though, because generally the overhead of sending large data segments is greater than the overhead of converting to fixed width.
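
On the terminator question: in UTF-8, every byte of a multi-byte sequence has its high bit set, so the ASCII bytes for \r and \n can never appear inside another character. Searching the raw bytes for "\r\n" is therefore as safe as it is with ASCII. A minimal sketch, assuming the bytes received so far have been appended to a std::string (for example with buffer.append(data, n) after each recv). The name extract_line is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: if `buffer` contains a complete line ending in "\r\n",
// copies the line (without the terminator) into `line`, removes it and the
// terminator from `buffer`, and returns true. Otherwise returns false and
// leaves both arguments untouched, so the caller can receive more data.
bool extract_line(std::string& buffer, std::string& line) {
    const std::string::size_type pos = buffer.find("\r\n");
    if (pos == std::string::npos) return false;
    line.assign(buffer, 0, pos);
    buffer.erase(0, pos + 2);  // drop the line plus the two terminator bytes
    return true;
}
```

This works on bytes only and never needs to know where one UTF-8 character ends and the next begins. The same does not hold for UTF-16, where the byte values 0x0D and 0x0A can occur as halves of other characters.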