I18n - how useful is wchar_t

**EVOEx** · 10-26-2008

Hello all,

I'm writing a program in C++ and, from the start, I'm keeping in mind that it might be translated to other languages. I've been looking into internationalization, only to end up a lot more confused about the topic.
Everything I read says wchar_t is the way to go. Easy enough. However, I know wchar_t is only 2 bytes on Windows (or at least in VC++). The Unicode standard says it contains 100,713 characters, which is a lot more than 2 bytes can hold (65,536 values).

So my question is, is wchar_t really useful? Are the rest of the Unicode characters simply 'dead language' characters then? Or should I use another method for I18n?

Thanks in advance,
EVOEx

**matsp** · 10-26-2008

Unicode defines a 21-bit code space (U+0000 to U+10FFFF; the original ISO 10646 design allowed for 31 bits). It also defines an 8-bit encoding (UTF-8) and a 16-bit encoding (UTF-16), in which characters that do not fit in a single 8- or 16-bit unit are written as a sequence of several units that together identify the full character code.

So a 16-bit wchar_t can hold either a complete character, or one half of a two-unit sequence (a surrogate pair) that together encodes a character outside the first 65,536 code points.
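To make the two-unit case concrete, here is a minimal sketch of how a single code point maps to UTF-16 code units. The helper name `encode_utf16` is made up for illustration, and the sketch assumes a C++11 compiler (for `char32_t` and `std::u16string`); it is not taken from any particular library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF);              // must lie inside the Unicode code space
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters

    std::u16string out;
    if (cp < 0x10000) {
        // Fits in one 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Subtract 0x10000, leaving 20 bits, and split them 10/10.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, `encode_utf16(0x41)` yields the single unit 0x0041, while `encode_utf16(0x1D11E)` (the musical G clef) yields the two units 0xD834 0xDD1E.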

Have a look at:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

--
Mats

**EVOEx** · 10-26-2008

Thank you for your reply.

I know about UTF-8/16. So, that means that string.length() is not necessarily the number of characters in the string, but rather the number of storage units (bytes or wchar_t elements) that the encoded string occupies?
How would that work with std::toupper then?

Thanks

**matsp** · 10-26-2008

Generally, the standard text functions do not work correctly with UTF-8 or UTF-16.
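As a rough illustration of the length() mismatch, the sketch below counts code points in a UTF-8 `std::string` by skipping continuation bytes (those of the form 10xxxxxx). The function name `count_code_points_utf8` is hypothetical, and the code assumes the input is already valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a string assumed to be valid UTF-8.
// Every code point starts with exactly one byte that is NOT a continuation
// byte, so counting the non-continuation bytes gives the code point count.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With `std::string s = "caf\xC3\xA9";` (the UTF-8 bytes of "café"), `s.length()` is 5 while `count_code_points_utf8(s)` is 4. Note that even a code point count is not always what a user would call "characters", since combining marks are separate code points.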

I'm pretty sure toupper is only defined for ASCII characters. But I could be wrong - although I could not find any evidence thereof.

Also, I'm not at all convinced that all languages actually have upper and lower case as such. But again, this is more speculation than knowledge.

--
Mats

**iMalc** · 10-26-2008

If you've used .NET, you'll know that there's even a UTF-7 encoding. Might take quite a few bytes to represent certain chars, but they'd have to all be representable somehow.
