Thread: I18n - how useful is wchar_t

  1. #1
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262

    I18n - how useful is wchar_t

    Hello all,

    I'm writing a program in C++ and, from the start, I'm keeping in mind that it might be translated into other languages. I've been looking into internationalization, but that has only left me more confused.
    Everything I read says wchar_t is the way to go. Easy enough. However, I know wchar_t is only 2 bytes on Windows (or at least in VC++). The Unicode standard states there are 100,713 characters, which is a lot more than 2 bytes can hold (65,536 values at most).

    So my question is, is wchar_t really useful? Are the rest of the Unicode characters simply 'dead language' characters then? Or should I use another method for I18n?

    Thanks in advance,
    EVOEx

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Unicode defines a code space of up to 21 bits (U+0000 to U+10FFFF); the original ISO 10646 design allowed for 31 bits. It also defines an 8-bit encoding (UTF-8) and a 16-bit encoding (UTF-16), in which any character that does not fit in a single 8- or 16-bit unit is written as a sequence of several units, with reserved lead values marking the start of such a sequence.

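    So a 16-bit wchar_t can hold either a complete character or one half of a surrogate pair: a lead unit in the range 0xD800-0xDBFF followed by a trail unit in the range 0xDC00-0xDFFF, which together encode a character outside the first 65,536.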
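
    As a minimal sketch of how a character above 0xFFFF is split across two 16-bit units in UTF-16, assuming a C++11 compiler (for char16_t and char32_t). The name encode_utf16 is made up for illustration and is not a standard function, and the sketch does no validation of its input:

    ```cpp
    #include <string>

    // Hypothetical helper: encode one code point as UTF-16 code units.
    // Assumes cp is a valid code point (at most 0x10FFFF, not a surrogate).
    std::u16string encode_utf16(char32_t cp) {
        std::u16string out;
        if (cp < 0x10000) {
            // Fits in a single 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // leaves a 20-bit value
            // Top 10 bits go into the lead (high) surrogate.
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            // Bottom 10 bits go into the trail (low) surrogate.
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        return out;
    }
    ```

    For example, U+1D11E (musical G clef) comes out as the two units 0xD834 0xDD1E, while 'A' stays a single unit.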

    Have a look at:
    http://www.cl.cam.ac.uk/~mgk25/unicode.html

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Thank you for your reply.

    I know about UTF-8/16. So that means string.length() is not necessarily the number of characters in the string, but rather the number of storage units (chars or wchar_ts) used to hold the encoded string?
    How would that work with std::toupper then?
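
    To make the length question concrete, here is a rough sketch assuming C++11 and well-formed UTF-16 input. count_code_points is a made-up name, not a library function; it simply skips the second half of every surrogate pair:

    ```cpp
    #include <cstddef>
    #include <string>

    // Hypothetical helper: count Unicode code points in a UTF-16 string.
    // Assumes the input is well formed (every trail surrogate follows a lead).
    std::size_t count_code_points(const std::u16string& s) {
        std::size_t n = 0;
        for (char16_t unit : s) {
            // Trail surrogates (0xDC00-0xDFFF) continue a pair, so skip them.
            if (unit < 0xDC00 || unit > 0xDFFF) {
                ++n;
            }
        }
        return n;
    }
    ```

    With the string u"a\U0001D11Eb", size() reports 4 code units while count_code_points reports 3 characters.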

    Thanks

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Generally, the standard text functions do not work correctly with UTF-8 or UTF-16.

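    I'm pretty sure toupper only handles ASCII letters in the default "C" locale. Other locales may extend it, but it still works on one char (or one wchar_t for towupper) at a time, so it cannot deal with multi-unit sequences. But I could be wrong, although I could not find any evidence of that.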
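
    A small sketch of what that looks like on UTF-8 data, assuming C++11 and the classic "C" locale (std::locale::classic()). ascii_upper is a made-up name for illustration:

    ```cpp
    #include <locale>
    #include <string>

    // Hypothetical helper: uppercase a string one char at a time using the
    // classic "C" locale. Only the ASCII letters a-z are changed; the bytes
    // of multi-byte UTF-8 sequences pass through untouched.
    std::string ascii_upper(std::string s) {
        const std::locale& c_locale = std::locale::classic();
        for (char& ch : s) {
            ch = std::toupper(ch, c_locale);  // the <locale> overload
        }
        return s;
    }
    ```

    Given the UTF-8 bytes for "café" ("caf\xC3\xA9"), this yields "CAF" followed by the same two bytes for the lowercase é, so the accented letter is not uppercased.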

    Also, I'm not at all convinced that all languages actually have upper and lower case as such. But again, this is more speculation than knowledge.

    --
    Mats

  5. #5
    Algorithm Dissector iMalc
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    If you've used .NET, you'll know that there's even a UTF-7 encoding. Might take quite a few bytes to represent certain chars, but they'd have to all be representable somehow.
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
