Thread: Localization questions

  1. #1
    Registered User
    Join Date
    Aug 2003
    Posts
    288

    Localization questions

    Hello,

    Most of the applications I've worked on have been targeted at English users only. Recently, I wanted to change that and add all (or most) languages.

    I read this article: http://www.codeproject.com/tips/inte...nalization.asp
    The writer only refers to ASCII and Unicode. I've looked around and found some more encodings (UTF-8, UTF-16, UTF-32), and now I'm completely lost.

    Which encoding should I use, and why? Is there a standard encoding? I know Windows XP (and possibly earlier versions, I'm not sure) uses Unicode, but I've read that XML and web applications use or prefer UTF-8.

    Also, how do people design their interfaces for different languages? Let's say I designed my interface in Chinese (which uses fewer characters per word) and then allowed the user to change the language to English. That would ruin the interface (the text wouldn't 'fit'). Would I have the interface 'stretch' itself based on the language in use?

    Thanks in advance for any help.

  2. #2
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Unicode is a character set - http://en.wikipedia.org/wiki/Unicode
    UTF-8 is an encoding of that character set to facilitate interoperability - http://en.wikipedia.org/wiki/Utf-8
    There are many different ways of encoding Unicode, depending on your requirements.

    Internally, your program would use Unicode characters.
    Externally, you could encode the text using whatever format(s) you thought would be useful. On input, you would convert, say, UTF-8 into Unicode for display purposes.
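
    A minimal sketch of the "convert UTF-8 into Unicode on input" step, assuming the program keeps text internally as a sequence of code points. The name decode_utf8 is hypothetical (it is not from any library), and std::u32string assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Throws std::invalid_argument on malformed input.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte says how long the sequence is and carries the top bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values past U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        assert(len >= 1 && len <= 4);
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

    Going the other way on output is the mirror image: take each code point and emit one to four bytes.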

    > Would I have the interface 'stretch' itself based on the language in use?
    Most font APIs allow you to measure the size of a rendered string before you actually draw it, so you can size the interface to fit the text.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Aug 2003
    Posts
    288
    Thanks for the reply.

  4. #4
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    UTF-8, UTF-16, and UTF-32 are all Unicode; they are different ways to encode the Unicode character set. The main difference is whether the basic unit of storage is 8, 16, or 32 bits: UTF-8 stores a string as a sequence of 8-bit values, while UTF-32 stores a string as a sequence of 32-bit values. Each has different advantages:

    UTF-8:
    Advantages
    • Backwards compatible with ASCII text -- that is, all characters with value 127 and below are unchanged and take only one byte to encode. A pure ASCII text is identical in UTF-8.
    • Further, byte values of 127 and below never occur within a multibyte character, so if you are, say, scanning for " as a delimiter, you'll never have a false positive (as you could in other text encodings, like Shift-JIS).
    • Popular for web applications
    • It's easy to figure out how many bytes a multibyte character takes up: the first byte of the sequence tells you.

    Disadvantages
    • Different characters require different numbers of bytes to encode. ASCII characters take only one, most other characters take two or three, and some (principally extinct languages or nonlinguistic symbols such as musical ones) take four.
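
    To make the byte counts above concrete, here is a rough sketch of a UTF-8 encoder plus the "length from the first byte" check. Both function names are made up for illustration, and the encoder assumes the caller passes a valid code point.

```cpp
#include <cassert>
#include <string>

// Number of bytes in a UTF-8 sequence, judged from its first byte alone.
// Returns 0 for a byte that cannot start a sequence (e.g. a continuation
// byte). It does not check that the rest of the sequence is valid.
int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Encode one code point as UTF-8. Assumes a valid scalar value
// (at most 0x10FFFF and not a surrogate).
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII, unchanged
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

    Every byte of a multibyte sequence has its high bit set, which is why a scan for an ASCII delimiter such as " cannot match in the middle of a character.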



    UTF-16:
    Advantages
    • Virtually all modern languages take a single 16-bit value per character; only characters outside the Basic Multilingual Plane need two (a surrogate pair).
    • This is the native character encoding for Windows NT, 2000, and XP. All WinAPI functions requiring strings can use UTF-16, so UTF-16 is very easy to use within Windows programs. The WinAPI type WCHAR is designed for UTF-16.
    • This is also the native character encoding for Java and .NET.
    • Many C++ compilers treat wchar_t (and so std::wstring, etc.) as a two-byte character type, which fits well with UTF-16. I believe there's no standard size for wchar_t, though, so this behavior can't be counted on cross-platform.

    Disadvantages
    • Not directly backwards compatible with ASCII. The letter 'A', for example, is 0x41 in ASCII; in UTF-16 it's 0x0041.
    • A document in pure ASCII text will double in size.
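
    For the rare characters that don't fit in a single 16-bit value, UTF-16 uses two values (a surrogate pair). A small sketch of the encoding step, with a hypothetical helper name and assuming a C++11 compiler for char16_t:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append one code point to a UTF-16 string.
// Assumes a valid scalar value (at most 0x10FFFF and not a surrogate).
void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, same value as the code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the remaining 20 bits across two units.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}
```

    So 'A' becomes the single unit 0x0041, while the musical G clef (U+1D11E) becomes the pair 0xD834 0xDD1E.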


    UTF-32:
    Advantages
    • Every character can be stored in exactly one 32-bit value.

    Disadvantages
    • Not directly backwards compatible with ASCII. The letter 'A', for example, is 0x41 in ASCII; in UTF-32 it's 0x00000041.
    • A document in pure ASCII text will quadruple in size.
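
    The size trade-offs between the three encodings can be checked with a few lines of code. This is only an illustrative sketch: EncodedSizes and encoded_sizes are invented names, and the input is assumed to be valid code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Storage needed for the same text in each encoding, in bytes.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Hypothetical helper: add up the per-character cost in each encoding.
EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes s{0, 0, 0};
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);
        s.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        s.utf16 += cp < 0x10000 ? 2 : 4;  // one unit, or a surrogate pair
        s.utf32 += 4;                     // always one 32-bit unit
    }
    return s;
}
```

    For U"Hello" this gives 5, 10, and 20 bytes. For two Chinese characters such as U"\u4E2D\u6587" it gives 6, 4, and 8 bytes, so UTF-8 is not always the smallest.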



    If you're programming a Windows GUI, you'll definitely want to store strings internally as UTF-16. Compiling with Unicode support means that every API function that expects a string expects a UTF-16 string.

    You may often find uses for UTF-8 as well, though -- e.g. I have a program which accesses/populates a MySQL database, which is also read by a web client. The database and the web client both use UTF-8. The GUI program internally uses UTF-16 (Windows' native encoding) and converts back and forth when accessing the database.

    The nice thing about conversion is it's always possible (assuming your data string isn't corrupt; not every possible byte sequence is a valid UTF-8 or UTF-16 string). Every Unicode character can be encoded either in UTF-8 or UTF-16.

    The WinAPI has some nice functions, MultiByteToWideChar & WideCharToMultiByte, which can convert between them, although I usually encapsulate this functionality in a wrapper class, as it's not a single-line-of-code kind of deal. Generally, you first call the function to get the length of the buffer you need, allocate the buffer, and then call it again to do the conversion.
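
    The "ask for the length, allocate, convert" pattern looks roughly like the sketch below. On Windows the two calls would be MultiByteToWideChar(CP_UTF8, 0, src, srcBytes, dst, dstUnits), first with a destination size of 0 and then with the real buffer. So that the example runs anywhere, utf8_to_utf16_raw here is a hypothetical portable stand-in that only copies that calling convention; it is not the WinAPI function, and its validation is deliberately minimal.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in mimicking the MultiByteToWideChar convention:
// with dstUnits == 0 it only reports how many 16-bit units are needed,
// otherwise it writes them. Returns 0 on failure. No overlong or
// surrogate checks are done here.
int utf8_to_utf16_raw(const char* src, int srcBytes, char16_t* dst, int dstUnits) {
    int needed = 0;
    for (int i = 0; i < srcBytes; ) {
        unsigned char lead = static_cast<unsigned char>(src[i]);
        int len = lead < 0x80 ? 1
                : (lead & 0xE0) == 0xC0 ? 2
                : (lead & 0xF0) == 0xE0 ? 3
                : (lead & 0xF8) == 0xF0 ? 4 : 0;
        if (len == 0 || i + len > srcBytes) return 0;
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (int k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(src[i + k]);
            if ((c & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF) return 0;
        int units = cp < 0x10000 ? 1 : 2;
        if (dstUnits != 0) {
            if (needed + units > dstUnits) return 0;  // buffer too small
            if (units == 1) {
                dst[needed] = static_cast<char16_t>(cp);
            } else {
                dst[needed]     = static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
                dst[needed + 1] = static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
            }
        }
        needed += units;
        i += len;
    }
    return needed;
}

// The wrapper: call once for the length, allocate, call again to convert.
std::u16string to_utf16(const std::string& utf8) {
    if (utf8.empty()) return std::u16string();
    const int srcBytes = static_cast<int>(utf8.size());
    int needed = utf8_to_utf16_raw(utf8.data(), srcBytes, nullptr, 0);
    if (needed == 0) throw std::runtime_error("invalid UTF-8 input");
    std::u16string result(static_cast<std::size_t>(needed), u'\0');
    int written = utf8_to_utf16_raw(utf8.data(), srcBytes, &result[0], needed);
    assert(written == needed);
    (void)written;  // silence unused-variable warnings when NDEBUG is set
    return result;
}
```

    Hiding the two calls behind one function that returns a string is the main point of the wrapper; callers never see the intermediate buffer.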



    As a general rule, I use UTF-16 almost exclusively. I don't worry too much about text files being doubled in size, or at least not worried enough to think it's worth the effort to convert to another encoding. Working in the native WinXP encoding and the encoding that all WinAPI functions are using is pretty ideal for me.

    I write every application exclusively in Unicode -- I applaud you if you're doing the same. Nothing is more frustrating than having data with text, filenames, etc. in a different language and trying a plethora of programs until you get one that won't barf, or having to mess around with the system code pages just to get a program to work or a file to open.
    Last edited by Cat; 08-20-2006 at 11:37 PM.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  5. #5
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Oh, and if you are using Windows, they use some confusing terminology:

    'Wide character' or 'Unicode character' always refers to a UTF-16 value. Note: Windows NT 4.0 supports only a subset of UTF-16 called UCS-2; such a character encoding is not "real" Unicode in that it cannot encode all possible Unicode characters, but unless you're typing in dead languages it's unlikely to affect you. Windows 2000/XP and beyond use true UTF-16. Wide character strings are arrays of type WCHAR.

    'Multibyte character' refers to any other encoding, even UTF-7, UTF-8, or UTF-32/UCS-4 (which are true Unicode as well). Multibyte character arrays are arrays of type CHAR.
    Last edited by Cat; 08-21-2006 at 12:01 AM.

  6. #6
    Cat without Hat CornedBee
    Join Date
    Apr 2003
    Posts
    8,895
    Note that wchar_t is a 32-bit type in all incarnations of GCC except for the Windows ports.
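
    A small sketch of how code can avoid depending on the width of wchar_t, assuming a C++11 compiler. The constant and function names are hypothetical; char16_t and char32_t are the standard fixed-purpose types.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t, in bits, on whichever compiler builds this file.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// char16_t and char32_t are guaranteed wide enough for UTF-16 and UTF-32
// code units, so they are a safer choice for cross-platform code.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds a UTF-32 code unit");

// Hypothetical check: can a single wchar_t hold any Unicode code point?
// Code points go up to U+10FFFF, which needs 21 bits.
constexpr bool wchar_holds_any_code_point() {
    return wchar_bits >= 21;
}
```

    Code that really needs 16-bit wchar_t (for example, to pass buffers straight to the WinAPI) could add its own static_assert on wchar_bits so a port to another compiler fails at build time rather than at run time.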
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #7
    Registered User
    Join Date
    Aug 2003
    Posts
    288
    Sorry for the delayed reply; I had it set to notify me by email, but for some reason it skipped the last 3 posts.
    Anyway, I just wanted to thank you, Cat, for the very detailed explanation. It was exactly what I needed to know.

  8. #8
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    I'm happy to help. For how often I've been frustrated by having to deal with different character sets (which usually requires 2 reboots of the computer to get the results I want) I'm very glad for a chance to help others learn to use Unicode. If you have any further questions please feel free to ask.

    I, for one, will be happy when Unicode is universally used and all those thousands of different code pages are but a distant memory.

  9. #9
    System Novice siavoshkc
    Join Date
    Jan 2006
    Location
    Tehran
    Posts
    1,246
    Yeah, thank you, Cat. I wanted to know about this stuff too.
    Siavosh K C
