Localization questions

**X PaYnE X** · 08-20-2006

Hello,

Most of the applications I've worked on have been targeted at English users only. Recently, I wanted to change that and add all (or most) languages.

I read this article: http://www.codeproject.com/tips/inte...nalization.asp
The writer only refers to ASCII and Unicode. I've looked around and found some more encodings (UTF-8/16/32), and now I'm completely lost.

Which encoding should I use, and why? Is there a standard encoding? I know Windows XP (and possibly earlier, not sure) uses Unicode. But I've read that XML and web applications use or prefer UTF-8.

Also, how do people design their interfaces for different languages? Let's say I designed my interface in Chinese (which uses fewer characters per word) and then allowed the user to change the language to English. That would ruin the interface (the text wouldn't 'fit'). Would I have the interface 'stretch' itself based on the language in use?

Thanks in advance for any help.

**Salem** · 08-20-2006

Unicode is a character set - http://en.wikipedia.org/wiki/Unicode
UTF-8 is an encoding of that character set to facilitate interoperability - http://en.wikipedia.org/wiki/Utf-8
There are many different ways of encoding Unicode, depending on your requirements.

Internally, your program would use Unicode characters.
Externally, you could encode the text using whatever format(s) you thought would be useful. On input, you would convert, say, UTF-8 into your internal Unicode representation for display purposes.

> Would I have the interface 'stretch' itself based on the language in use?
Most font APIs allow you to measure the size of the rendered string before you actually do it.

**X PaYnE X** · 08-20-2006

Thanks for the reply.

**Cat** · 08-20-2006

UTF-8, UTF-16, and UTF-32 are all Unicode; they are different ways to encode the Unicode character set. The main difference is whether the basic unit of storage is 8, 16, or 32 bits: UTF-8 stores a string as a sequence of 8-bit values, while UTF-32 stores it as a sequence of 32-bit values. They each have different advantages:

UTF-8:
Advantages

Backwards compatible with ASCII text (the 7-bit range that the "ANSI" code pages share) -- that is, all characters with value 127 and below are unchanged. They take only one byte to encode. A pure ASCII text is identical in UTF-8.
Further, byte values below 128 never occur within a multibyte character, so if you are, say, scanning for " as a delimiter, you'll never have a false positive (as you could in other text encodings, like Shift-JIS).
Popular for web applications
It's easy to figure out how many bytes a multibyte character takes up.

Disadvantages

Different characters require different numbers of bytes to encode. ASCII characters take only one, most other characters take two or three, and some (principally extinct languages or nonlinguistic symbols such as musical ones) take four.
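
To illustrate the "easy to figure out how many bytes" point, here is a minimal sketch of how the first (lead) byte alone gives the length of a UTF-8 sequence, and how continuation bytes (always of the form 10xxxxxx) can be skipped when counting characters. The function names are made up for illustration, and the code looks only at the lead byte; it does not check that the whole sequence is valid.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (continuation or invalid byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                  // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // 110xxxxx (0xC0/0xC1 would be overlong)
    if (lead >= 0xE0 && lead <= 0xEF) return 3; // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // 11110xxx, up to U+10FFFF
    return 0;
}

// Counts code points by counting every byte that is NOT a continuation byte.
// Assumes the input is already well-formed UTF-8.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For example, the string "A", "é", "€" stored as the bytes 41 C3 A9 E2 82 AC is six bytes long but contains three characters.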

UTF-16:
Advantages

Virtually all modern languages will take a single 16-bit value per character; only characters outside the Basic Multilingual Plane need two (a surrogate pair).
This is the native character encoding for Windows NT, 2000, and XP. All WinAPI functions requiring strings can use UTF-16, so UTF-16 is very easy to use within Windows programs. The WinAPI type WCHAR is designed for UTF-16.
This is also the native character encoding for Java and .NET.
Many C++ compilers implement wchar_t, std::wstring, etc. with two-byte characters, which fits well with UTF-16. The C++ standard doesn't fix the size of wchar_t, though, so this behavior can't be counted on cross-platform.

Disadvantages

Not directly backwards compatible with ASCII. The letter 'A', for example, is the single byte 0x41 in ASCII; in UTF-16 it's the 16-bit value 0x0041.
A document in pure ASCII text will double in size.
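
As a rough sketch of what "a single 16-bit value per character" means in practice, the function below encodes one Unicode code point as UTF-16. Code points up to U+FFFF fit in one 16-bit unit; the rest (such as the musical symbols mentioned earlier) are split into a surrogate pair. The name `encode_utf16` and the choice to return an empty vector for invalid input are assumptions made for this example.

```cpp
#include <cstdint>
#include <vector>

// Encodes a single code point as one or two UTF-16 code units.
// Returns an empty vector if `cp` is not a valid Unicode scalar value.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return units; // out of range, or a lone surrogate value
    }
    if (cp < 0x10000) {
        // Basic Multilingual Plane: the code point is the code unit.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two 10-bit halves.
        const std::uint32_t offset = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (offset >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    }
    return units;
}
```

With this sketch, 'A' (U+0041) becomes the single unit 0x0041, while the musical G clef (U+1D11E) becomes the pair 0xD834 0xDD1E.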

UTF-32:
Advantages

Every character can be stored in exactly one 32-bit value.

Disadvantages

Not directly backwards compatible with ASCII. The letter 'A', for example, is the single byte 0x41 in ASCII; in UTF-32 it's the 32-bit value 0x00000041.
A document in pure ASCII text will quadruple in size.

If you're programming for a Windows GUI, you'll definitely want to store strings internally as UTF-16. Compiling with Unicode support (the UNICODE and _UNICODE macros defined) means that every API function that expects a string now expects a UTF-16 string.

You may often find uses for UTF-8 as well, though -- e.g. I have a program which accesses/populates a MySQL database, which is also read by a web client. The database and the web client both use UTF-8. The GUI program internally uses UTF-16 (Windows' native encoding) and converts back and forth when accessing the database.

The nice thing about conversion is it's always possible (assuming your data string isn't corrupt; not every possible byte sequence is a valid UTF-8 or UTF-16 string). Every Unicode character can be encoded either in UTF-8 or UTF-16.
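
To make the "not every possible byte sequence is valid" caveat concrete, here is a minimal, portable sketch of a strict UTF-8 decoder. It rejects truncated sequences, stray continuation bytes, overlong forms, surrogate values, and anything above U+10FFFF. The names `decode_utf8` and `is_valid_utf8` are invented for this example; this is one possible way to do the check, not the way any particular library does it.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Decodes UTF-8 bytes into code points; throws std::invalid_argument on bad input.
std::vector<std::uint32_t> decode_utf8(const std::string& in) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::uint32_t min = 0;  // smallest value this sequence length may encode
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else throw std::invalid_argument("invalid lead byte");

        if (in.size() - i - 1 < extra) throw std::invalid_argument("truncated sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) throw std::invalid_argument("overlong encoding");
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Convenience wrapper: true if the whole string decodes cleanly.
bool is_valid_utf8(const std::string& in) {
    try {
        decode_utf8(in);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

Once the text is a sequence of code points, re-encoding it as UTF-16 or UTF-32 cannot fail, which is why a valid string always converts.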

The WinAPI has some nice functions, MultiByteToWideChar & WideCharToMultiByte, which can convert between them, although I usually encapsulate this functionality in a wrapper class as it's not just a single line of code kind of deal. Generally, you first use the function to get the length of the buffer you need, allocate the buffer, and then call it again to do the conversion.
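
The measure, allocate, convert pattern described above might look roughly like the sketch below. It is Windows-only (hence the `_WIN32` guard), the helper names `Utf8ToWide` and `WideToUtf8` are made up for illustration, and throwing `std::runtime_error` on failure is just one possible error-handling choice. Note that `MB_ERR_INVALID_CHARS` with `CP_UTF8` may not be accepted on older Windows versions; passing 0 instead is the fallback.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <stdexcept>
#include <string>

// UTF-8 -> UTF-16: first call measures, second call converts.
std::wstring Utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int srcLen = static_cast<int>(utf8.size());
    // Pass 1: null output buffer and size 0 returns the required length in WCHARs.
    const int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), srcLen, nullptr, 0);
    if (needed == 0) throw std::runtime_error("UTF-8 to UTF-16 conversion failed");
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    // Pass 2: convert into the buffer we just allocated.
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), srcLen, &wide[0], needed);
    if (written != needed) throw std::runtime_error("UTF-8 to UTF-16 conversion failed");
    return wide;
}

// UTF-16 -> UTF-8: same two-pass pattern with WideCharToMultiByte.
std::string WideToUtf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
    const int srcLen = static_cast<int>(wide.size());
    // For CP_UTF8 the last two parameters must be null.
    const int needed = WideCharToMultiByte(CP_UTF8, 0, wide.data(), srcLen,
                                           nullptr, 0, nullptr, nullptr);
    if (needed == 0) throw std::runtime_error("UTF-16 to UTF-8 conversion failed");
    std::string utf8(static_cast<std::size_t>(needed), '\0');
    const int written = WideCharToMultiByte(CP_UTF8, 0, wide.data(), srcLen,
                                            &utf8[0], needed, nullptr, nullptr);
    if (written != needed) throw std::runtime_error("UTF-16 to UTF-8 conversion failed");
    return utf8;
}
#endif // _WIN32
```

Because an explicit source length is passed rather than -1, the returned sizes do not include a terminating null, which is what `std::wstring` and `std::string` expect.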

As a general rule, I use UTF-16 almost exclusively. I don't worry too much about text files being doubled in size, or at least not worried enough to think it's worth the effort to convert to another encoding. Working in the native WinXP encoding and the encoding that all WinAPI functions are using is pretty ideal for me.

I write every application exclusively in Unicode -- I applaud you if you're doing the same. Nothing is more frustrating than having data with text, filenames, etc. in a different language and trying a plethora of programs until you get one that won't barf, or having to mess around with the system code pages just to get a program to work or a file to open.

**Cat** · 08-20-2006

Oh, and if you are using Windows, they use some confusing terminology:

'Wide character' or 'Unicode character' always refers to a UTF-16 value. Note: Windows NT 4.0 supports only a subset of UTF-16 called UCS-2; such a character encoding is not "real" Unicode in that it cannot encode all possible Unicode characters, but unless you're typing in dead languages it's unlikely to affect you. Windows 2000/XP and beyond use true UTF-16. Wide character strings are arrays of type WCHAR.

'Multibyte character' refers to any other encoding, even UTF-7, UTF-8, or UTF-32/UCS-4 (which are true Unicode as well). Multibyte character arrays are arrays of type CHAR.

**CornedBee** · 08-21-2006

Note that wchar_t is a 32-bit type in all incarnations of GCC except for the Windows ports.
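
Since the width differs between compilers, portable code can check it at compile time instead of assuming it. A small sketch (the constant names are invented for this example):

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t on this compiler, in bits (16 on Windows, 32 with GCC on Linux).
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// True when one wchar_t can hold any Unicode code point (up to U+10FFFF),
// i.e. when a wide string can be treated as UTF-32 rather than UTF-16.
constexpr bool wchar_holds_any_code_point = WCHAR_MAX >= 0x10FFFF;
```

Code that really depends on a 16-bit wchar_t could add `static_assert(wchar_bits == 16, "...")` so that it fails to compile elsewhere instead of misbehaving.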

**X PaYnE X** · 08-23-2006

Sorry for the delayed reply, I had it set to notify me by email but for some reason it skipped the last 3 posts.
Anyway, I just wanted to thank you Cat for the very detailed explanation. It was exactly what I needed to know.

**Cat** · 08-23-2006

I'm happy to help. For how often I've been frustrated by having to deal with different character sets (which usually requires 2 reboots of the computer to get the results I want) I'm very glad for a chance to help others learn to use Unicode. If you have any further questions please feel free to ask.

I, for one, will be happy when Unicode is universally used and all those thousands of different code pages are but a distant memory.

**siavoshkc** · 08-23-2006

Yeah, thank you Cat. I wanted to know about this stuff too.
