Unicode - a lot of confusion...

**Jumper** · 06-29-2004

i posted a similar question on another forum but still have no answer there, so i'm trying here.

pagecode, threads, text-based controls, local(global), wide-character, multi-byte characters

can anyone please explain how these terms depend on each other?

my current understanding is that wide-character/multi-byte characters are schemes for storing character codes, and a pagecode maps a specific code to a character glyph (font entry)...

i have a lot of confusion on the subject. i know how to make my program unicode-aware, but actually getting it all to work together is pretty hard...
i searched and read a lot of info on the unicode subject, but no source actually explains how the unicode entities are interconnected...

here are just a few questions i have:
----------------------------------------

How are local and pagecode related to each other?

--------------------------------------------------------------------------

say, i have a text-based control and the documentation says that it uses a local to output text OR it uses some pagecode for symbol translation...
so if i want to print each line of text in a different language, how is that possible? a pagecode can represent only 3 languages simultaneously and a local only 2....

----------------------------------------------------------------------------

i also read that i can set up a different local for each thread??
are threads locale-aware? and what is that good for?

-----------------------------------------------------------------------------

another,
as i see it, a win32 unicode app can treat symbols as wide characters (two-byte characters) or as multi-byte characters...
i know what a multi-byte character encoding is (UTF-8, UTF-7), but does the above case refer to how characters are stored in memory?

------------------------------------------------------------------------

more about pagecodes and local,
which win32 objects depend on pagecode and/or local, and does this mean that these objects are limited to handling only 3 or 2 types of glyphs (languages) simultaneously?

------------------------------------------------------------------------

for example:
i want to write a program that reads a unicode file, say UTF-8, with a few lines in different languages.
now, i know that there are routines which are local based and unicode based.
as i see it, if i use a locale-aware routine then the file won't be read properly, because the local translates character codes according to some pagecode (global?) and they will be mapped to the wrong glyphs... so i must use general unicode i/o functions...
so what are local dependent functions good for?

------------------------------------------------------------------------------

i wrote a lot and maybe some questions are not formulated well, but i hope someone can put things in their place

thanks

**anonytmouse** · 06-29-2004

Here goes...

Originally Posted by Jumper

i posted a similar question on another forum but still have no answer there, so i'm trying here.

pagecode, threads, text-based controls, local(global), wide-character, multi-byte characters

can anyone please explain how these terms depend on each other?

OK, first things first: a pagecode is actually a codepage and a local is actually a locale.

my current understanding is that wide-character/multi-byte characters are schemes for storing character codes, and a pagecode maps a specific code to a character glyph (font entry)...

Yes, that is pretty much correct. Consider the value 156. Using the 1252 (ansi Latin 1) codepage this value may map to one character while using the 50936 (simplified Chinese) codepage it will likely map to a totally different character.

Unicode aims to be a universal character set that contains characters for every script. Therefore it is not dependent on codepages. The value 156 maps to the same character whether your computer is set up for Chinese or Portuguese.
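To make the codepage point concrete, here is a small sketch in plain C. It is not a Windows API: decode_byte is a made-up helper that only tabulates byte 156 (0x9C) for two single-byte code pages, 1252 (Western European) and 1251 (Cyrillic), and returns the unicode value each one assigns to it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for illustration only: returns the unicode code
   point that a single byte stands for in the given Windows code page.
   Only byte 0x9C (156) is tabulated above the ASCII range. */
uint32_t decode_byte(unsigned codepage, uint8_t b)
{
    if (b < 0x80)
        return b;                     /* ASCII range is the same in both */

    if (b == 0x9C) {
        switch (codepage) {
        case 1252: return 0x0153;     /* LATIN SMALL LIGATURE OE */
        case 1251: return 0x045A;     /* CYRILLIC SMALL LETTER NJE */
        }
    }
    return 0xFFFD;                    /* REPLACEMENT CHARACTER: not tabulated */
}

/* Usage check: one byte value, two different characters. */
void decode_byte_demo(void)
{
    assert(decode_byte(1252, 156) != decode_byte(1251, 156));
}
```

Once the text is held as unicode values (0x0153 or 0x045A here), the meaning no longer depends on which code page the machine is configured for.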

A locale holds various language and format settings. For example, suppose we want today's date in short format.

If we pass a US locale to GetDateFormat() we will get something like:
6-30-04
However, if we pass a UK locale to GetDateFormat() we will get something like:
30-6-04

You can see the Locale Information page for more details on what a windows locale can control.

A locale may also hold details on the currently used codepage. As well as Windows locales there are also C locales.
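As a toy illustration of "a locale is a bundle of format settings", here is a sketch in plain C. It does not call GetDateFormat(); struct mini_locale and format_short_date are invented names, and the only setting modelled is the day/month order used in the date example above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, heavily simplified "locale": only the short-date order. */
enum date_order { ORDER_MDY, ORDER_DMY };

struct mini_locale {
    const char     *name;    /* label only, e.g. "en-US" */
    enum date_order order;
};

/* Writes a short date such as "6-30-04" into out.
   Returns what snprintf returns: the length the text needs. */
int format_short_date(const struct mini_locale *loc,
                      int day, int month, int year,
                      char *out, size_t out_size)
{
    assert(loc != NULL && out != NULL);

    int first  = (loc->order == ORDER_MDY) ? month : day;
    int second = (loc->order == ORDER_MDY) ? day   : month;

    return snprintf(out, out_size, "%d-%d-%02d", first, second, year % 100);
}
```

The same date (30 June 2004) passed with an ORDER_MDY locale and an ORDER_DMY locale gives the two strings shown above. The data is identical; only the locale setting differs.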

say, i have a text-based control and the documentation says that it uses a local to output text OR it uses some pagecode for symbol translation...
so if i want to print each line of text in a different language, how is that possible?
...
which win32 objects depend on pagecode and/or local, and does this mean that these objects are limited to handling only 3 or 2 types of glyphs (languages) simultaneously?

As mentioned previously, unicode can handle characters for every language (although you may not have appropriate fonts installed). If not using unicode, you can only use characters available in the current code page.

for example:
i want to write a program that reads a unicode file, say UTF-8, with a few lines in different languages.
now, i know that there are routines which are local based and unicode based.
as i see it, if i use a locale-aware routine then the file won't be read properly, because the local translates character codes according to some pagecode (global?) and they will be mapped to the wrong glyphs... so i must use general unicode i/o functions...
so what are local dependent functions good for?

You're making this way too complex.
- Read the UTF-8 into a LPSTR.
- Convert to unicode using MultiByteToWideChar(CP_UTF8, ...);
- Use the resulting unicode string.

Sample code to add a UTF8 string to a list box.

Code:

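void AddUTF8ToListBox(HWND hwndListBox, LPCSTR szUTF8)
{
	WCHAR szW[1024];

	/* -1 means szUTF8 is null-terminated. The call returns 0 on failure,
	   for example when the converted string does not fit in szW. */
	if (0 != MultiByteToWideChar(CP_UTF8, 0, szUTF8, -1, szW, 1024))
	{
		/* Call the W version explicitly so the list box receives unicode. */
		SendMessageW(hwndListBox, LB_ADDSTRING, 0, (LPARAM) szW);
	}
}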
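For anyone wondering what "multi-byte" versus "wide" means in memory, here is a rough portable sketch of the kind of work a UTF-8 to UTF-16 conversion does. It is not how MultiByteToWideChar is implemented; utf8_to_utf16 is a made-up helper, and it is deliberately minimal (it does not reject overlong encodings, and an empty input is indistinguishable from a failure).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Windows API: converts a NUL-terminated UTF-8
   string into UTF-16 code units. Returns the number of units written,
   excluding the terminator, or 0 on malformed input or a full buffer. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    if (dst_cap == 0)
        return 0;

    while (*s) {
        uint32_t cp;
        int extra;                        /* continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return 0;                    /* invalid lead byte */
        s++;

        for (; extra > 0; extra--, s++) {
            if ((*s & 0xC0) != 0x80)
                return 0;                 /* truncated or broken sequence */
            cp = (cp << 6) | (*s & 0x3F);
        }

        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;                     /* not a valid character value */

        if (cp < 0x10000) {               /* one 16-bit unit */
            if (n + 1 >= dst_cap) return 0;
            dst[n++] = (uint16_t)cp;
        } else {                          /* surrogate pair: two units */
            if (n + 2 >= dst_cap) return 0;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    dst[n] = 0;
    return n;
}
```

In the multi-byte form a character takes one to four bytes; in the wide form it takes one 16-bit unit, or two for characters beyond U+FFFF.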

Note: Windows 95/98/ME have only very limited unicode support (most W functions are not implemented there), so on those platforms you are stuck with using characters from the current code page or having to use owner-drawn controls.

**Jumper** · 06-30-2004

ok, thanks

i think that clears things up

one more question i have is:

how many codepages can a process (app) have?
is there something called a global codepage which applies across all code and all threads?

**anonytmouse** · 06-30-2004

>> how many codepages can a process (app) have? <<

I'm not sure what you mean here. Could you elaborate?

>> is there something called a global codepage which applies across all code and all threads? <<

Yes, this is sometimes called the ansi code page. For example when you call:

Code:

SendMessageA(hwndListBox, LB_ADDSTRING, 0, (LPARAM) "my_string");

windows converts "my_string" to unicode using the default ansi code page. I'm not sure if you can change this code page, but you would probably not want to.

For applications that must support input in several different code pages (also called char sets), such as a web browser or email client, the input should be read in and then converted to unicode using MultiByteToWideChar().

If an application needs to provide non-unicode output it should use WideCharToMultiByte(). Either way, a modern multi-lingual windows application should do all internal work in unicode and only convert, if needed, on input and output.

For example, with a page that specifies a character set of:

Code:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

a browser could use:

Code:

MultiByteToWideChar(51932, ...);
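The output direction is where a code page's limits show up. As a rough sketch in plain C (not WideCharToMultiByte itself; utf16_to_latin1 is a made-up helper), here is a conversion from UTF-16 to ISO 8859-1, which Windows lists as code page 28591. Anything outside Latin-1 cannot be represented and is replaced with '?'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Windows API: narrows NUL-terminated UTF-16
   text to ISO 8859-1. Characters outside Latin-1 become '?'.
   Returns the number of bytes written, excluding the terminator.
   Output is silently truncated if dst is too small. */
size_t utf16_to_latin1(const uint16_t *src, char *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL && dst_cap > 0);
    size_t n = 0;

    while (*src && n + 1 < dst_cap) {
        uint16_t u = *src++;

        /* A surrogate pair is one character, so it yields a single '?'. */
        if (u >= 0xD800 && u <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF)
            src++;

        /* Latin-1 bytes have the same values as U+0000..U+00FF. */
        dst[n++] = (u <= 0xFF) ? (char)(unsigned char)u : '?';
    }
    dst[n] = '\0';
    return n;
}
```

This is why the conversion belongs only at the output boundary: it is lossy, so the unicode original should be what the program keeps and works on.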

If you could elaborate on what you're trying to do, possibly we could provide further help.

**Jumper** · 06-30-2004

ok...
another...

[1]
is it true that if i compile a unicode-aware app and run it, the system will load, say, ISO 10646 (aka Unicode UCS-2 Little-Endian) as the default global codepage that will be used by all controls of my app?

[2]
and another one related to codepages and fonts...

suppose the system loads ISO 10646 as a codepage for my app and i use some font, say in a textbox, to write text. as i see it, i can't use just any font i like, right?
(because not every font contains every character defined by the above codepage)

[3]
is there also a distinction between input codepage and output codepage?

**anonytmouse** · 07-01-2004

>> is it true that if i compile a unicode-aware app and run it, the system will load, say, ISO 10646 (aka Unicode UCS-2 Little-Endian) as the default global codepage that will be used by all controls of my app? <<

No, not really. Nearly all functions that accept strings are split into two versions: an A version and a W version. The W version accepts unicode strings and the A version accepts strings in the default ansi code page. The unadorned version maps to the A or W version depending on whether the UNICODE macro is defined.

Code:

#ifdef UNICODE
#define SetConsoleTitle SetConsoleTitleW
#else
#define SetConsoleTitle SetConsoleTitleA
#endif

Therefore:

Code:

SetConsoleTitle(TEXT("app"));

// becomes if UNICODE is defined:
SetConsoleTitleW(L"app");

// otherwise becomes
SetConsoleTitleA("app");

The typical implementation of an A function is to convert string arguments to unicode using the default code page (CP_ACP) and call the W version (and convert back after return if needed).
Therefore, if UNICODE is defined, you will be passing unicode strings to your controls because you will be implicitly using functions like SetWindowTextW and SendMessageW. However, the default ansi code page will not have changed.
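The A-calls-W pattern described above can be sketched in portable C. This is only an analogy, not Windows source: set_title_w, set_title_a and get_title are made-up names, and the standard mbstowcs (which converts using the current C locale) stands in for a conversion with CP_ACP.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define TITLE_CAP 64

static wchar_t g_title[TITLE_CAP];    /* stand-in for state owned by the system */

/* "W" version: does the real work, on wide strings only. */
int set_title_w(const wchar_t *title)
{
    assert(title != NULL);
    if (wcslen(title) >= TITLE_CAP)
        return 0;                     /* too long for this sketch */
    wcscpy(g_title, title);
    return 1;
}

/* "A" version: widen the argument, then forward to the W version. */
int set_title_a(const char *title)
{
    wchar_t wide[TITLE_CAP];
    size_t n;

    assert(title != NULL);
    n = mbstowcs(wide, title, TITLE_CAP);
    if (n == (size_t)-1 || n >= TITLE_CAP)
        return 0;                     /* invalid sequence or no room for NUL */
    return set_title_w(wide);
}

const wchar_t *get_title(void)
{
    return g_title;
}
```

Whichever entry point the caller uses, the stored text is wide. The narrow version only decides how the incoming bytes are interpreted, which is the one place the default code page matters.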

>> suppose the system loads ISO 10646 as a codepage for my app and i use some font, say in a textbox, to write text. as i see it, i can't use just any font i like, right?
(because not every font contains every character defined by the above codepage) <<

I'm not sure. There are certainly methods to create a combined font if a single one is not enough; whether the windows controls use this, I am not sure.

You may need specialist help. I'd suggest microsoft.public.win32.programmer.international.

**Jumper** · 07-01-2004

>> is it true that if i compile a unicode-aware app and run it, the system will load, say, ISO 10646 (aka Unicode UCS-2 Little-Endian) as the default global codepage that will be used by all controls of my app? <<

sorry, let me rephrase the question:
if i compile a unicode app, does this mean that the system will load a codepage (like ISO 10646) for my app?
(if my app continues to use the ansi codepage then it is only possible to support 2 languages)

**anonytmouse** · 07-01-2004

I'm not sure what you are saying. Unicode can hold characters for all languages. It is code page independent.

**Jumper** · 07-02-2004

you said before that the system loads the ansi codepage for an app. if my app is a unicode app then what codepage will the system load (i thought it must be something like ISO 10646)?

i hope it is clearer...

**anonytmouse** · 07-02-2004

We seem to be going around in circles.

>> if my app is a unicode app then what codepage will the system load (i thought it must be something like ISO 10646)? <<

No. The default ansi code page will not change whatever app you load. There is no hard distinction between a unicode app and a non-unicode app. The unicode app simply uses the W functions while a non-unicode app uses the A functions. Some apps use a mixture.

Again, for a program that deals only in unicode, the ansi code page is irrelevant.

this site may be helpful:
http://www.microsoft.com/globaldev/DrIntl/default.mspx

**Jumper** · 07-02-2004

ok
thanks for the link...

i'm experimenting now with console programming and i have a question related to this thread's subject.
i use the ReadConsoleInput() function to read keyboard events.

KEY_EVENT_RECORD structure contains two fields: wVirtualKeyCode and wVirtualScanCode.
documentation explains these fields:

wVirtualKeyCode
Virtual-key code that identifies the given key in a device-independent manner.

wVirtualScanCode
Virtual scan code of the given key that represents the device-dependent value generated by the keyboard hardware

can you please clarify these explanations? (what is the meaning of "device-independent manner"?)

[EDIT]
i looked at this site and there are no explanations there for the kind of things i don't know

**Jumper** · 07-05-2004

anonytmouse, thanks for the link you gave me...
now i'm reading an article which has turned out to be very useful
here is the link to this article in case someone else is also struggling with unicode

http://www.microsoft.com/globaldev/g...s/wrguide.mspx
