Thread: wchar_t type

  1. #1
    Disturbed Boy gustavosserra's Avatar
    Join Date
    Apr 2003
    Posts
    244

    wchar_t type

    I was looking for new things to learn when I found this type wchar_t. The purpose of this type, according to what I have read, is to store Unicode. But the size of wchar_t is 2, and as far as I know, 4 bytes are needed to store a Unicode character. And one more detail: when I print a wchar_t with cout (of course) I get a number, not a char.
    My questions are:
    1) Is this wchar_t correct?
    2) Is wchar_t ANSI? If not, what ANSI type can I rely on to store Unicode?
    3) I think that wchar_t is a typedef for another basic type, but I do not know any basic type that has 2 bytes. Does anyone know any type that has 2 bytes on a Duron (running Windows XP)?
    Thanks in advance!!!
    Nothing more to tell about me...
    Happy day =)

  2. #2
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    >But the size of wchar_t is 2
    It doesn't have to be 2 bytes any more than char has to be 8 bits.

    >1) Is this wchar_t correct?
    That would be a safe assumption. I've found no use for it yet, but I have little doubt the implementation is correct. If I didn't trust my compiler or the standard, I'd have a lot of trouble being a programmer. Trust the tools you use or you'll be doomed to rewrite them poorly.

    >2) Is wchar_t ANSI?
    Yes.

    >3) I think that wchar_t is a typedef for another basic type, but I do not know any basic type that has 2 bytes.
    short is usually 2 bytes, but it doesn't have to be. The internal representation is irrelevant anyway; trust that it's big enough for your purposes.
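    A quick check you can run yourself (the exact numbers are implementation-defined; the standard only guarantees wchar_t is at least as wide as char):
    Code:
    #include <iostream>

    int main()
    {
        // Typical output on a 32-bit Windows compiler of this era is 2 and 2,
        // but other platforms (e.g. gcc on Linux) commonly report 4 for wchar_t.
        std::cout << "sizeof(short)   = " << sizeof(short) << '\n';
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
        return 0;
    }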
    My best code is written with the delete key.

  3. #3
    S Sang-drax's Avatar
    Join Date
    May 2002
    Location
    Göteborg, Sweden
    Posts
    2,072
    cout doesn't handle wchar_t; that is why you are printing a number instead of a character. The type of cout is basic_ostream<char, ...>, so you will have to use wcout, which is a basic_ostream<wchar_t, ...> (also known as wostream).
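    For example (a minimal sketch; with older compilers, cout << wc picks the int overload after promotion, which is why you see a number):
    Code:
    #include <iostream>

    int main()
    {
        wchar_t wc = L'A';
        std::cout  << wc << std::endl;   // char stream: prints the numeric value (65)
        std::wcout << wc << std::endl;   // wide stream: prints the character A
        return 0;
    }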
    Last edited by Sang-drax; 11-02-2003 at 02:48 PM.

  4. #4
    Registered User
    Join Date
    May 2003
    Posts
    1,619

    Re: wchar_t type

    Originally posted by gustavosserra
    as far as I know, 4 bytes are needed to store a Unicode character.
    Let's make a distinction. Unicode is a character set. There are different ENCODINGS of this character set (different ways to map code points <-> glyphs).

    A glyph is a unit of text. For example, each ASCII character is a glyph. Some complicated characters may be a combination of several glyphs (for example, a u with an umlaut over it could be stored as the glyph for u followed by the glyph for the umlaut, and the program would be expected to display the two glyphs as a single character). Conversely, there are cases where multiple characters could be stored as one glyph. Some glyphs aren't visible but control how text is displayed (e.g. whether it should run right-to-left or left-to-right). Further, the same character can be created with different glyphs -- e.g. you could use the two glyphs for "u with umlaut" as mentioned above, but there is also a single glyph for "u with umlaut"; same character, different glyphs.
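    A small sketch of that last point, assuming a compiler that accepts \u universal character names and a wchar_t of at least 16 bits:
    Code:
    #include <cwchar>
    #include <iostream>

    int main()
    {
        // One glyph: U+00FC, the precomposed "u with umlaut".
        const wchar_t composed[]   = L"\u00FC";
        // Two glyphs: 'u' followed by U+0308, the combining umlaut.
        const wchar_t decomposed[] = L"u\u0308";

        // Same character when rendered, but different lengths in memory.
        std::cout << std::wcslen(composed)   << std::endl;  // 1
        std::cout << std::wcslen(decomposed) << std::endl;  // 2
        return 0;
    }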

    A code point is a unit of memory, like a char in a char * array, a wchar_t in a wchar_t[] array, etc.

    Unicode is one character set (one set of glyphs), but there are MULTIPLE ways to encode it. Popular methods are UTF-8, UTF-32, and UTF-16; UTF-16 is what WinNT uses internally.

    To be *guaranteed* that each Unicode glyph takes up exactly one code point, yes, you need 32 bits (UTF-32). However, UTF-32 is inefficient because something like 95% of the current Unicode character set would have 16 leading zeros.

    Most modern implementations of Unicode use UTF-16 encoding. The vast majority of the glyphs will occupy one 16-bit code point, but some glyphs may take up two 16-bit code points.
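    For instance (assuming a 16-bit wchar_t, as on Windows; the values are the standard UTF-16 surrogate pair for U+1D11E, the musical G clef symbol):
    Code:
    #include <iostream>

    int main()
    {
        // Inside the Basic Multilingual Plane: one 16-bit code point per glyph.
        const wchar_t e_acute[] = { 0x00E9, 0 };          // U+00E9, 'e with acute'
        // Outside it: one glyph stored as two 16-bit code points (a surrogate pair).
        const wchar_t g_clef[]  = { 0xD834, 0xDD1E, 0 };  // U+1D11E

        std::cout << (sizeof(e_acute) / sizeof(wchar_t)) - 1 << std::endl;  // 1
        std::cout << (sizeof(g_clef)  / sizeof(wchar_t)) - 1 << std::endl;  // 2
        return 0;
    }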

    There are other encodings. For example, UTF-8 uses 8-bit code points, so you could use it with char* strings. However, you would have to be aware that a single glyph, for example a Japanese hiragana 'a', would occupy more than one code point (it would take up at least 2 positions in your char* array).
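    Concretely (the byte values below are just the standard UTF-8 encoding of U+3042, hiragana 'a'):
    Code:
    #include <cstring>
    #include <iostream>

    int main()
    {
        // One glyph, three 8-bit code points.
        const char hiragana_a[] = "\xE3\x81\x82";
        // strlen() counts code points (chars), not characters: prints 3.
        std::cout << std::strlen(hiragana_a) << std::endl;
        return 0;
    }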

    It is also true that glyphs can take up multiple code points in UTF-16, but it's vastly less likely, and you might be able to get away with the assumption that one code point = one glyph, depending on how robust you want your Unicode support to be. For simple, non-robust applications you can assume that one UTF-16 code point = one glyph = one character, and most of the time this will be true.

    Further, be aware that Windows consoles, at least in VC++, don't allow Unicode output, even with wcout. They internally convert the Unicode back to a multibyte char* string with whatever your default console code page is, and output that. I have yet to find a way around this. Perhaps another compiler may work differently; VC++ always uses 8-bit code pages for console output, and I can't figure out if it's possible to make it use UTF-8.

    Essentially, think of Unicode like this:

    characters <-> glyphs <-> code points

    What we read as text is a sequence of characters. This is mapped (not necessarily uniquely, as in the case of u with umlaut) to a set of glyphs.

    The sequence of glyphs is the *logical* representation of the character sequence. This is mapped via encoding methods (UTF-8, UTF-16, UTF-32) to a set of code points.

    The sequence of code points is the *physical* representation of the character sequence in memory. Note that it is possible for different code point sequences to represent the same glyph sequence. For example, you could encode in UTF-8 and UTF-16, and would end up with two different code point sequences, but they would both represent the same glyph sequence (thus the same character sequence).
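    To make that concrete, here is "u with umlaut" (U+00FC) in both encodings; the byte values are simply the standard encodings of that code point:
    Code:
    // The same character as two different code point sequences;
    // both decode to the same glyph sequence.
    const char    utf8_form[]  = "\xC3\xBC";      // UTF-8: two 8-bit code points
    const wchar_t utf16_form[] = { 0x00FC, 0 };   // UTF-16: one 16-bit code point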
    Last edited by Cat; 11-02-2003 at 03:28 PM.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  5. #5
    Disturbed Boy gustavosserra's Avatar
    Join Date
    Apr 2003
    Posts
    244
    Uhm... it is just a little bit more complicated than I had thought. But I think that I got it! Thank you all!!!!
    Just to test myself: wchar_t can't guarantee that I will work with Unicode. I mean, a wchar_t array is just a sequence of bytes, not necessarily with any semantics attached; my application must give the wchar_t data its proper meaning.
    Correct?
    Nothing more to tell about me...
    Happy day =)

  6. #6
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Yes.

    You need to make your application properly interpret the wchar_t array. Just like with a char * string, you need to interpret it AS a string. You can use a char[] array to store things that are not strings, but you shouldn't call string functions on them.

    WinNT natively uses UTF-16, so you can use wchar_t strings, in combination with WinAPI functions, to successfully use Unicode in your programs. Windows handles most of the work behind the scenes, but you need to be careful if you're trying to modify characters yourself.
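    A minimal sketch of that, using the wide (W) version of one Win32 call; the message text and title here are made up for illustration:
    Code:
    #include <windows.h>

    int main()
    {
        // L"" literals are UTF-16 on Windows; MessageBoxW expects wchar_t strings.
        MessageBoxW(NULL, L"Hello, wide world!", L"wchar_t demo", MB_OK);
        return 0;
    }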
    Last edited by Cat; 11-02-2003 at 04:51 PM.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.
