Thread: wchar_t type

  1. #1
    Disturbed Boy gustavosserra's Avatar
    Join Date
    Apr 2003
    Posts
    244

    wchar_t type

    I was looking for new things to learn when I found this type wchar_t. The purpose of this type, according to what I have read, is to store Unicode. But the size of wchar_t is 2, and as far as I know, 4 bytes are needed to store a Unicode character. And one more detail: when I print a wchar_t with cout (of course) I get a number, not a char.
    My questions are:
    1) Is this wchar_t correct?
    2) Is wchar_t ANSI? If not, what ANSI type can I rely on to store Unicode?
    3) I think that wchar_t is a typedef for another basic type, but I do not know any basic type that has 2 bytes. Does anyone know any type that has 2 bytes on a Duron (running Windows XP)?
    Thanks in advance!!!
    Nothing more to tell about me...
    Happy day =)

  2. #2
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    >But the size of wchar_t is 2
    It doesn't have to be 2 bytes any more than char has to be 8 bits.

    >1) Is this wchar_t correct?
    That would be a safe assumption. I've found no use for it yet, but I have little doubt the implementation is correct. If I didn't trust my compiler or the standard, I'd have a lot of trouble being a programmer. Trust the tools you use or you'll be doomed to rewrite them poorly.

    >2) Is wchar_t ANSI?
    Yes.

    >3) I think that wchar_t is a typedef for another basic type, but I do not know any basic type that has 2 bytes.
    short is usually 2 bytes, but it doesn't have to be. The internal representation is irrelevant anyway; trust that it's big enough for your purposes.
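    A quick check you can run yourself (the exact numbers are implementation-defined; the standard only guarantees wchar_t is at least as wide as char):
    Code:
    #include <iostream>

    int main()
    {
        // Typical output on a 32-bit Windows compiler of this era is 2 and 2,
        // but other platforms (e.g. gcc on Linux) commonly report 4 for wchar_t.
        std::cout << "sizeof(short)   = " << sizeof(short) << '\n';
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
        return 0;
    }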
    My best code is written with the delete key.

  3. #3
    S Sang-drax's Avatar
    Join Date
    May 2002
    Location
    Göteborg, Sweden
    Posts
    2,072
    cout doesn't handle wchar_t; that is why you are printing a number instead of a character. The type of cout is basic_ostream<char, ...>, so you will have to use wcout, which is a basic_ostream<wchar_t, ...> (also known as wostream).
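    For example (a minimal sketch; with older compilers, cout << wc picks the int overload after promotion, which is why you see a number):
    Code:
    #include <iostream>

    int main()
    {
        wchar_t wc = L'A';
        std::cout  << wc << std::endl;   // char stream: prints the numeric value (65)
        std::wcout << wc << std::endl;   // wide stream: prints the character A
        return 0;
    }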
    Last edited by Sang-drax; 11-02-2003 at 02:48 PM.

  4. #4
    Registered User
    Join Date
    May 2003
    Posts
    1,619

    Re: wchar_t type

    Originally posted by gustavosserra
    as far as I know, 4 bytes are needed to store a Unicode character.
    Let's make a distinction. Unicode is a character set. There are different ENCODINGS of this character set (different ways to map code points <-> glyphs).

    A glyph is a unit of text. For example, each ASCII character is a glyph. Some complicated characters may be a combination of several glyphs (for example, a u with an umlaut over it could be stored as the glyph for u followed by the glyph for the umlaut, and the program would be expected to display the two glyphs as a single character). Conversely, there are cases where multiple characters could be stored as one glyph. Some glyphs aren't visible but control how text is displayed (e.g. whether it should run right-to-left or left-to-right). Further, the same character can be created with different glyphs -- e.g. you could use the two glyphs for "u with umlaut" as mentioned above, but there is also a single glyph for "u with umlaut"; same character, different glyphs.
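    A small sketch of that last point, assuming a compiler that accepts \u universal character names and a wchar_t of at least 16 bits:
    Code:
    #include <cwchar>
    #include <iostream>

    int main()
    {
        // One glyph: U+00FC, the precomposed "u with umlaut".
        const wchar_t composed[]   = L"\u00FC";
        // Two glyphs: 'u' followed by U+0308, the combining umlaut.
        const wchar_t decomposed[] = L"u\u0308";

        // Same character when rendered, but different lengths in memory.
        std::cout << std::wcslen(composed)   << std::endl;  // 1
        std::cout << std::wcslen(decomposed) << std::endl;  // 2
        return 0;
    }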

    A code point is a unit of memory, like a char in a char * array, a wchar_t in a wchar_t[] array, etc.

    Unicode is one character set (one set of glyphs), but there are MULTIPLE ways to encode it. Popular methods are UTF-8, UTF-32, and UTF-16; UTF-16 is what WinNT uses internally.

    To be *guaranteed* that each Unicode glyph takes up exactly one code point, yes, you need 32 bits (UTF-32). However, UTF-32 is inefficient because something like 95% of the current Unicode character set would have 16 leading zeros.

    Most modern implementations of Unicode use UTF-16 encoding. The vast majority of the glyphs will occupy one 16-bit code point, but some glyphs may take up two 16-bit code points.
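    For instance (assuming a 16-bit wchar_t, as on Windows; the values are the standard UTF-16 surrogate pair for U+1D11E, the musical G clef symbol):
    Code:
    #include <iostream>

    int main()
    {
        // Inside the Basic Multilingual Plane: one 16-bit code point per glyph.
        const wchar_t e_acute[] = { 0x00E9, 0 };          // U+00E9, 'e with acute'
        // Outside it: one glyph stored as two 16-bit code points (a surrogate pair).
        const wchar_t g_clef[]  = { 0xD834, 0xDD1E, 0 };  // U+1D11E

        std::cout << (sizeof(e_acute) / sizeof(wchar_t)) - 1 << std::endl;  // 1
        std::cout << (sizeof(g_clef)  / sizeof(wchar_t)) - 1 << std::endl;  // 2
        return 0;
    }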

    There are other encodings. For example, UTF-8 uses 8-bit code points, so you could use it with char* strings. However, you would have to be aware that a single glyph, for example a Japanese hiragana 'a', would occupy more than one code point (it would take up at least 2 positions in your char* array).
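    Concretely (the byte values below are just the standard UTF-8 encoding of U+3042, hiragana 'a'):
    Code:
    #include <cstring>
    #include <iostream>

    int main()
    {
        // One glyph, three 8-bit code points.
        const char hiragana_a[] = "\xE3\x81\x82";
        // strlen() counts code points (chars), not characters: prints 3.
        std::cout << std::strlen(hiragana_a) << std::endl;
        return 0;
    }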

    It is also true that glyphs can take up multiple code points in UTF-16, but it's vastly less likely, and you might be able to get away with the assumption that one code point = one glyph, depending on how robust you want your Unicode support to be. For simple, non-robust applications you can assume that one UTF-16 code point = one glyph = one character, and most of the time this will be true.

    Further, be aware that Windows consoles, at least in VC++, don't allow Unicode output, even with wcout. They internally convert the Unicode back to a multibyte char* string with whatever your default console code page is, and output that. I have yet to find a way around this. Perhaps another compiler may work differently; VC++ always uses 8-bit code pages for console output, and I can't figure out if it's possible to make it use UTF-8.

    Essentially, think of Unicode like this:

    characters <-> glyphs <-> code points

    What we read as text is a sequence of characters. This is mapped (not necessarily uniquely, as in the case of u with umlaut) to a set of glyphs.

    The sequence of glyphs is the *logical* representation of the character sequence. This is mapped via encoding methods (UTF-8, UTF-16, UTF-32) to a set of code points.

    The sequence of code points is the *physical* representation of the character sequence in memory. Note that it is possible for different code point sequences to represent the same glyph sequence. For example, you could encode in UTF-8 and UTF-16, and would end up with two different code point sequences, but they would both represent the same glyph sequence (thus the same character sequence).
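    To make that concrete, here is "u with umlaut" (U+00FC) in both encodings; the byte values are simply the standard encodings of that code point:
    Code:
    // The same character as two different code point sequences;
    // both decode to the same glyph sequence.
    const char    utf8_form[]  = "\xC3\xBC";      // UTF-8: two 8-bit code points
    const wchar_t utf16_form[] = { 0x00FC, 0 };   // UTF-16: one 16-bit code point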
    Last edited by Cat; 11-02-2003 at 03:28 PM.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  5. #5
    Disturbed Boy gustavosserra's Avatar
    Join Date
    Apr 2003
    Posts
    244
    Uhm... it is just a little bit more complicated than I had thought. But I think that I got it! Thank you all!!!!
    Just to test myself: wchar_t can't guarantee that I will work with Unicode. I mean, a wchar_t array is just a sequence of bytes, not necessarily with any semantics attached; my application must give the wchar_t data its proper meaning.
    Correct?
    Nothing more to tell about me...
    Happy day =)

  6. #6
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Yes.

    You need to make your application properly interpret the wchar_t array. Just like with a char * string, you need to interpret it AS a string. You can use a char[] array to store things that are not strings, but you shouldn't call string functions on them.

    WinNT natively uses UTF-16, so you can use wchar_t strings, in combination with WinAPI functions, to successfully use Unicode in your programs. Windows handles most of the work behind the scenes, but you need to be careful if you're trying to modify characters yourself.
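    A minimal sketch of that, using the wide (W) version of one Win32 call; the message text and title here are made up for illustration:
    Code:
    #include <windows.h>

    int main()
    {
        // L"" literals are UTF-16 on Windows; MessageBoxW expects wchar_t strings.
        MessageBoxW(NULL, L"Hello, wide world!", L"wchar_t demo", MB_OK);
        return 0;
    }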
    Last edited by Cat; 11-02-2003 at 04:51 PM.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.
