Thread: wide character (unicode) and multi-byte character

  1. #1
    Registered User
    Join Date
    May 2006
    Posts
    1,579

    wide character (unicode) and multi-byte character

    Hello everyone,


    Wide character and multi-byte character are two popular encoding schemes on Windows, and wide character uses the Unicode encoding scheme. But I get confused every time another term, codepage, comes up in the same discussion.

    I am even more confused when I see that a codepage parameter is sometimes needed for wide character conversion and sometimes not. Here are two examples.

    A code page is used in WideCharToMultiByte when dealing with Unicode characters:

    Code:
    int WideCharToMultiByte (
        UINT    CodePage,
        DWORD   dwFlags,
        LPCWSTR lpWideCharStr,
        int     cchWideChar,
        LPSTR   lpMultiByteStr,
        int     cbMultiByte,
        LPCSTR  lpDefaultChar,
        LPBOOL  lpUsedDefaultChar );
    A code page is not used in wcstombs when dealing with Unicode characters:

    Code:
    size_t wcstombs (
        char*          mbstr,
        const wchar_t* wcstr,
        size_t         count );
    My question is, what is a codepage (it seems my current understanding is not correct)? Does a codepage have anything to do with multi-byte characters, or is it only related to wide characters? Could anyone explain the meaning of, and the relationship between, codepage, wide character and multi-byte character?


    thanks in advance,
    George
    Last edited by Dave_Sinkula; 05-02-2007 at 09:12 PM. Reason: Code tags: learn 'em, live 'em, love 'em.

  2. #2
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Wide characters in Windows are always in the UTF-16 encoding. Code pages are irrelevant.

    Multi-byte characters, on the other hand, can be in one of many encodings. In Windows, these encodings are called codepages. Important codepages are e.g. 437 (US English in DOS times), 850 (Western Europe in DOS times, contains diacritics like ö and ü), 1252 (modern Western European), 1251 (modern Eastern European, contains Cyrillic characters), and 65001, which denotes UTF-8.

    For the WinAPI function, you specify the code page directly, or you refer to a locale setting (CP_OEMCP, CP_ACP etc. are special constants referring to the locale-set "OEM" (DOS legacy) or "ANSI" (modern Windows) codepages).
    The CRT function, on the other hand, always relies exclusively on the locale.
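
    The locale dependence of the CRT function can be sketched as follows. This is a minimal illustration, not part of the original post: convert_in_locale is a hypothetical helper name, and only the standard setlocale and wcstombs calls are used. Locale names other than "C" are platform-specific, so none is assumed here.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts ws with wcstombs under the named C locale.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 if the locale is unavailable or a character cannot be
 * represented in that locale's multi-byte encoding. */
size_t convert_in_locale(const char *locale_name, const wchar_t *ws,
                         char *out, size_t cap)
{
    /* A NULL name would only query the locale, not set it. */
    assert(locale_name != NULL && ws != NULL);

    /* wcstombs has no code page parameter: LC_CTYPE decides the encoding. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;
    return wcstombs(out, ws, cap);
}
```

    In the plain "C" locale an ASCII letter converts to one byte, while a Cyrillic letter such as U+0434 has no multi-byte representation and the call fails. Selecting a different locale changes the result without any change to the wcstombs call itself.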
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  3. #3
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks CornedBee,


    Quote Originally Posted by CornedBee View Post
    Wide characters in Windows are always in the UTF-16 encoding. Code pages are irrelevant.

    Multi-byte characters, on the other hand, can be in one of many encodings. In Windows, these encodings are called codepages. Important codepages are e.g. 437 (US English in DOS times), 850 (Western Europe in DOS times, contains diacritics like ö and ü), 1252 (modern Western European), 1251 (modern Eastern European, contains Cyrillic characters), and 65001, which denotes UTF-8.

    For the WinAPI function, you specify the code page directly, or you refer to a locale setting (CP_OEMCP, CP_ACP etc. are special constants referring to the locale-set "OEM" (DOS legacy) or "ANSI" (modern Windows) codepages).
    The CRT function, on the other hand, always relies exclusively on the locale.
    I am confused about two basic concepts in your reply, so here are two basic questions:

    1. What is a codepage? It seems UTF-16 and UTF-8 are also codepages, besides 437 and 1251? I thought UTF-8 and UTF-16 were encoding schemes.

    2. Why is the codepage not relevant in unicode 16?


    regards,
    George

  4. #4
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    1. A codepage is a mapping from byte sequences to the abstract concept of characters. In other words, it's the same as an encoding. For example, the ASCII encoding says "the byte with the value 65 corresponds to the Latin uppercase A". There is no way to canonically represent the character "Latin uppercase A" in a computer - it's always a matter of interpretation. Encodings specify interpretation. Typically, the operating system has some internal code table (the "platform character set") of abstract characters, and encodings are mappings from byte sequences into this table. The table can then be used e.g. as a table of indices into a font file to find the glyph to display.

    I just realized that you might already know all this. So ... codepage is mainly a legacy (DOS times) name for encodings. (Not as flexible, though. I'm not sure how well the codepage mechanism copes with stateful encodings such as ISO-2022-JP. Shift-JIS itself is not stateful, and Windows knows it as codepage 932.)

    2. It's UTF-16, not "unicode 16". Such a thing doesn't exist. And I didn't say codepages are irrelevant in UTF-16 - on the contrary, I think Windows has a codepage number for UTF-16. I said that codepages are irrelevant to wide characters, because in Windows, wide characters are always encoded as UTF-16 and you can't change it.
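
    To make point 1 concrete, here is a minimal sketch of "a codepage is a mapping from bytes to characters". It is an illustration only: decode_byte is a hypothetical helper, and it covers just the ranges of codepages 1252 and 1251 that can be written as simple arithmetic (ASCII, 0xA0..0xFF in 1252, 0xC0..0xFF in 1251). The real tables contain more entries, which this sketch reports as U+FFFD.

```c
#include <assert.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu  /* U+FFFD: byte not covered by this sketch */

/* Hypothetical helper: maps one byte to a Unicode code point under the
 * given code page. Only code pages 1252 and 1251 are known here. */
uint32_t decode_byte(unsigned codepage, uint8_t b)
{
    assert(codepage == 1252 || codepage == 1251);

    if (b < 0x80)
        return b;                        /* ASCII range: same in both */

    if (codepage == 1252)                /* 0xA0..0xFF match Latin-1 */
        return b >= 0xA0 ? (uint32_t)b : REPLACEMENT;

    /* Code page 1251: 0xC0..0xFF are the Cyrillic letters U+0410..U+044F. */
    return b >= 0xC0 ? 0x0410u + (uint32_t)(b - 0xC0) : REPLACEMENT;
}
```

    The byte 65 decodes to U+0041 (Latin uppercase A) under both codepages, but the byte 0xE4 is U+00E4 (ä) under 1252 and U+0434 (д) under 1251. The byte alone does not identify the character; the codepage supplies the interpretation.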

  5. #5
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks CornedBee! Great reply!


    Quote Originally Posted by CornedBee View Post
    1. A codepage is a mapping from byte sequences to the abstract concept of characters. In other words, it's the same as an encoding. For example, the ASCII encoding says "the byte with the value 65 corresponds to the Latin uppercase A". There is no way to canonically represent the character "Latin uppercase A" in a computer - it's always a matter of interpretation. Encodings specify interpretation. Typically, the operating system has some internal code table (the "platform character set") of abstract characters, and encodings are mappings from byte sequences into this table. The table can then be used e.g. as a table of indices into a font file to find the glyph to display.

    I just realized that you might already know all this. So ... codepage is mainly a legacy (DOS times) name for encodings. (Not as flexible, though. I'm not sure how well the codepage mechanism copes with stateful encodings such as ISO-2022-JP. Shift-JIS itself is not stateful, and Windows knows it as codepage 932.)

    2. It's UTF-16, not "unicode 16". Such a thing doesn't exist. And I didn't say codepages are irrelevant in UTF-16 - on the contrary, I think Windows has a codepage number for UTF-16. I said that codepages are irrelevant to wide characters, because in Windows, wide characters are always encoded as UTF-16 and you can't change it.
    1. I have read the documentation for the Windows API WideCharToMultiByte again:

    http://msdn2.microsoft.com/en-us/library/aa450989.aspx

    From your description, I think the first parameter, CodePage, is not used, since on Windows wide characters are always encoded as UTF-16, right?

    2. Where can I find the mapping table for each codepage (or encoding), i.e. how a number is mapped to a character?


    regards,
    George

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Quote Originally Posted by George2 View Post
    1. I have read the documentation for the Windows API WideCharToMultiByte again:

    http://msdn2.microsoft.com/en-us/library/aa450989.aspx

    From your description, I think the first parameter, CodePage, is not used, since on Windows wide characters are always encoded as UTF-16, right?
    Wrong. It's used for the MultiByte part of the deal.

    2. Where can I find the mapping table for each codepage (or encoding), i.e. how a number is mapped to a character?
    Search the web. I believe unicode.org has mappings from various encodings to the Unicode table somewhere.
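
    Going back to the first point, the role of the CodePage argument can be sketched in portable C. This is an assumption-laden illustration, not the real API: wide_to_multibyte is a hypothetical stand-in that handles only 65001 (UTF-8) and the Latin-1-compatible part of 1252, uses uint16_t for UTF-16 code units, substitutes '?' for characters 1252 cannot hold, and does not validate unpaired surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: converts n UTF-16 code units to bytes in the
 * requested code page. Returns the number of bytes written, or 0 if the
 * code page is unsupported or the output buffer is too small. */
size_t wide_to_multibyte(unsigned codepage, const uint16_t *in, size_t n,
                         uint8_t *out, size_t cap)
{
    size_t w = 0;
    assert(in != NULL && out != NULL);
    if (codepage != 1252 && codepage != 65001)
        return 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        uint8_t tmp[4];
        size_t len;

        /* The input side is always UTF-16: join a surrogate pair. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++;
        }

        /* The output side depends on the code page. */
        if (codepage == 1252) {
            int ok = cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF);
            tmp[0] = ok ? (uint8_t)cp : (uint8_t)'?';
            len = 1;
        } else if (cp < 0x80) {
            tmp[0] = (uint8_t)cp;
            len = 1;
        } else if (cp < 0x800) {
            tmp[0] = (uint8_t)(0xC0 | (cp >> 6));
            tmp[1] = (uint8_t)(0x80 | (cp & 0x3F));
            len = 2;
        } else if (cp < 0x10000) {
            tmp[0] = (uint8_t)(0xE0 | (cp >> 12));
            tmp[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            tmp[2] = (uint8_t)(0x80 | (cp & 0x3F));
            len = 3;
        } else {
            tmp[0] = (uint8_t)(0xF0 | (cp >> 18));
            tmp[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            tmp[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            tmp[3] = (uint8_t)(0x80 | (cp & 0x3F));
            len = 4;
        }

        if (w + len > cap)
            return 0;
        for (size_t k = 0; k < len; k++)
            out[w++] = tmp[k];
    }
    return w;
}
```

    The same wide input, U+00E4, comes out as the single byte 0xE4 with codepage 1252 and as the two bytes 0xC3 0xA4 with codepage 65001. The wide side never changes; the codepage only selects the encoding of the multi-byte result.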

  7. #7
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks CornedBee,


    Quote Originally Posted by CornedBee View Post
    Wrong. It's used for the MultiByte part of the deal.



    Search the web. I believe unicode.org has mappings from various encodings to the Unicode table somewhere.
    I have read through the MSDN page for WideCharToMultiByte again, but I cannot find where it says that the first parameter, CodePage, is used for the multibyte part (i.e. the output parameter). Could you kindly point it out?

    Previously, I thought multibyte character was one specific encoding (codepage). After this discussion, I think I was wrong. Multibyte character on Windows is a general term for characters that may be stored in more than one byte, and several encodings (codepages), such as UTF-8, ANSI, ..., can be called multibyte. Is my understanding correct?

    If it is, what is the difference between multi-byte and wide characters? I think both are characters represented by more than one byte. Why are they distinguished on Windows?


    regards,
    George
