Thread: get wide character and multibyte character value

  1. #1
    Registered User
    Join Date
    May 2006
    Posts
    1,579

    get wide character and multibyte character value

    Hello everyone,


    I need to know the wide character (Unicode) and multibyte (UTF-8) values of a Czech character string. I personally know nothing about Czech. Is the following approach correct?

    1. I put the L prefix on the character string and watch memory to get the wide character representation of the string in little-endian form;

    2. I change the computer's region/language to Czech and call WideCharToMultiByte with CP_ACP as the code page and the L character string as input, then read the multibyte string from the output parameter lpMultiByteStr.

    Are (1) and (2) correct? Are there any more efficient or smarter ways?


    thanks in advance,
    George

  2. #2
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    1) You use the term Unicode wrong.
    2) You misunderstand what Unicode is all about.
    3) You misunderstand the effect of CP_ACP.

    So your approach is completely wrong.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  3. #3
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Hi CornedBee,


    What is your solution? :-)

    regards,
    George

  4. #4
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    You can read the documentation of WideCharToMultiByte again.

    You can read more about Unicode.

  5. #5
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Hi CornedBee,


    Could you provide some hints please? I have used WideCharToMultiByte a couple of times before, and it converts wide characters to multibyte characters as designed and as I stated. On 32-bit Windows platforms, it converts UTF-16 to UTF-8.

    On which point am I wrong? :-)


    regards,
    George

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    1) UTF-16 is not Unicode. Unicode is a combination of a character set (the Unicode character set, equivalent to the Universal Character Set (UCS) of ISO/IEC 10646) and algorithms for comparing, sorting and casing that take into account the peculiarities of scripts and languages from all over the world. UTF-16 and UTF-8 are encodings of the code points of this character set into a byte stream. (More correctly, UTF-16 is an encoding to 16-bit units, and UTF-16LE and UTF-16BE define further serialization to byte streams.) Win32 programmers and the API docs are frighteningly careless in mixing up the two terms. It's as if UTF-16 was Unicode to Win32 programmers, which is stupid.

    2) Given that Unicode attempts to cover all languages at the same time, doing anything Czech-specific in the conversion of UTF-16 to UTF-8 (which is just an encoding conversion) is an absurd idea. The conversion means going through the UTF-16 string, decoding the code points (1 16-bit unit for most characters, 2 for surrogate pairs), and encoding them as UTF-8 (between 1 and 4 bytes) into the target buffer. There is absolutely nothing language-specific about this.

    3) CP_ACP uses the system's current ANSI code page, as understood by the Win32 API. On a US or Western European system this is typically 1252 (437 and 850 are the corresponding OEM code pages, not ANSI ones). By setting the language to Czech, you probably set the ACP to 1250 (Central European). In any case, you won't get UTF-8.
    More to the point, setting the ACP to UTF-8 is not supported, even though it's apparently possible: many API functions are not able to handle multi-byte characters longer than 2 bytes, so they can't work with UTF-8. See Michael Kaplan's multitude of comments on the subject.
    So to convert to UTF-8, you must pass CP_UTF8. The MSDN page could have told you that, but apparently you are banned from MSDN, because otherwise you surely would have followed my advice and known that already.
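
    As a rough illustration of the conversion described in point 2, here is a minimal portable C++ sketch. It is not the Win32 implementation: the function name utf16_to_utf8 is hypothetical, std::u16string stands in for a Windows wchar_t string, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch. On Windows the same job is done by calling WideCharToMultiByte with CP_UTF8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-16 code units into code points and
// re-encode each code point as 1 to 4 UTF-8 bytes.
// Assumption of this sketch: unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // consume the low surrogate as well
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }
        assert(cp <= 0x10FFFF);

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

    With this sketch, the Czech letter U+010D (c with caron, one UTF-16 unit) comes out as the two bytes C4 8D, and nothing in the code depends on the system language.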

  7. #7
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks CornedBee,


    I got more insight. I want to confirm with you that the wide character on Windows is UTF-16, right? So, I could use L on the character string to get its UTF-16 (wide character) binary value?

    (I am in debug mode and monitor the L character string buffer). Is it a correct approach to get the UTF-16 binary value of the character?

    regards,
    George

  8. #8
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Quote Originally Posted by George2 View Post
    I got more insight. I want to confirm with you that the wide character on Windows is UTF-16, right?
    Correct. It should also be noted that this is a violation of the C++ standard, albeit a minor one.

    So, I could use L on the character string to get its UTF-16 (wide character) binary value?
    You use L on the character string to make it a wide literal. If you subsequently look at the memory, you'll find the UTF-16 representation, yes. (UTF-16LE, to be specific.)
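
    To see what a debugger's memory view would show, a small sketch can serialize the code units by hand. It uses a char16_t literal (u"...") rather than an L"..." literal so that it behaves the same on platforms where wchar_t is not 16 bits; that substitution and the helper name utf16le_bytes are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: write each 16-bit code unit low byte first,
// which is the UTF-16LE layout seen in memory on little-endian machines.
std::vector<std::uint8_t> utf16le_bytes(const std::u16string& s) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(s.size() * 2);
    for (char16_t unit : s) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // low byte
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));    // high byte
    }
    assert(bytes.size() == s.size() * 2);
    return bytes;
}
```

    For the single Czech letter U+010D, written as u"\u010D", this yields the bytes 0D 01.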

  9. #9
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks for your confirmation CornedBee!


    1.

    Quote Originally Posted by CornedBee View Post
    Correct. It should also be noted that this is a violation of the C++ standard, albeit a minor one.
    Which violation do you mean?

    2.

    Quote Originally Posted by CornedBee View Post
    You use L on the character string to make it a wide literal. If you subsequently look at the memory, you'll find the UTF-16 representation, yes. (UTF-16LE, to be specific.)
    When I convert from wide characters to multibyte characters, should I use the UTF-8 code page or some special Czech code page?

    I am not sure whether every Czech character encoded as a wide character can also be encoded in UTF-8?


    regards,
    George

  10. #10
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Quote Originally Posted by George2 View Post
    Which violation do you mean?
    The standard requires that every character of the extended execution character set can be represented in a single wide character. Windows claims that its eecs is Unicode, but the UTF-16 of wchar_t is a variable-width encoding.

    When I use wide character to multibyte character conversion, should I use UTF-8 code page or some special Czech code page?
    That depends on what you need the multibyte string for.

    I am not sure whether every Czech character encoded as a wide character can also be encoded in UTF-8?
    UTF-8 and UTF-16 are both complete. There is no Unicode character that cannot be represented in them.
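
    The variable-width point can be made concrete with a sketch that encodes one code point as UTF-16. The function name encode_utf16 is hypothetical, and the sketch assumes the caller passes a valid Unicode scalar value (checked only by an assert).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Code points below U+10000 take one 16-bit unit; the rest take two
// (a surrogate pair), which is why UTF-16 is a variable-width encoding.
std::u16string encode_utf16(std::uint32_t cp) {
    // Assumption: cp is a valid scalar value (not a surrogate, <= U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10 across the pair
        out += static_cast<char16_t>(0xD800 + (cp >> 10));
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    return out;
}
```

    Every Czech letter lies below U+10000 and so fits in one unit, while a character such as U+1F600 needs the pair D83D DE00.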

  11. #11
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks CornedBee,



    Quote Originally Posted by CornedBee View Post
    The standard requires that every character of the extended execution character set can be represented in a single wide character. Windows claims that its eecs is Unicode, but the UTF-16 of wchar_t is a variable-width encoding.
    Correct me if I am wrong. I think that on Windows, a wide character (UTF-16) is implemented in such a way that one character is always 2 bytes, that is, sizeof(unsigned short). So it should be a fixed-width encoding.

    Why do you say it is a "variable-width encoding"?


    regards,
    George

  12. #12
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    CornedBee just mentioned that the encoding stored in wchar_t is variable-width, just like a typical multibyte string stored in 1-byte chars.
    I don't know the details, but unless wchar_t is 4 bytes (the number of bytes needed to represent all characters in fixed width), it must either be variable-width or support only a limited subset of the characters. And the latter somehow does not look like Unicode to me.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  13. #13
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks Elysia,


    You mean wchar_t is always 2 bytes, the 2-byte range is not enough to represent all Unicode characters, and a 4-byte range is required?

    regards,
    George

  14. #14
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Yes, Unicode defines far more than 65,536 code points (the code space runs up to U+10FFFF), so 2 bytes are not enough.
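
    A practical consequence is that the length of a UTF-16 string in 16-bit units can be larger than its number of characters. The sketch below counts code points; the name count_code_points is hypothetical, and it assumes well-formed UTF-16 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in well-formed UTF-16 by
// skipping low surrogates, which are always the second unit of a pair.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;  // a BMP character or the first unit of a surrogate pair
        }
    }
    assert(n <= s.size());
    return n;
}
```

    For a string holding U+010D followed by U+1F600, size() reports 3 units while this function reports 2 code points.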

  15. #15
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks Elysia,


    I get your point now. If I remember correctly, Windows only implements a part of Unicode called UCS-2.

    regards,
    George
