Thread: get wide character and multibyte character value

  1. #16
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    I'm pretty sure Windows claims to support the full Unicode range.

    And yeah, UTF-16 is a variable-width encoding. The characters from Unicode's basic multilingual plane are 2 bytes large, all others are represented with surrogate pairs, two 16-bit units, and are therefore 4 bytes large.
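
    The split between 2-byte and 4-byte characters described above can be sketched in a few lines. This is only an illustration, not anything taken from Windows itself: encode_utf16 is a hypothetical helper name, and the input is assumed to be a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is valid (<= 0x10FFFF and not in 0xD800..0xDFFF).
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit (2 bytes).
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary planes: subtract 0x10000 to get a 20-bit value,
    // then split it into two 10-bit halves.
    cp -= 0x10000;
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (cp >> 10));
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));
    // Surrogate pair: two 16-bit units (4 bytes).
    return { high, low };
}
```

    Calling encode_utf16(0x41) gives one unit, while encode_utf16(0x20000) gives the pair 0xD840, 0xDC00.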
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  2. #17
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Hi CornedBee,


    Code is more powerful than descriptions. Could you tell us which character on Windows is represented as 4 bytes in wide-character format, i.e. in a string literal with the L prefix? :-)

    Sorry for my challenging question.

    Quote Originally Posted by CornedBee View Post
    I'm pretty sure Windows claims to support the full Unicode range.

    And yeah, UTF-16 is a variable-width encoding. The characters from Unicode's basic multilingual plane are 2 bytes large, all others are represented with surrogate pairs, two 16-bit units, and are therefore 4 bytes large.

    regards,
    George

  3. #18
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    You're wrong. In this case, code is not more powerful than descriptions.

    Here's a list of tables:
    http://en.wikipedia.org/wiki/Mapping...ode_characters
    Scroll all the way down.


    Here's some:
    http://www.decodeunicode.org/en/cjk_...hs_extension_b
    Last edited by CornedBee; 01-25-2008 at 08:53 AM.
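
    For anyone who still wants to see it in code, here is a minimal sketch using U+20000, the first code point of the CJK Extension B block linked above. It uses a C++11 char16_t literal so that the unit size does not depend on the platform's wchar_t; with a compiler where wchar_t is 16 bits, as on Windows, an L"..." literal should show the same two units. The names ext_b and ext_b_units are made up for the example.

```cpp
#include <cstddef>

// U+20000 lies outside the Basic Multilingual Plane, so its UTF-16 form
// is a surrogate pair rather than a single 16-bit unit.
constexpr char16_t ext_b[] = u"\U00020000";

// Number of 16-bit units in the literal, excluding the terminating null.
constexpr std::size_t ext_b_units = sizeof(ext_b) / sizeof(ext_b[0]) - 1;

// One character, but two code units (4 bytes when char16_t is 2 bytes).
static_assert(ext_b_units == 2, "expected a surrogate pair");
static_assert(ext_b[0] == 0xD840 && ext_b[1] == 0xDC00,
              "expected the pair D840 DC00");
```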

  4. #19
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    A colleague of mine has the Unicode 5.0 standard document - it's a book about 3"/7.5 cm thick - I haven't looked inside it. So I expect that any "few pages web-site" or "a few lines of code" will not describe this problem properly.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #20
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Sorry CornedBee,


    My question was confusing. What I mean is not UTF-16 in general, but Windows' implementation/representation of UTF-16: it uses wchar_t, which is typedef'd to unsigned short and is only 2 bytes.

    My question should be: Windows uses only 2 bytes to represent UTF-16 and so cannot represent all of UTF-16, right?

    BTW: what you provided is the general theory of UTF-16, but what I am interested in is how Windows represents it and what limitations Windows has. How could an unsigned short represent 4 bytes to cover all characters? :-)

    Quote Originally Posted by CornedBee View Post
    You're wrong. In this case, code is not more powerful than descriptions.

    Here's a list of tables:
    http://en.wikipedia.org/wiki/Mapping...ode_characters
    Scroll all the way down.


    Here's some:
    http://www.decodeunicode.org/en/cjk_...hs_extension_b

    regards,
    George

  6. #21
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    You are confused. UTF-16 is a variable width encoding. That means it can be longer if necessary, just like UTF-8. A 16-bit value cannot represent all unicode characters (which is around 100.000 iirc), so it's variable width. Usually, the first byte will tell how long the character value is (is it 1, 2 or 3 bytes?). So the application processing the unicode string will interpret the first or second byte (I have no idea how UTF-16 works), and then read the applicable amount of characters that is needed to represent the entire unicode value.
    Does that make any sense?
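
    For UTF-16 specifically, the length is decided by the first 16-bit unit rather than the first byte: a unit in the range 0xD800..0xDBFF (a high surrogate) announces that a second unit in 0xDC00..0xDFFF (a low surrogate) follows. A rough sketch of a decoder built on that rule is below. The function names are hypothetical, and returning U+FFFD for an unpaired surrogate is just one possible policy chosen for the example.

```cpp
#include <cstddef>
#include <cstdint>

// True if u is a high (lead) surrogate, 0xD800..0xDBFF.
bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if u is a low (trail) surrogate, 0xDC00..0xDFFF.
bool is_low_surrogate(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Decode the code point that starts at s[i] and advance i past it.
// Precondition: i < n. Unpaired surrogates are reported as U+FFFD
// (an assumption of this sketch, not a required behaviour).
std::uint32_t decode_utf16(const std::uint16_t* s, std::size_t n, std::size_t& i)
{
    const std::uint16_t first = s[i++];
    if (is_high_surrogate(first) && i < n && is_low_surrogate(s[i])) {
        const std::uint16_t second = s[i++];
        // Recombine the two 10-bit halves and add back the 0x10000 offset.
        return 0x10000 + ((static_cast<std::uint32_t>(first - 0xD800) << 10)
                          | static_cast<std::uint32_t>(second - 0xDC00));
    }
    if (is_high_surrogate(first) || is_low_surrogate(first)) {
        return 0xFFFD; // lone surrogate
    }
    return first; // ordinary BMP character, one unit
}
```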
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  7. #22
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks Elysia,


    That makes sense as general UTF-16 theory.

    But on Windows, as we know, a wide character (UTF-16LE form) is represented by wchar_t, an unsigned short, whose length is 2 bytes.

    Do you think 2 bytes are enough on Windows to represent every UTF-16 encoded character?

    Quote Originally Posted by Elysia View Post
    You are confused. UTF-16 is a variable width encoding. That means it can be longer if necessary, just like UTF-8. A 16-bit value cannot represent all unicode characters (which is around 100.000 iirc), so it's variable width. Usually, the first byte will tell how long the character value is (is it 1, 2 or 3 bytes?). So the application processing the unicode string will interpret the first or second byte (I have no idea how UTF-16 works), and then read the applicable amount of characters that is needed to represent the entire unicode value.
    Does that make any sense?

    regards,
    George

  8. #23
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    *bangs head against desk*

    Look, just because a single unit is 2 bytes, doesn't mean that every character is 2 bytes. Surrogate pairs are two units (4 bytes) that together form a character.
    That's why it's called a variable-width encoding.
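
    The difference between units and characters is easy to see by counting both. The sketch below assumes a well-formed UTF-16 string and counts characters by skipping low surrogates, since each of those is only the second half of a pair; count_code_points is a name invented for this example.

```cpp
#include <cstddef>
#include <string>

// Count the code points in a UTF-16 string.
// Assumption: s is well formed, so every low surrogate (0xDC00..0xDFFF)
// is the second unit of a pair and must not be counted on its own.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t count = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

    For the string u"A\U00020000", size() reports 3 units while count_code_points reports 2 characters.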

    Why would 2 bytes magically be able to hold more on Windows? What kind of absurd thinking is that?

  9. #24
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks CornedBee,


    I think you mean using two unsigned shorts (4 bytes) to represent one character in UTF-16LE encoding, right?

    As far as I know, Windows does not do that. However, I am not an expert, and I would appreciate it if you could point out some sample characters that require 4 bytes rather than 2 bytes on Windows in wide (UTF-16LE) character form.

    Quote Originally Posted by CornedBee View Post
    *bangs head against desk*

    Look, just because a single unit is 2 bytes, doesn't mean that every character is 2 bytes. Surrogate pairs are two units (4 bytes) that together form a character.
    That's why it's called a variable-width encoding.

    Why would 2 bytes magically be able to hold more on Windows? What kind of absurd thinking is that?

    regards,
    George

  10. #25
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    You're not getting the point. It's variable width. That means a UTF-16 character can be 2 or 4 bytes. It all depends on which character it's trying to represent!
    If all characters were 4 bytes, then it would be fixed width.

  11. #26
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Hi Elysia,


    I have not made myself understood.

    Your point is that a UTF-16 encoded character can be represented in either a 2-byte or a 4-byte form (so-called variable length). The "16" means the encoding is built from 16-bit units: 16 bits, 32 bits, and so on.

    My point is that on Windows the UTF-16 encoding seems to have a limitation (Windows only uses the 2-byte subset of UTF-16LE): Windows is not able to represent a UTF-16 character in 4 bytes, so only a subset of UTF-16 is supported. Do you agree?

    I am not sure whether this time I have made myself understood? :-)

    Quote Originally Posted by Elysia View Post
    You're not getting the point. It's variable width. That means a UTF-16 character can be 2 or 4 bytes. It all depends on which character it's trying to represent!
    If all characters were 4 bytes, then it would be fixed width.

    regards,
    George

  12. #27
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Windows has had basic support for surrogate pairs since Windows 2000.

  13. #28
    Registered User
    Join Date
    May 2006
    Posts
    1,579
    Thanks CornedBee,


    It must have been my memory error. :-)

    Quote Originally Posted by CornedBee View Post
    Windows has had basic support for surrogate pairs since Windows 2000.

    regards,
    George
