get wide character and multibyte character value

**CornedBee** · 01-25-2008

I'm pretty sure Windows claims to support the full Unicode range.

And yeah, UTF-16 is a variable-width encoding. Characters in Unicode's Basic Multilingual Plane take 2 bytes; all others are represented with surrogate pairs (two 16-bit units) and therefore take 4 bytes.
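For anyone following along, the rule above can be sketched in C++ roughly like this. The helper name `encode_utf16` is made up for illustration and is not a Windows or standard-library function; `char16_t` is used instead of `wchar_t` so the unit is 16 bits on every platform.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Hypothetical helper for illustration only.
std::vector<char16_t> encode_utf16(std::uint32_t cp) {
    // Valid input: up to U+10FFFF, excluding the surrogate range itself.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit (2 bytes).
        return { static_cast<char16_t>(cp) };
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair (4 bytes).
    std::uint32_t offset = cp - 0x10000;
    char16_t high = static_cast<char16_t>(0xD800 + (offset >> 10));
    char16_t low = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
    return { high, low };
}
```

Called with U+0041 it should return a single unit, and with a code point above U+FFFF it should return two.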

**George2** · 01-25-2008

Hi CornedBee,

Code is more powerful than descriptions. Could you tell us which characters on Windows are represented as 4 bytes in wide-character format, i.e. in a string literal with the L prefix? :-)

Sorry for my challenging question.

regards,
George

**CornedBee** · 01-25-2008

You're wrong. In this case, code is not more powerful than descriptions.

Here's a list of tables:
http://en.wikipedia.org/wiki/Mapping...ode_characters
Scroll all the way down.

Here's some:
http://www.decodeunicode.org/en/cjk_...hs_extension_b
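
That said, a small sketch can make one such character concrete. It assumes U+20000, the first code point of the CJK Extension B block, as the sample, and uses a `char16_t` literal so the result does not depend on the platform's `wchar_t` size. On Windows, where `wchar_t` is 16 bits, `L"\U00020000"` would be expected to hold the same two units. The names `kExtBChar` and `check_ext_b` are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One character from CJK Unified Ideographs Extension B (U+20000).
const std::u16string kExtBChar = u"\U00020000";

// Number of bytes the single character occupies in UTF-16.
std::size_t ext_b_bytes() { return kExtBChar.size() * sizeof(char16_t); }

// Sanity check: the literal is stored as a surrogate pair, not one unit.
void check_ext_b() {
    assert(kExtBChar.size() == 2);
    assert(kExtBChar[0] == 0xD840 && kExtBChar[1] == 0xDC00);
}
```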

**matsp** · 01-25-2008

A colleague of mine has the Unicode 5.0 standard document - it's a book about 3"/7.5 cm thick - I haven't looked inside it. So I expect that any "few pages web-site" or "a few lines of code" will not describe this problem properly.

--
Mats

**George2** · 01-25-2008

Sorry CornedBee,

My question was confusing. What I mean is not UTF-16 in general, but Windows' implementation of UTF-16, which uses wchar_t, a typedef for unsigned short that is only 2 bytes.

My question should be: Windows uses only 2 bytes to represent UTF-16 and so cannot represent all of UTF-16, right?

BTW: what you provided is the general theory of UTF-16, but what I am interested in is how Windows represents it and what the limitations of Windows are. How could an unsigned short hold 4 bytes to cover all characters? :-)

regards,
George

**Elysia** · 01-26-2008

You are confused. UTF-16 is a variable-width encoding. That means a character can be longer if necessary, just like in UTF-8. A 16-bit value cannot represent all Unicode characters (there are around 100,000, iirc), so it's variable width. Usually, the first byte will tell how long the character value is (is it 1, 2 or 3 bytes?). So the application processing the Unicode string will interpret the first or second byte (I have no idea how UTF-16 works), and then read however many units are needed to represent the entire Unicode value.
Does that make any sense?
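
For UTF-16 specifically, the "look at the first unit, then read as many as needed" idea can be sketched as below. This is only an illustration under the assumption of well-formed input; `next_code_point` is a hypothetical helper name, not an existing API. In UTF-16 the decision is made on the first 16-bit unit: a value in 0xD800-0xDBFF (a high surrogate) means exactly one more unit follows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Read the code point starting at index i and advance i past it.
// Hypothetical helper; assumes well-formed UTF-16 input.
std::uint32_t next_code_point(const std::u16string& s, std::size_t& i) {
    assert(i < s.size());
    char16_t first = s[i++];
    if (first < 0xD800 || first > 0xDBFF) {
        // Not a high surrogate: this one unit is the whole character.
        return first;
    }
    // High surrogate: a low surrogate (0xDC00-0xDFFF) must follow.
    assert(i < s.size() && s[i] >= 0xDC00 && s[i] <= 0xDFFF);
    char16_t second = s[i++];
    return 0x10000 + ((static_cast<std::uint32_t>(first) - 0xD800) << 10)
                   + (static_cast<std::uint32_t>(second) - 0xDC00);
}
```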

**George2** · 01-27-2008

Thanks Elysia,

That makes sense as general UTF-16 theory.

But on Windows, as we know, a wide character (in UTF-16LE form) is represented by wchar_t, an unsigned short, which is 2 bytes long.

Do you think 2 bytes are enough on Windows to represent every UTF-16 encoded character?

regards,
George

**CornedBee** · 01-27-2008

*bangs head against desk*

Look, just because a single unit is 2 bytes, doesn't mean that every character is 2 bytes. Surrogate pairs are two units (4 bytes) that together form a character.
That's why it's called a variable-width encoding.

Why would 2 bytes magically be able to hold more on Windows? What kind of absurd thinking is that?
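
To make the unit-versus-character distinction concrete, here is a rough sketch that counts characters (code points) in a UTF-16 string rather than 16-bit units. `count_code_points` is a hypothetical name, and the function assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points by not counting the second half of each surrogate pair.
// Hypothetical helper; assumes well-formed UTF-16.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        // Low surrogates (0xDC00-0xDFFF) only appear as the second unit of a pair.
        if (unit < 0xDC00 || unit > 0xDFFF) ++count;
    }
    // There can never be more characters than units.
    assert(count <= s.size());
    return count;
}
```

For a string such as `u"a\U00020000b"`, `size()` reports four units while this function reports three characters.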

**George2** · 01-27-2008

Thanks CornedBee,

I think you mean using two unsigned shorts (4 bytes) to represent one character in UTF-16LE encoding, right?

As far as I know, Windows does not do that. However, I am not an expert, and I would appreciate it if you could point out some sample characters that require 4 bytes rather than 2 on Windows in wide (UTF-16LE) character form.

regards,
George

**Elysia** · 01-27-2008

You're not getting the point. It's variable width. That means a character can be 2 or 4 bytes. It all depends on which character it's trying to represent!
If all characters were 4 bytes, then it would be fixed width.

**George2** · 01-27-2008

Hi Elysia,

I have not made myself understood.

Your point is that a UTF-16 encoded character can be represented by either 2 bytes or 4 bytes (hence "variable length"). The "16" means the encoding works in 16-bit units, so a character takes 16 bits or 32 bits.

My point is that UTF-16 on Windows seems to have a limitation: Windows only uses the 2-byte subset of UTF-16LE and is not able to represent a UTF-16 character in 4 bytes, so only a subset of UTF-16 is supported. Do you agree?

I am not sure whether this time I have made myself understood? :-)

regards,
George

**CornedBee** · 01-27-2008

Windows has had basic support for surrogate pairs since Windows 2000.

**George2** · 01-27-2008

Thanks CornedBee,

It must have been my memory error. :-)

regards,
George
