tolower and locale

**Codeplug** · 02-03-2009

>> An E with an acute accent, for instance, has more than one codepoint in Unicode
That's mixing terms a bit. The single unit of storage for a particular Unicode encoding is called a "code unit". So for UTF8 each byte is a code unit, for UTF16 each 16-bit word is a code unit, etc. A "code point" is the number Unicode assigns to a character (U+00C9, say), and it can take multiple code units to encode a single code point.

>> So those letters would be what -- a single byte outside of the ascii range?
There are various "code pages" that map an integral value to a character glyph. A single-byte code page would contain a maximum of 256 mappings. For any code page (that I know of) the characters between 0 and 0x7F (127) are always the same. These are your basic "ascii" characters.

To illustrate, here's a great site that indexes many character glyphs to the values they have under various code pages. Here is "Latin Capital Letter E With Acute" - http://www.tachyonsoft.com/uc0000.htm#U00C9
As you can see, it can have the following single-byte values, depending on the code page: 0x90, 0xC9, 0x4A, 0x71, or 0xE0.
Under Unicode (UTF32 or UCS-4) it's U+000000C9.
Encoded with UTF8 it's the two bytes 0xC3 0x89.
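To see where those two bytes come from, here's a rough sketch of a UTF8 encoder in C. The name `utf8_encode` is just something made up for this post, not a standard library function, and it only does the bit-shuffling that UTF8 defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode value (e.g. 0xC9 for U+00C9)
 * as UTF8. Writes up to 4 bytes into out and returns how many were
 * written, or 0 if the value is a surrogate or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx: plain ascii */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside the Unicode range */
}
```

Calling `utf8_encode(0xC9, buf)` returns 2 and leaves 0xC3 0x89 in `buf`, matching the bytes above. Anything below 0x80 comes out as a single byte equal to its ascii value.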

Even though UTF8 can use multiple bytes to encode a single character, don't confuse it with a multi-byte code page. UTF8 is just an 8bit encoding of Unicode characters. (So you can kinda think of Unicode as a really big, 32bit "code page".)
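One practical consequence is that `strlen()` on a UTF8 string gives you bytes, not characters. Here's a minimal sketch of counting the encoded characters instead. `utf8_count` is a made-up name, and it assumes the input is already valid UTF8 (no validation is done). It also counts a combining accent as its own character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the encoded characters in a
 * NUL-terminated UTF8 string by skipping continuation bytes,
 * which always look like 10xxxxxx. Assumes valid UTF8 input. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Every byte that is NOT a continuation byte starts a character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

So for the string `"\xC3\x89"`, `strlen()` says 2 but `utf8_count()` says 1.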

Here are examples of multi-byte code pages in Windows: http://www.microsoft.com/globaldev/reference/WinCP.mspx
They call it "DBCS" for obvious reasons.

gg

**MK27** · 02-03-2009

hmmm....
