Thread: tolower and locale

    >> An E with an acute accent, for instance, has more than one codepoint in Unicode
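    That's mixing terms a bit. A "code point" is the number Unicode assigns to a character (U+0000 through U+10FFFF). The single unit of storage for a particular Unicode encoding is called a "code unit". So for UTF8 each byte is a code unit, for UTF16 each 16-bit word is a code unit, etc... It can take multiple code units to represent a single code point. (To be fair, E with acute can also be written as two code points, a plain E followed by the combining acute accent U+0301, but the precomposed form is the single code point U+00C9.)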
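
    To make the storage-unit versus character distinction concrete, here is a minimal sketch that counts both for a UTF8 string. The function name count_utf8_chars is made up for this example, and it assumes the input is already valid UTF8 (no validation is done). It relies on the fact that UTF8 continuation bytes always look like 10xxxxxx, so every byte that does not match that pattern starts a new character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the characters encoded in a UTF-8 string.
// Assumes the input is valid UTF-8; nothing is validated here.
std::size_t count_utf8_chars(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        // Continuation bytes have the bit pattern 10xxxxxx.
        // Any other byte begins a new character.
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

    For the string "\xC3\x89" (E with acute in UTF8), size() reports 2 bytes while count_utf8_chars reports 1 character.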

    >> So those letters would be what -- a single byte outside of the ascii range?
    There are various "code pages" that map an integral value to a character glyph. A single-byte code page would contain a maximum of 256 mappings. For any ASCII-based code page (that I know of) the characters between 0 and 0x7F (127) are always the same. These are your basic "ascii" characters. (EBCDIC code pages are the exception; they lay out even the basic letters differently.)

    To illustrate, here's a great site that indexes many character glyphs to the values they have under various code pages. Here is "Latin Capital Letter E With Acute" -
    As you see, it can have the following single-byte values, depending on code page: 0x90, 0xC9, 0x4A, 0x71, or 0xE0.
    Under Unicode it's code point U+00C9 (stored as 0x000000C9 in UTF32 or UCS-4).
    Encoded with UTF8 it's the two bytes 0xC3 0x89.
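
    If you want to see where those two bytes come from, here is a rough sketch of the UTF8 bit-packing rules. encode_utf8 is a name invented for this post, not a standard library function, and it skips checks a real encoder would want (it does not reject surrogate values, for example).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Illustration only; surrogate code points are not rejected.
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // highest valid Unicode code point
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

    For 0xC9 the second branch applies: 0xC9 >> 6 is 3, so the first byte is 0xC0 | 3 = 0xC3, and 0xC9 & 0x3F is 9, so the second byte is 0x80 | 9 = 0x89.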

    Even though UTF8 can use multiple bytes to encode a single character, don't confuse it with a multi-byte code page. UTF8 is just an encoding of Unicode characters that uses 8-bit units. (So you can kinda think of Unicode as a really big "code page", with values running up to U+10FFFF.)

    Here are examples of multi-byte code pages in Windows:
    They call it "DBCS" (double-byte character set) for obvious reasons.


