tolower and locale

**MK27** · 02-03-2009

Originally Posted by tabstop

The C standard says:
Okay!!

Phew -- except how could this apply to multi-byte utf8 chars? Which I imagine is all those little accented letters.

**tabstop** · 02-03-2009

Originally Posted by MK27

Phew -- except how could this apply to multi-byte utf8 chars? Which I imagine is all those little accented letters.

That's why I mentioned towlower -- wouldn't you be using wide characters for utf8?

**MK27** · 02-03-2009

Originally Posted by tabstop

That's why I mentioned towlower -- wouldn't you be using wide characters for utf8?

Oh, I'm not using them at all. I thought a wide character was one that actually occupied more screen space. That's fine, since the character count will be the same, and I imagine "wide character" alphabets (like ideograms) do not really use upper and lower case. Although that raises the question: what is towlower for?

But the Romance languages, etc. contain a lot of "modified" ASCII characters (an e with an accent, etc.) which I presume, since they are not part of ASCII, must be UTF-8, and can be capitalized (an E with an accent).

**cpjust** · 02-03-2009

Originally Posted by tabstop

wouldn't you be using wide characters for utf8?

They wouldn't be UTF-8 if they were wide chars; they'd be UTF-16 or UTF-32...
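
To make the byte-versus-character distinction concrete, here is a rough sketch that counts code points in a UTF-8 string by skipping continuation bytes. The helper name utf8_code_points is made up for illustration, and the code assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a UTF-8 string.
   utf8_code_points is a hypothetical helper, not a standard function.
   Assumption: the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; every other byte
           starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

An accented e encoded as UTF-8 is the two bytes 0xC3 0xA9, so strlen reports 2 for it while this function reports 1.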

**tabstop** · 02-03-2009

Originally Posted by cpjust

They wouldn't be UTF-8 if they were wide chars; they'd be UTF-16 or UTF-32...

Fair enough then (apparently character handling needs to be my next project). For whatever reason, I was thinking the shift status (or whatever it's called) would be used for this sort of thing. Maybe not.

**Ronix** · 02-03-2009

Originally Posted by MK27

how could this apply to multi-byte utf8 chars? Which I imagine is all those little accented letters.

I believe the argument to tolower / toupper must be representable as an unsigned char, i.e. in the range [0 .. UCHAR_MAX], or equal to EOF, which would make it useless for multi-byte characters.
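
As a minimal sketch of what that means in practice, the loop below lowercases a string one byte at a time. The name lower_bytes is made up for the example. It assumes the default "C" locale, where only A to Z are mapped, so the bytes of a multi-byte UTF-8 character pass through unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Lowercase a string in place, one byte at a time.
   lower_bytes is a hypothetical helper, not a standard function. */
void lower_bytes(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Cast to unsigned char first: a plain char may be negative,
           and passing a negative value other than EOF to tolower
           is undefined behaviour. */
        *s = (char)tolower((unsigned char)*s);
    }
}
```

Given the UTF-8 bytes for "ÉCOLE" (0xC3 0x89 followed by "COLE"), this turns the ASCII letters into "cole" but leaves the two bytes of the É alone, because tolower only ever sees one byte at a time.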

Originally Posted by tabstop

wouldn't you be using wide characters for utf8?

The standard says nothing about the encoding of wide characters. However, on most Unix platforms, wchar_t represents UTF-32 code points (a property you can check by verifying the existence of the __STDC_ISO_10646__ macro), whereas on Windows, it usually represents UCS-2 or UTF-16. So one could use towlower / towupper and then convert back to UTF-8 according to the host platform.
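
A rough sketch of that round trip, using only standard library calls: convert the multibyte string to wide characters with mbstowcs, apply towlower to each one, and convert back with wcstombs. The function name lower_multibyte and its return convention are invented for this example. It only treats the input as UTF-8 if the program has already selected a UTF-8 locale, e.g. with setlocale(LC_ALL, ""), and which letters towlower actually maps depends on that locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Lowercase a multibyte string by going through wide characters.
   lower_multibyte is a hypothetical helper, not a standard function.
   Returns 0 on success, -1 if a conversion fails or out is too small. */
int lower_multibyte(const char *in, char *out, size_t out_size)
{
    /* Every wide character consumes at least one input byte, so
       strlen(in) + 1 elements is always enough room. */
    size_t max = strlen(in) + 1;
    wchar_t *w = malloc(max * sizeof *w);
    size_t n, i, written;

    if (w == NULL)
        return -1;

    n = mbstowcs(w, in, max);          /* multibyte -> wide */
    if (n == (size_t)-1) {             /* invalid sequence for this locale */
        free(w);
        return -1;
    }

    for (i = 0; i < n; ++i)
        w[i] = (wchar_t)towlower((wint_t)w[i]);

    written = wcstombs(out, w, out_size);  /* wide -> multibyte */
    free(w);

    /* written == out_size means there was no room for the terminator. */
    if (written == (size_t)-1 || written >= out_size)
        return -1;
    return 0;
}
```

Where wchar_t is 16 bits, as it usually is on Windows, characters outside the Basic Multilingual Plane do not fit in a single wchar_t, so this per-element approach should not be expected to handle them.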
