Thread: tolower and locale

  1. #1
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300

    tolower and locale

    I figured this might as well be a fresh thread. Does anyone know if I can expect tolower() to perform locale adjusted conversions? I don't have any non-ASCII alphabets handy to test this... Ronix has pointed out:
    Quote Originally Posted by Ronix View Post
    I'm fairly sure that most strcasestr implementations do not follow UTF8-rules and simply use tolower / toupper on each single byte to compare, which makes the function unsuitable for UTF-8 strings
    but Codeplug elaborated:
    Quote Originally Posted by Codeplug View Post
    According to GNU LibC manual, strcasestr() is locale dependent.
    That almost implies to me that tolower() itself is locale dependent, which would be great since I can't use strcasestr. The GNU manual has no comment and neither did a couple of quick googles.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    The C standard says:
    If the argument is a character for which isupper is true and there are one or more corresponding characters, as specified by the current locale, for which islower is true, the tolower function returns one of the corresponding characters (always the same one for any given locale); otherwise, the argument is returned unchanged.
    The blurb for towlower is almost identical.
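
    A minimal sketch of that wording in action, assuming only the standard "C" locale is available. lower_in_locale is a hypothetical helper, not a library function: it switches LC_CTYPE, calls tolower() on one byte, and then resets the locale to "C" (a simplification; real code would save and restore the previous locale).

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: lowercase one byte under the named LC_CTYPE locale.
 * Returns -1 if that locale is not available on this system. */
int lower_in_locale(const char *locale_name, unsigned char byte)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;                 /* locale not installed */
    int result = tolower(byte);    /* mapping comes from the current locale */
    setlocale(LC_CTYPE, "C");      /* simplification: reset to the default */
    return result;
}
```

    In the "C" locale only A-Z count as uppercase, so lower_in_locale("C", 0xC9) hands 0xC9 back unchanged. Under a single-byte ISO-8859-1 locale (the locale name varies by system, so any particular name is an assumption) 0xC9 is É and would be expected to map to 0xE9, é.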

  3. #3
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by tabstop View Post
    The C standard says:
    Okay!!
    Phew -- except how could this apply to multi-byte utf8 chars? Which I imagine is all those little accented letters.
    Last edited by MK27; 02-03-2009 at 03:20 PM.

  4. #4
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by MK27 View Post
    Phew -- except how could this apply to multi-byte utf8 chars? Which I imagine is all those little accented letters.
    That's why I mentioned towlower -- wouldn't you be using wide characters for utf8?

  5. #5
    Registered User
    Join Date
    Dec 2008
    Location
    Black River
    Posts
    128
    Quote Originally Posted by MK27 View Post
    how could this apply to multi-byte utf8 chars? Which I imagine is all those little accented letters.
    I believe the argument to tolower / toupper must be representable as an unsigned char (that is, in the range 0 .. UCHAR_MAX inclusive) or be EOF, which makes them useless for multi-byte characters.
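
    A small sketch of why that range matters in practice. lower_bytes is a hypothetical name; the point is the cast to unsigned char before the call, since plain char may be signed on the host platform.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Lowercase a byte string in place, one byte at a time.
 * The cast matters: plain char may be signed, and passing a negative
 * value other than EOF to tolower() is undefined behaviour. */
void lower_bytes(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        *s = (char)tolower((unsigned char)*s);
}
```

    This handles ASCII and single-byte code pages. Each byte of a multi-byte UTF-8 sequence is still looked at in isolation, so it cannot case-map those characters.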

    Quote Originally Posted by tabstop
    wouldn't you be using wide characters for utf8?
    The standard says nothing about the encoding of wide characters. However, on most Unix platforms, wchar_t represents UTF-32 code points (a property you can check by verifying the existence of the __STDC_ISO_10646__ macro), whereas on Windows, it usually represents UCS-2 or UTF-16. So one could use towlower / towupper and then convert back to UTF-8 according to the host platform.
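
    A sketch of that macro check. wchar_is_unicode is a hypothetical helper name; the assert inside is my own sanity check (an assumption, not something the macro's definition spells out in these words) that every code point up to U+10FFFF fits in wchar_t when the macro is defined.

```c
#include <assert.h>
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
 * ISO 10646 (Unicode) code points, 0 if it makes no such promise. */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    assert(WCHAR_MAX >= 0x10FFFF);  /* sanity check: all code points fit */
    return 1;
#else
    return 0;
#endif
}
```

    A result of 0 does not mean wchar_t is unusable, only that the encoding has to be established some other way for that platform.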
    Last edited by Ronix; 02-03-2009 at 03:35 PM.

  6. #6
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by tabstop View Post
    That's why I mentioned towlower -- wouldn't you be using wide characters for utf8?
    Oh I'm not using them at all. I thought a wide character was one that actually occupied more screen space. That's fine, since the character count will be the same and I imagine "wide character" alphabets (like ideograms) do not really use upper and lower case. Although that raises the question: what is towlower for?

    But the Romance languages, etc., contain a lot of "modified" ASCII characters (an e with an accent, etc.) which I presume, since they are not part of ASCII, must be UTF-8, and can be capitalized (an E with an accent).

  7. #7
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    Quote Originally Posted by tabstop View Post
    wouldn't you be using wide characters for utf8?
    They wouldn't be UTF-8 if they were wide chars, they'd be UTF-16 or UTF-32...
    "I am probably the laziest programmer on the planet, a fact with which anyone who has ever seen my code will agree." - esbo, 11/15/2008

    "the internet is a scary place to be thats why i dont use it much." - billet, 03/17/2010

  8. #8
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by MK27 View Post
    I figured this might as well be a fresh thread. Does anyone know if I can expect tolower() to perform locale adjusted conversions?
    It wouldn't be worth having a function for it if it was not locale-specific.

    At any rate, I think you're giving yourself an unnecessary headache with this UTF-8 thing. UTF-8 is intended primarily as a transfer encoding. For data processing, it is much easier to deal with unescaped UTF-16.

    IMHO, the first thing you should do with the UTF-8 data is translate it to UTF-16, then life will get a lot easier.

    (There is more than one UTF-16 encoding, and some of them do have escapes, but you are far less likely to encounter them)
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  9. #9
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by cpjust View Post
    They wouldn't be UTF-8 if they were wide chars, they'd be UTF-16 or UTF-32...
    Fair enough then (apparently character handling needs to be my next project). For whatever reason, I was thinking the shift status (or whatever it's called) would be used for this sort of thing. Maybe not.

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by brewbuck View Post
    It wouldn't be worth having a function for it if it was not locale-specific.
    I hope so now. I like my own version of strcasestr. However, I don't see how tolower() could possibly work on more than one byte, so I will have to come back to this issue. If strcasestr itself is locale safe, then perhaps it does not use/no longer uses tolower().

    At any rate, I think you're giving yourself an unnecessary headache with this UTF-8 thing. UTF-8 is intended primarily as a transfer encoding.
    Technically, it's PostScript which has been filtered through col -b to produce plain text. That means that the single quote normally used in contractions is replaced with the more proper apostrophe, using, I am pretty sure, Unicode (UTF-8).

    I have been consulting the "UTF-8 and Unicode FAQ for Unix/Linux", which implies to me that Unicode is a singular standard. It also says that it is in the process of becoming the de facto standard on Unix/Linux machines, and that some standard commands like ls and applications like vi and emacs had to be rewritten because of this.

  11. #11
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Quote Originally Posted by MK27 View Post
    Ronix has pointed out:

    but Codeplug elaborated:

    That almost implies to me that tolower() itself is locale dependent, which would be great since I can't use strcasestr.
    Ronix is right. I was wrong when I said that strcasestr() would only work if the current locale (LC_CTYPE) is UTF8.
    The manual says:
    > Like strcasecmp, it is locale dependent how uppercase and lowercase characters are related.

    Looking at the source for strcasestr(), it processes "characters" one byte at a time - making strcasestr() useless with UTF8. However, toupper/tolower are affected by the LC_CTYPE of the current locale. Which means tolower and toupper only really work for single-byte code pages.
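
    To make the byte-at-a-time problem concrete, here is a rough sketch. bytes_equal_nocase is a hypothetical function written in the spirit of that description; it is not glibc's actual source.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical byte-wise case-insensitive equality check.
 * Each byte is folded with tolower() on its own, so a character that
 * spans several bytes is never seen as a whole. */
int bytes_equal_nocase(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == *b;   /* equal only if both strings end together */
}
```

    "GTK" and "gtk" compare equal, but the UTF-8 bytes for É (C3 89) and é (C3 A9) do not: the second bytes, 0x89 and 0xA9, are not a case pair in any single-byte sense, so the comparison fails.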

    If the locale's LC_CTYPE is UTF8, you would need to do a multi-byte to wide conversion. Then you can use the corresponding wide string functions like wcsstr() instead of strstr() - or towupper() instead of toupper().

    There doesn't seem to be a wide version of strcasestr().
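
    Rolling one by hand is not much code, though. A minimal sketch, assuming the text has already been converted to wide strings; wcs_find_nocase is a hypothetical name, not a standard or GNU function.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical wide analogue of strcasestr(): find needle in haystack,
 * comparing elements after towlower(). Returns NULL if not found. */
const wchar_t *wcs_find_nocase(const wchar_t *haystack, const wchar_t *needle)
{
    assert(haystack != NULL && needle != NULL);
    if (*needle == L'\0')
        return haystack;           /* empty needle matches at the start */
    for (; *haystack != L'\0'; ++haystack) {
        size_t i = 0;
        while (needle[i] != L'\0' && haystack[i] != L'\0' &&
               towlower((wint_t)haystack[i]) == towlower((wint_t)needle[i]))
            ++i;
        if (needle[i] == L'\0')
            return haystack;       /* whole needle matched here */
    }
    return NULL;
}
```

    The wide strings would typically come from mbstowcs() after setlocale(LC_CTYPE, ""). How towlower() treats non-ASCII letters still depends on the current locale, and a simple one-to-one mapping like this does not cover every case-folding rule in Unicode.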

    If your data is not locale dependent (you know it'll always be UTF8), then you can use the iconv library to convert to wide strings (UTF32 on *nix).

    gg

  12. #12
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    However, toupper/tolower are affected by the LC_CTYPE of the current locale. Which means tolower and toupper only really work for single-byte code pages.
    I have perhaps misunderstood what an E with an accent could be. So those letters would be what -- a single byte outside of the ascii range?

    Do they have upper case letters in Arabic?

  13. #13
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by MK27 View Post
    I have perhaps misunderstood what an E with an accent could be. So those letters would be what -- a single byte outside of the ascii range?

    Do they have upper case letters in Arabic?
    You're getting into hairy territory. An E with an acute accent, for instance, can be encoded in more than one way in Unicode: as a single precomposed code point, or as a plain E followed by a combining accent. These sequences are distinct, even though they look exactly the same. The definition of a character is more than just what it looks like.
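
    A short illustration of that point, using the UTF-8 byte sequences for the two common spellings of É. The constant and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* UTF-8 bytes for precomposed U+00C9 (LATIN CAPITAL LETTER E WITH ACUTE). */
static const char E_ACUTE_PRECOMPOSED[] = "\xC3\x89";
/* UTF-8 bytes for 'E' followed by U+0301 (COMBINING ACUTE ACCENT). */
static const char E_ACUTE_DECOMPOSED[] = "E\xCC\x81";

/* Plain byte comparison: returns 1 only if the two strings are identical. */
int same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

    The two strings render the same but are 2 and 3 bytes long respectively, so same_bytes() reports them as different. Treating them as equal takes Unicode normalization, which the C library does not provide.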

  14. #14
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by MK27 View Post
    I have perhaps misunderstood what an E with an accent could be. So those letters would be what -- a single byte outside of the ascii range?
    There was the other guy today in the forum -- "exception handling" or something like that -- who had exactly that, an E with an accent. In UTF-8, according to the links in that thread, it's two bytes: C3 88.
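
    For the curious, the arithmetic behind those two bytes can be sketched like this. decode_utf8_2 is a hypothetical helper that handles only the two-byte form (110xxxxx 10yyyyyy), not UTF-8 in general.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a two-byte UTF-8 sequence at s into a code point.
 * Returns -1 if s does not start with a valid two-byte sequence. */
long decode_utf8_2(const unsigned char *s)
{
    assert(s != NULL);
    /* Lead byte must be 110xxxxx, continuation byte must be 10yyyyyy. */
    if ((s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
        return -1;
    long cp = ((long)(s[0] & 0x1F) << 6) | (long)(s[1] & 0x3F);
    return cp < 0x80 ? -1 : cp;    /* reject overlong encodings */
}
```

    For C3 88 that is (0x03 << 6) | 0x08 = 0xC8, i.e. U+00C8, E with a grave accent. Neither byte means anything to tolower() on its own.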

  15. #15
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by tabstop View Post
    There was the other guy today in the forum -- "exception handling" or something like that -- who had exactly that, an E with an accent. In UTF-8, according to the links in that thread, it's two bytes: C3 88.
    Thanks. I honestly had not noticed that. I now think my project is haunted though -- yesterday I realized that I was using an array to hold the find coordinates, and that I'd set the array size to 666 arbitrarily, which limited the number of finds. But how's this for an illusory problem: I sat there for twenty minutes after enlarging the array trying to figure out where else that limit was applied, because I kept getting the same result.

    Then I realized that the source code (which I was using as input) really did contain "gtk" (case insensitive) 666 times.
