How is tolower() implemented by the standard library if alphabets are not guaranteed to be continuous?
Here's one implementation from newlib/libc.
Code:
int
_DEFUN(tolower,(c),int c)
{
#if defined (_MB_EXTENDED_CHARSETS_ISO) || defined (_MB_EXTENDED_CHARSETS_WINDOWS)
  if ((unsigned char) c <= 0x7f)
    return isupper (c) ? c - 'A' + 'a' : c;
  else if (c != EOF && MB_CUR_MAX == 1 && isupper (c))
    {
      char s[MB_LEN_MAX] = { c, '\0' };
      wchar_t wc;
      if (mbtowc (&wc, s, 1) >= 0
          && wctomb (s, (wchar_t) towlower ((wint_t) wc)) == 1)
        c = (unsigned char) s[0];
    }
  return c;
#else
  return isupper(c) ? (c) - 'A' + 'a' : c;
#endif
}
Wow that is complicated.. thanks
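You can implement it portably like this, without assuming the letters are contiguous.
Code:
#include <string.h>

int tolower(int ch)
{
    /* The position of a letter in `upper` matches the position of its
       lowercase form in `lower`, so no character arithmetic is needed. */
    const char *upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *lower = "abcdefghijklmnopqrstuvwxyz";
    const char *ptr = strchr(upper, ch);

    if (ptr)
        return lower[ptr - upper];
    else
        return ch;
}
However, the library implementation usually doesn't have to be portable, so it can take advantage of the specific encoding.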
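To make the "take advantage of the specific encoding" point concrete, here's a minimal sketch that assumes an ASCII-compatible character set, where 'A'..'Z' and 'a'..'z' are each contiguous. The name ascii_tolower is made up for this post, it isn't from any library.

```c
/* Hypothetical helper, not a library function.
   Assumes ASCII: 'A'..'Z' and 'a'..'z' are contiguous ranges.
   This would misbehave on EBCDIC, where the letters have gaps. */
int ascii_tolower(int ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return ch - 'A' + 'a';   /* same offset within each range */
    return ch;                   /* everything else is unchanged */
}
```

On an ASCII system ascii_tolower('Q') gives 'q', and non-letters such as '7' or EOF come back unchanged.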
As Malcolm said, an implementation doesn't need to be portable, so it can take advantage of the particular character set.
The newlib implementation isn't as complicated as it seems. If the #else branch is compiled it becomes:
Code:
int tolower (int c)
{
  return isupper(c) ? c - 'A' + 'a' : c;
}
isupper, in the newlib library, is:
Code:
int isupper (int c)
{
  return ((__CTYPE_PTR[c+1] & (_U|_L)) == _U);
}
which uses a table, a common approach for character typing.
Presumably it handles EOF as -1, hence the c + 1.
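Here's a rough sketch of how I'd expect the c + 1 trick to work, assuming EOF is -1 (the standard only requires it to be negative, so this is an assumption). The names my_ctype, my_ctype_init, my_isupper, MY_U and MY_L are invented for the example and are not newlib's.

```c
#include <stdio.h>   /* EOF */

#define MY_U 01      /* upper */
#define MY_L 02      /* lower */

/* 257 slots: index 0 is reserved for EOF (assumed to be -1),
   indices 1..256 hold the flags for character values 0..255.
   Static storage, so every slot starts out as 0. */
static unsigned char my_ctype[1 + 256];

static void my_ctype_init(void)
{
    const char *p;

    /* Walk the letters as strings so no contiguity is assumed. */
    for (p = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; *p; p++)
        my_ctype[(unsigned char)*p + 1] = MY_U;
    for (p = "abcdefghijklmnopqrstuvwxyz"; *p; p++)
        my_ctype[(unsigned char)*p + 1] = MY_L;
}

int my_isupper(int c)
{
    /* c must be EOF or representable as unsigned char, same as
       the requirement on the real isupper. */
    return (my_ctype[c + 1] & (MY_U | MY_L)) == MY_U;
}
```

After calling my_ctype_init(), my_isupper(EOF) reads slot 0, which holds no flags, so EOF is classified as "not upper" without a special case in the lookup.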
Part of the table (there's a lot more detail than this, see /newlib/libc/ctype/ctype_.c ) :
Code:
#define _U  01    // upper
#define _L  02    // lower
#define _N  04    // numeric
#define _S  010   // whitespace
#define _P  020   // punctuation
#define _C  040   // control
#define _X  0100  // hex
#define _B  0200  // blank

#define _CTYPE_DATA_0_127 \
    _C,    _C,    _C,    _C,    _C,    _C,    _C,    _C, \
    _C,    _C|_S, _C|_S, _C|_S, _C|_S, _C|_S, _C,    _C, \
    _C,    _C,    _C,    _C,    _C,    _C,    _C,    _C, \
    _C,    _C,    _C,    _C,    _C,    _C,    _C,    _C, \
    _S|_B, _P,    _P,    _P,    _P,    _P,    _P,    _P, \
    _P,    _P,    _P,    _P,    _P,    _P,    _P,    _P, \
    _N,    _N,    _N,    _N,    _N,    _N,    _N,    _N, \
    _N,    _N,    _P,    _P,    _P,    _P,    _P,    _P, \
    _P,    _U|_X, _U|_X, _U|_X, _U|_X, _U|_X, _U|_X, _U, \
    _U,    _U,    _U,    _U,    _U,    _U,    _U,    _U, \
    _U,    _U,    _U,    _U,    _U,    _U,    _U,    _U, \
    _U,    _U,    _U,    _P,    _P,    _P,    _P,    _P, \
    _P,    _L|_X, _L|_X, _L|_X, _L|_X, _L|_X, _L|_X, _L, \
    _L,    _L,    _L,    _L,    _L,    _L,    _L,    _L, \
    _L,    _L,    _L,    _L,    _L,    _L,    _L,    _L, \
    _L,    _L,    _L,    _P,    _P,    _P,    _P,    _C
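To show how the flag bits in a table like that combine, here's a small sketch. It reuses the bit values from the excerpt above but with an X_ prefix so it can't clash with a real libc, and flags_of, demo_isalpha and demo_isxdigit are hypothetical names covering only a handful of characters. I'm not claiming this is how newlib writes these tests, it's just the general idea.

```c
/* Same bit values as the excerpt, renamed for the example. */
#define X_U 01     /* upper   */
#define X_L 02     /* lower   */
#define X_N 04     /* numeric */
#define X_X 0100   /* hex     */

/* Hypothetical stand-in for the table: flags for a few
   characters only, everything else gets no flags. */
static int flags_of(int c)
{
    switch (c) {
    case 'A': return X_U | X_X;   /* upper-case and a hex digit */
    case 'G': return X_U;         /* upper-case only            */
    case 'f': return X_L | X_X;   /* lower-case and a hex digit */
    case '7': return X_N;         /* decimal digit              */
    default:  return 0;
    }
}

/* A letter is anything flagged upper or lower. */
int demo_isalpha(int c)
{
    return (flags_of(c) & (X_U | X_L)) != 0;
}

/* A hex digit is a decimal digit or a letter flagged as hex. */
int demo_isxdigit(int c)
{
    return (flags_of(c) & (X_X | X_N)) != 0;
}
```

So one byte per character answers several questions at once: 'A' is both alphabetic and a hex digit, 'G' is alphabetic but not hex, and '7' is hex but not alphabetic.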
Last edited by john.c; 10-09-2020 at 05:15 PM.
Thanks a lot Malcolm McLean and john.c! So it's better to use the standard library whenever possible huh?
my 200th post. yey.