Understanding ctype.h

  1. #1
    Registered User
    Join Date
    May 2006
    Posts
    182

    Understanding ctype.h

    I'm having trouble interpreting an inconsistency in the character handling library.

    The prototype for toupper is "int toupper( int c );"
    Why do toupper and tolower return an int and not a char? If I make sPtr and s[ 100 ] of type int, it breaks the program, yet the version below works?

    Code:
    /*
    exercise 8.6
    Write a program that inputs a line of text with function gets into char array
    s[ 100 ].  Output the line in uppercase letters and in lowercase letters.
    */
    
    #include <stdio.h>
    #include <ctype.h>
    
    int main()
    {
      char *sPtr;
      char s[ 100 ] = { 0 };
      sPtr = &s[ 0 ];
    
      gets( s );
    
    
      while( *sPtr != '\0' ) {
        printf( "%c", toupper( *sPtr ) );
        ++sPtr;
      }
    
      printf( "\n" );
    
      sPtr = &s[ 0 ];
     
      while ( *sPtr != '\0' ) {
        printf( "%c", tolower( *sPtr ) );
        ++sPtr;
      }
    
      printf( "\n" );
    
      return 0;
    }

  2. #2
    Captain - Lover of the C
    Join Date
    May 2005
    Posts
    341
    It's for portability. Sometimes characters are 1 bytes. Sometimes they are 2 bytes for extended characters like chinese characters. In the future, it may be even more. So toupper takes an int to make sure that there is no data loss (like if you had to convert a 2 byte character into a 1 byte character).
    Don't quote me on that... ...seriously

  3. #3
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,006
    http://groups.google.com/group/comp....de5e6e1dcf2e4e
    The reason the argument and return types are int is similar to that for getchar() - the function is defined for values representable as unsigned char and also EOF (which has a negative value). Any other value passed to toupper results in undefined behaviour (which is why it is sometimes necessary to cast the argument to ctype.h functions to unsigned char before passing it).
    An example here.

    And don't use gets.
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  4. #4
    Registered User
    Join Date
    May 2006
    Posts
    182
    So if I try to printf("%c", integer) it'll print it assuming the integer corresponds to an ASCII character?

    If that's the case, can I then assume that the problem is doing integer = gets() ? Since gets returns type char.

  5. #5
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,538
    Don't use gets. And no, it isn't a problem, since a char fits easily within an integer.
    And yes to your first question, as well.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  6. #6
    Registered User
    Join Date
    May 2006
    Posts
    182
    This makes sense logically but why doesn't the following modification work( data type changed from char to int )?

    Code:
    /*
    exercise 8.6
    Write a program that inputs a line of text with function gets into char array
    s[ 100 ].  Output the line in uppercase letters and in lowercase letters.
    */
    
    #include <stdio.h>
    #include <ctype.h>
    
    int main()
    {
      int *sPtr;
      int s[ 100 ] = { 0 };
      sPtr = &s[ 0 ];
    
      gets( s );
    
    
      while( *sPtr != '\0' ) {
        printf( "%c", toupper( *sPtr ) );
        ++sPtr;
      }
    
      printf( "\n" );
    
      sPtr = &s[ 0 ];
     
      while ( *sPtr != '\0' ) {
        printf( "%c", tolower( *sPtr ) );
        ++sPtr;
      }
    
      printf( "\n" );
    
      return 0;
    }
    Code:
    root[~]# cc -Wall -pedantic 8.6.c
    8.6.c: In function 'main':
    8.6.c:16: warning: passing argument 1 of 'gets' from incompatible pointer type
    /tmp/ccTPmiTC.o: In function `main':
    8.6.c:(.text+0x42): warning: the `gets' function is dangerous and should not be used.
    root[~]# ./a.out
    abcd
    a
    a
    root[~]#
    I'm guessing it might have something to do with the while condition. An integer can't be equal or not equal to '\0' ?

    I know gets isn't safe to use, but it's part of the exercises in my book. I'm using this as an opportunity to learn about data types and functions.
    Last edited by yougene; 03-05-2008 at 12:58 PM.

  7. #7
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,538
    gets is unsafe and can easily be replaced with fgets:
    fgets(s, sizeof(s), stdin);
    Go ahead. Change it.
    Gets should never have been invented. It should not exist. So consider it doesn't exist and never use it, whatever the reason you have to use it. No excuse.
    And neither gets nor fgets reads an integer. They read a string. That's different.

    If you want an in-depth explanation:
    An array is just one contiguous block of memory. In the case of char, each element is 1 byte.
    fgets, however, just writes directly to that memory, one byte per character. So in the char array, each element holds one character.
    With your int array, you get 4 characters in one element, because int is 4 bytes on your system.

    If you want to test your theory, you can do:
    printf( "%c", toupper( (int)*sPtr ) );
    Or
    int n = (int)*sPtr;
    printf( "%c", toupper(n) );

    This assumes sPtr is a pointer to char and s is an array of char.
    And s is not a good name for a variable either.
    Last edited by Elysia; 03-05-2008 at 01:02 PM.

  8. #8
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,796
    >Sometimes characters are 1 bytes. Sometimes they are
    >2 bytes for extended characters like chinese characters.
    Not really, in more ways than one.

    First, Chinese is a bad example for the two byte explanation because the current Chinese logogram collection is probably pushing 100,000 unique characters. Assuming a byte is the smallest addressable unit, you need at least three bytes to cover the most recent dictionary, which still isn't complete.

    Second, "char" in C is synonymous with "byte". That's not going to change, and that's why the framework around wchar_t exists to support larger character sets than a byte can handle.

    The real reason for why toupper and friends work with int instead of char is consistency throughout the standard library. All of the ctype functions are defined in terms of input based on fgetc. fgetc allows the full range of unsigned char and EOF, which is a negative value. For toupper to support the range of values returned by fgetc, it has to use a signed data type larger than unsigned char.

    >why doesn't the following modification work( data type changed from char to int )?
    Put simply, int is used in the ctype library to support an error flag of EOF. It doesn't mean you can suddenly use int to process characters unless you add the requisite logic to treat single integers as single characters, which isn't the default behavior. For example, this works because I changed gets to a function that expects a pointer to int:
    Code:
    /*
    exercise 8.6
    Write a program that inputs a line of text with function gets into char array
    s[ 100 ].  Output the line in uppercase letters and in lowercase letters.
    */
    
    #include <stdio.h>
    #include <ctype.h>
    
    void jsw_gets ( int *s )
    {
      int ch;
    
      while ( ( ch = getchar() ) != EOF && ch != '\n' )
        *s++ = ch;
    
      *s = '\0';
    }
    
    int main()
    {
      int *sPtr;
      int s[ 100 ] = { 0 };
      sPtr = &s[ 0 ];
    
      jsw_gets( s );
    
      while( *sPtr != '\0' ) {
        printf( "%c", toupper( *sPtr ) );
        ++sPtr;
      }
    
      printf( "\n" );
    
      sPtr = &s[ 0 ];
     
      while ( *sPtr != '\0' ) {
        printf( "%c", tolower( *sPtr ) );
        ++sPtr;
      }
    
      printf( "\n" );
    
      return 0;
    }
    My best code is written with the delete key.

  9. #9
    Registered User
    Join Date
    May 2006
    Posts
    182
    int is 4 bytes and char is 1 byte. I pass a char value to toupper and toupper returns an int value. The int value is printed as a char by %c, so what happens to the other 3 bytes? Would I be able to print 4 characters by using %s to display an int value?

    I'm going to have to switch over to fgets but one thing at a time. Apparently I'm doing a lot wrong, I got my data types all mixed up, my naming scheme sucks, and I'm using the wrong functions!

  10. #10
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,538
    All integer arguments narrower than int are promoted to int (4 bytes on your system) when passed to printf.
    (And floats are promoted to doubles.)
    So when you print with %c, printf actually reads a full int, converts it to a single character, then displays it.
    You could point %s at the bytes of an int, yes, but that is not recommended, due to how strings work: %s expects a pointer to char and keeps reading until it finds a '\0'.

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by yougene View Post
    int is 4 bytes and char is 1 byte. I pass a char value to toupper and toupper returns an int value. The int value is printed as a char by %c, so what happens to the other 3 bytes? Would I be able to print 4 characters by using %s to display an int value?

    I'm going to have to switch over to fgets but one thing at a time. Apparently I'm doing a lot wrong, I got my data types all mixed up, my naming scheme sucks, and I'm using the wrong functions!
    A char is, as you say, 1 byte. The compiler can convert a char into an int (it does that by filling the remaining 3 bytes with the upper-most bit of the char [1] - many processors even have a specific instruction for this particular purpose).

    So when you pass a char to toupper() it is automatically "extended" to an int.

    The opposite isn't true, because the compiler won't know that your int only contains chars (you can ride a bicycle through a wide lane on a road, but a bus won't fit on a cycle lane, so the compiler will say "stuff an integer array into a char - no way, it doesn't fit there"). This is why it says "passing argument 1 of 'gets' from incompatible pointer type".

    When it comes to printf(), it has special rules: Since the compiler doesn't understand how the arguments to printf are going to be treated, it passes ALL small integer values (shorter than long) as int, and all float values as double. This is part of the C language standard. So your char value will be translated into an int anyways.

    [1] Unsigned char will be converted by filling the top 3 bytes with zero.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    Registered User
    Join Date
    May 2006
    Posts
    182
    Ok I think this is coming together now. The reason the int modification didn't work was due to the way gets enters information into memory. It just takes the starting point of memory and directly enters information regardless of what data type I'm using. The reason the program only gave back a instead of abc is because the rest of the string( "bc" ) was still located in the first element of the array. If I would print that first element using %s I would probably get the rest of the strings( although not recommended ).

  13. #13
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,538
    Quote Originally Posted by yougene View Post
    If I would print that first element using %s I would probably get the rest of the strings( although not recommended ).
    Then you would print the entire string.
    And don't use gets.

  14. #14
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by yougene View Post
    Ok I think this is coming together now. The reason the int modification didn't work was due to the way gets enters information into memory. It just takes the starting point of memory and directly enters information regardless of what data type I'm using. The reason the program only gave back a instead of abc is because the rest of the string( "bc" ) was still located in the first element of the array. If I would print that first element using %s I would probably get the rest of the strings( although not recommended ).
    It will actually work just fine in practice to print with printf("%s", s), because gets() and printf() will both interpret the string as a char array. The fact that you declared it as an int array only comes into effect if you try to step through it; then you will get 'abcd' (0x64636261) as the integer value [1] of the first integer, because gets() has stuffed four bytes into the first four bytes of your array.



    [1] or 0x61626364, depending on the "endianness" of the machine.

    --
    Mats

  15. #15
    Registered User
    Join Date
    May 2006
    Posts
    182
    Running the program using more letters clearly shows that you are right. But now I don't understand why none of the values were converted to uppercase. I have a feeling this is a more complex issue though.
    Code:
    root[~]# ./a.out
    abcdefghijklmnopqrstuvwxyz
    aeimquy
    aeimquy
    root[~]#

    When it comes to printf(), it has special rules: Since the compiler doesn't understand how the arguments to printf are going to be treated, it passes ALL small integer values (shorter than long) as int, and all float values as double. This is part of the C language standard. So your char value will be translated into an int anyways.
    Are you saying that if I do printf("%c", someChar), printf is going to convert someChar into an integer before it works its magic?
