Thread: char array in UTF-8 ?

  1. #1
    Registered User
    Join Date
    Oct 2009
    Posts
    8

    char array in UTF-8 ?

    I'm using gcc (c99) in Debian-linux.

    Can somebody clarify why this works ? I thought a char array could only hold ASCII chars.

    Since it works, what is wchar_t and unsigned char used for ?

    Code:
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        setlocale(LC_ALL, "en_US.UTF-8");

        char A[16] = "Schöne Grüße";

        printf("%s\n%zu\n", A, strlen(A)); // it correctly prints the UTF-8 chars and strlen outputs 15.

        return 0;
    }

  2. #2
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    A char is a storage unit with exactly 1 byte. That's all.
    ASCII characters take up only 1 byte, so it fits perfectly into an array of chars.
    UTF-8 characters also take up only one byte each, making it fit perfectly into a char-array too. Beware, however, that not all bytes may be actual characters. Some characters may take up more than one byte, so strlen is not defined for UTF-8.
    wchar_t is used for UTF-16. And unsigned char is the explicit unsigned variant of char. Nothing more. Usually made to work as a buffer for raw data.

    If you would like to learn more, I suggest you study Unicode. You will need a proper library to handle them correctly.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  3. #3
    Registered User
    Join Date
    Dec 2008
    Location
    Black River
    Posts
    128
    A char array can contain any value representable by a char. In your case, the output is correct because the terminal it's being displayed on interprets the bytes as UTF-8 and its font supports those glyphs. strlen, on the other hand, simply counts the number of bytes in a string. The compiler / text editor is likely storing "Schöne Grüße" as the corresponding UTF-8 sequence. If you changed the default encoding, I imagine you'd get different results.

    Regarding wchar_t, it's a type that is mostly used in system-specific code. On Windows, it holds UTF-16 code units; on Unix it usually holds UCS-4 (UTF-32), and on other platforms the encoding may vary with the locale.

  4. #4
    Registered User
    Join Date
    Oct 2009
    Posts
    8
    Ok, so is substituting a UTF-8 char with an HTML decimal entity correct ?

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Assumed to be a file-scope buffer; its declaration is not shown in this post. */
    static char *conversionpointer = NULL;

    char *chartodecimal(const char *string) // replaces all occurrences of CHARs with DECIMALs.
    {
        conversionpointer = (char *)realloc(conversionpointer, strlen(string) + 1);
        strcpy(conversionpointer, string);

        char original[14][7] = {"\u00E7"/*ç*/, "\x22"/*"*/, "\u00F1"/*ñ*/, "\u00E4"/*ä*/, "\u00E9"/*é*/, "\u00EB"/*ë*/, "\u00FC"/*ü*/, "\u00E3"/*ã*/, "\u00BA"/*º*/, "\u00AA"/*ª*/, "\u00E1"/*á*/, "\u00F3"/*ó*/, "\u00F8"/*ø*/, "\u00DF"/*ß*/};
        char replacement[14][7] = {"&#231;"/*ç*/, "&#34;"/*"*/, "&#241;"/*ñ*/, "&#228;"/*ä*/, "&#233;"/*é*/, "&#235;"/*ë*/, "&#252;"/*ü*/, "&#227;"/*ã*/, "&#186;"/*º*/, "&#170;"/*ª*/, "&#225;"/*á*/, "&#243;"/*ó*/, "&#248;"/*ø*/, "&#223;"/*ß*/};

        char *conversionpointeroriginalp = NULL;
        char *buffer = NULL;
        int count = 0;

        while (count < 14)
        {
            conversionpointeroriginalp = strstr(conversionpointer, original[count]);

            while (conversionpointeroriginalp != NULL)
            {
                // byte offset of the match, kept because realloc may move the block
                size_t offset = (size_t)(conversionpointeroriginalp - conversionpointer);

                buffer = (char *)realloc(buffer, strlen(conversionpointer) + 1 + (strlen(replacement[count]) - strlen(original[count])));
                strncpy(buffer, conversionpointer, offset);
                sprintf(buffer + offset, "%s%s", replacement[count], conversionpointeroriginalp + strlen(original[count]));

                conversionpointer = (char *)realloc(conversionpointer, strlen(buffer) + 1);
                strcpy(conversionpointer, buffer);

                // resume the search in the new block, just past the entity that was inserted
                conversionpointeroriginalp = strstr(conversionpointer + offset + strlen(replacement[count]), original[count]);
            }

            count++;
        }

        if (buffer != NULL)
        {
            free(buffer);
        }

        return conversionpointer;
    }

  5. #5
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Err? HTML is not a codepage. HTML uses a codepage.

  6. #6
    Registered User
    Join Date
    Oct 2009
    Posts
    8
    Quote Originally Posted by Elysia View Post
    Err? HTML is not a codepage. HTML uses a codepage.
    That function attempts to replace UTF-8 chars with html-decimal-entity strings (which are essentially ASCII strings).

    So the question is:
    Obviously, an ASCII char can be replaced by an ASCII string, but can a UTF-8 char be replaced by an ASCII string like this function attempts to do ?

  7. #7
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    No. UTF-8 is variable length and can represent far more characters than ASCII.
    It may work, but don't rely on it.

  8. #8
    Registered User
    Join Date
    Oct 2009
    Posts
    8
    Quote Originally Posted by Elysia View Post
    No. UTF-8 is variable length and can represent far more characters than ASCII.
    It may work, but don't rely on it.
    Ok, so what method would you suggest me to use ? I wish to code my own method rather than use a library.

  9. #9
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Quote Originally Posted by Elysia View Post
    ...
    UTF-8 characters also take up only one byte each, making it fit perfectly into a char-array too. Beware, however, that not all bytes may be actual characters. Some characters may take up more than one byte, so strlen is not defined for UTF-8.
    The first part of this statement is not correct. Some UTF-8 characters only take up one byte. Those characters above 0x7F, take up 2 bytes. Others 3, while still others, 4 bytes.

    UTF-8 - Wikipedia, the free encyclopedia
    Mainframe assembler programmer by trade. C coder when I can.

  10. #10
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    That is exactly what I mentioned.
    Quote Originally Posted by pulio View Post
    Ok, so what method would you suggest me to use ? I wish to code my own method rather than use a library.
    If you don't want to use a library, then you will simply have to learn how UTF-8 works and create your own code for handling it.

  11. #11
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Quote Originally Posted by Dino View Post
    The first part of this statement is not correct. Some UTF-8 characters only take up one byte. Those characters above 0x7F, take up 2 bytes. Others 3, while still others, 4 bytes.

    UTF-8 - Wikipedia, the free encyclopedia
    The two of you are referring to the same thing, just from different perspectives. Elysia meant that UTF-8 is built from 1-byte units, as he states that one character may take more than one of these units.
    Though I don't agree that strlen is not defined for UTF-8. It just won't do exactly what you might expect: it doesn't return the number of characters in the string, but the number of bytes. This can be very useful in some cases, however. For instance, you can write a UTF-8 string to a file as if it were regular ASCII by writing strlen(str) bytes.
    This is because UTF-8, like ASCII, never uses a 0-byte to encode anything other than the NUL character itself. So the 0-byte that terminates ASCII strings also always terminates UTF-8 strings.

  12. #12
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    That is, strlen is still undefined for UTF-8, because the standard does not say it returns the length of a UTF-8 string, does it?
    However, its behavior is properly defined because one char equals exactly one character in ASCII, hence the number of characters == number of bytes - 1 for ASCII strings. You can abuse this behavior to find out the number of bytes in a string, but strictly speaking, the behavior is still undefined or won't work properly on UTF-8 for its intended purpose: to return the string length.

  13. #13
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Obligatory links:
    Variable-width encoding - Wikipedia, the free encyclopedia
    The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software

    The main thing you need to know is that a character in UTF-8 is represented by one or more bytes. The other is that it is so complex that you cannot realistically expect to deal with it without using other libraries.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
