char array in UTF-8 ?

**pulio** · 02-11-2010

I'm using gcc (c99) in Debian-linux.

Can somebody clarify why this works ? I thought a char array could only hold ASCII chars.

Since it works, what is wchar_t and unsigned char used for ?

Code:

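#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "en_US.UTF-8");

    char A[16] = "Schöne Grüße"; // 15 bytes of UTF-8 plus the terminating '\0'

    printf("%s\n%zu\n", A, strlen(A)); // it correctly prints the UTF-8 chars and strlen outputs 15.

    return 0;
}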

**Elysia** · 02-11-2010

A char is a storage unit with exactly 1 byte. That's all.
ASCII characters take up only 1 byte, so it fits perfectly into an array of chars.
UTF-8 characters also take up only one byte each, making it fit perfectly into a char-array too. Beware, however, that not all bytes may be actual characters. Some characters may take up more than one byte, so strlen is not defined for UTF-8.
wchar_t is used for UTF-16. And unsigned char is the explicit unsigned variant of char. Nothing more. Usually made to work as a buffer for raw data.

If you would like to learn more, I suggest you study Unicode. You will need a proper library to handle them correctly.

**Ronix** · 02-11-2010

A char array can contain any value representable by a char. In your case, the output is correct because the font of the terminal that it's being displayed on supports those glyphs. strlen, on the other hand, simply computes the number of bytes in a string. The compiler / text editor is likely translating "Schöne Grüße" into the corresponding UTF-8 sequence. If you changed the default encoding, I imagine you'd get different results.

Regarding wchar_t, it's a type that is mostly used in system-specific code. On Windows, it represents UTF-16; on Unix it's UCS-4, and on other platforms it may be encoded differently on a per-locale basis.

**pulio** · 02-11-2010

Ok, so substituting a UTF-8 char with a html-decimal is correct ?

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *conversionpointer = NULL; // result buffer, reused across calls

char *chartodecimal(const char *string) // replaces all occurrences of CHARs to DECIMALs.
{
    conversionpointer = (char *)realloc(conversionpointer, strlen(string) + 1);

    strcpy(conversionpointer, string);

    // Rows are 8 bytes wide: the longest entry is 7 chars and still needs its terminating '\0'.
    char original[14][8] = {   "\u00E7"/*ç*/, "\x22"/*"*/,"\u00F1"/*ñ*/,"\u00E4"/*ä*/,"\u00E9"/*é*/,"\u00EB"/*ë*/,"\u00FC"/*ü*/,"\u00E3"/*ã*/,"\u00BA"/*º*/,"\u00AA"/*ª*/,"\u00E1"/*á*/,"\u00F3"/*ó*/,"\u00F8"/*ø*/,"\u00DF"/*ß*/};
    char replacement[14][8] = {"&#231 ;"/*ç*/,"&#34 ;"/*"*/,"&#241 ;"/*ñ*/,"&#228 ;"/*ä*/,"&#233 ;"/*é*/,"&#235 ;"/*ë*/,"&#252 ;"/*ü*/,"&#227 ;"/*ã*/,"&#186 ;"/*º*/,"&#170 ;"/*ª*/,"&#225 ;"/*á*/,"&#243 ;"/*ó*/,"&#248 ;"/*ø*/,"&#223 ;"/*ß*/}; // I placed a space before each ; so the browser could show them

    char *conversionpointeroriginalp = NULL;
    char *buffer = NULL;
    int count = 0;

    while(count < 14)
    {
        conversionpointeroriginalp = strstr(conversionpointer, original[count]);

        while(conversionpointeroriginalp != NULL)
        {
            // Offset of the match, kept because realloc below may move conversionpointer.
            size_t offset = (size_t)(conversionpointeroriginalp - conversionpointer);

            buffer = (char *)realloc(buffer, strlen(conversionpointer) + 1 + (strlen(replacement[count]) - strlen(original[count])));
            strncpy(buffer, conversionpointer, offset);

            sprintf(buffer + offset, "%s%s", replacement[count], conversionpointeroriginalp + strlen(original[count]));

            conversionpointer = (char *)realloc(conversionpointer, strlen(buffer) + 1);

            strcpy(conversionpointer, buffer);

            // Resume the search in the new buffer, just past the entity that was inserted.
            conversionpointeroriginalp = strstr(conversionpointer + offset + strlen(replacement[count]), original[count]);
        }

        count++;
    }

    free(buffer); // free(NULL) is a no-op, so no check is needed

    return conversionpointer;
}

**Elysia** · 02-11-2010

Err? HTML is not a codepage. HTML uses a codepage.

**pulio** · 02-11-2010

Originally Posted by Elysia

Err? HTML is not a codepage. HTML uses a codepage.

That function attempts to replace UTF-8 chars with html-decimal-entity strings (which are essentially ASCII strings).

So the question is:
Obviously, an ASCII char can be replaced by an ASCII string, but can a UTF-8 char be replaced by an ASCII string like this function attempts to do ?

**Elysia** · 02-11-2010

No. UTF-8 is variable length and can represent far more characters than ASCII.
It may work, but don't rely on it.

**pulio** · 02-11-2010

Originally Posted by Elysia

No. UTF-8 is variable length and can represent far more characters than ASCII.
It may work, but don't rely on it.

Ok, so what method would you suggest me to use ? I wish to code my own method rather than use a library.

**Dino** · 02-11-2010

Originally Posted by Elysia

...
UTF-8 characters also take up only one byte each, making it fit perfectly into a char-array too. Beware, however, that not all bytes may be actual characters. Some characters may take up more than one byte, so strlen is not defined for UTF-8.

The first part of this statement is not correct. Some UTF-8 characters only take up one byte. Those characters above 0x7F, take up 2 bytes. Others 3, while still others, 4 bytes.

UTF-8 - Wikipedia, the free encyclopedia
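
To make the byte counts concrete: in UTF-8 the first byte of each character announces how long the sequence is. Code points up to U+007F use one byte, U+0080 to U+07FF use two, U+0800 to U+FFFF use three, and everything above that uses four. The sketch below reads that length off the lead byte; `utf8_seq_len` is a made-up name for illustration, not a standard function.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many bytes the UTF-8 sequence starting
   with `lead` occupies (1 to 4), or 0 if `lead` cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)                  return 1; /* 0xxxxxxx: U+0000..U+007F    */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx: U+0080..U+07FF    */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx: U+0800..U+FFFF    */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx: U+10000..U+10FFFF */
    return 0; /* continuation byte (0x80..0xBF) or a byte that is never a valid lead */
}
```

For the string in the first post, the lead byte of ö, ü and ß is 0xC3 in each case, so each of them is a two-byte sequence.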

**Elysia** · 02-11-2010

That is exactly what I mentioned.

Originally Posted by pulio

Ok, so what method would you suggest me to use ? I wish to code my own method rather than use a library.

If you don't want to use a library, then you will simply have to learn how UTF-8 works and create your own code for handling it.
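
As a rough idea of what "your own code" could look like for the HTML case: instead of a fixed table of 14 characters, decode each multi-byte sequence to its code point and print that number as a decimal entity. The function below is only a sketch under some assumptions: the name `utf8_to_entities` is invented here, it leaves ASCII bytes (including the double quote) untouched, and it checks the byte structure but does not reject overlong encodings or surrogates.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts every non-ASCII character of a UTF-8 string
   into an HTML decimal entity. Returns a malloc'd string that the caller
   must free, or NULL on allocation failure or malformed input. */
char *utf8_to_entities(const char *in)
{
    /* An n-byte sequence expands to at most 10 chars ("&#2097151;" for n = 4),
       so 4 output bytes per input byte is always enough. */
    char *out = malloc(strlen(in) * 4 + 1);
    if (out == NULL)
        return NULL;

    const unsigned char *p = (const unsigned char *)in;
    char *q = out;

    while (*p) {
        unsigned long cp;
        int extra; /* number of continuation bytes still expected */

        if (*p < 0x80) {                 /* plain ASCII: copy as-is */
            *q++ = (char)*p++;
            continue;
        } else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0)   { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0)   { cp = *p & 0x07; extra = 3; }
        else { free(out); return NULL; } /* stray continuation or invalid lead */
        p++;

        while (extra-- > 0) {
            /* Each following byte must look like 10xxxxxx; this also
               catches a string that ends in the middle of a sequence. */
            if ((*p & 0xC0) != 0x80) { free(out); return NULL; }
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        q += sprintf(q, "&#%lu;", cp);
    }
    *q = '\0';
    return out;
}
```

Called with the UTF-8 bytes of "Grüße", this should return `Gr&#252;&#223;e`, since ü is U+00FC (252) and ß is U+00DF (223).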

**EVOEx** · 02-11-2010

Originally Posted by Dino

The first part of this statement is not correct. Some UTF-8 characters only take up one byte. Those characters above 0x7F, take up 2 bytes. Others 3, while still others, 4 bytes.

UTF-8 - Wikipedia, the free encyclopedia

The two of you are referring to the same thing, rather from different perspectives. Elysia meant that UTF-8 is based on 1-byte arrays, as he states that one character may take more than one of these arrays.
Though I don't agree strlen is not defined for UTF-8. It just won't do exactly what you might expect: it doesn't return the number of characters in the string, but it will return the number of bytes in the string. This can be very useful, however, in some cases. For instance, you can write a UTF-8 string to a file and treat it like regular ASCII, and write strlen(str) bytes.
This is because UTF-8, like ASCII, is guaranteed not to contain 0-bytes. So the 0-byte that is used to terminate ASCII strings also always terminates UTF-8 strings.
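
If the number of characters is what you want rather than the number of bytes, one common trick is to count every byte that is not a continuation byte (continuation bytes always have the bit pattern 10xxxxxx). A minimal sketch, assuming the input is already valid UTF-8; `utf8_count` is a hypothetical name, not part of the C library:

```c
#include <stddef.h>
#include <string.h> /* strlen, only needed for the comparison described below */

/* Hypothetical helper: counts code points in a valid UTF-8 string by
   skipping continuation bytes (those of the form 10xxxxxx). */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the UTF-8 bytes of "Schöne Grüße", strlen gives 15 while utf8_count gives 12. Note that this counts code points, which is still not always the same thing as what a user perceives as one character (combining accents are separate code points).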

**Elysia** · 02-11-2010

That is, strlen is still undefined for UTF-8, because the standard does not say it returns the length of a UTF-8 string, does it?
However, its behavior is properly defined because one char equals exactly one character in ASCII, hence the number of characters == number of bytes - 1 for ASCII strings. You can abuse this behavior to find out the number of bytes in a string, but strictly speaking, the behavior is still undefined or won't work properly on UTF-8 for its intended purpose: to return the string length.

**iMalc** · 02-12-2010

Obligatory links:
Variable-width encoding - Wikipedia, the free encyclopedia
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software

The main thing you need to know is that a character in UTF-8 is represented by one or more bytes. The other is that it is so complex that you cannot realistically expect to deal with it without using other libraries.
