UTF-8 characters

**guraknugen** · 02-21-2015

I'm trying to understand how to write UTF-8 compatible C programs. I have seen that there are tons of text about it out there, maybe a little too much. It's hard to find exactly what I need to know for specific tasks…

One important thing, I guess, is to be able to manage UTF-8 characters character by character, that is 'a' is one character and '𝄞' is another character, even if '𝄞' needs more bytes.

So, just to start exploring this, I wrote the following test, which ”translates” the user input to hex values. This example does however not use proper UTF-8 code, so the output is not correct with characters like '𝄞'. What I want to know is how to correct this. Here's my example code, CharTest.c:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>


int main(int argc, char *argv[])
{
    if(argc!=2) {
        fputs("Exactly one argument was expected.\n", stderr);
        return 1;
    }
    
    // Great, we have exactly one parameter, let's start to work with it.
    // First, make sure we use the current locale:
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale!\n"
            "Check LANG, LC_CTYPE and LC_ALL.\n");
        return 1;
    }

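    // Now, let's work.
    // Viewing the bytes as unsigned char keeps values above 0x7F positive (F0, not FFFFFFF0).
    const unsigned char *UserInput = (const unsigned char *)argv[1];
    size_t len = strlen(argv[1]);

    printf("You wrote ”%s”!\n", argv[1]);
    for (size_t i = 0; i < len; i++) {
        printf("%c\t%X\n", UserInput[i], UserInput[i]);
    }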
    return 0;
}

Compile:

Code:

gcc -std=gnu99 CharTest.c -o CharTest

Run:

Code:

./CharTest a𝄞b
You wrote ”a𝄞b”!
a    61
�    F0
�    9D
�    84
�    9E
b    62

Obviously I don't fully understand the mechanism behind this, because '𝄞' is 1D11E, but here it looks like F09D849E.
How can I correct the program to get the following output?

Code:

./CharTest a𝄞b
You wrote ”a𝄞b”!
a    61
𝄞    1D11E
b    62

Obviously I need the for loop to increment the string by one UTF-8 character rather than by one ”char”, but how? And how to get the UTF-8 code properly?
And also, strlen() will not output the correct number of UTF-8 characters, but what will? Are there some useful libraries around that I obviously don't know about…? I don't want to reinvent any wheels…

And just in case someone wonders: Yes, I have 𝄞 on my keyboard (AltGr+Shift+g, using my own keyboard layout)… If not, Ctrl+Shift+u <release> 1d11e <space> would do the trick (in Linux, otherwise I don't know).

**guraknugen** · 02-21-2015

Funny. I have tried to find information about this for ∞ (almost…) and now when I finally ask for it on a forum I find the answer by just ”googling” for a few minutes…

I found this working example, that does something similar, let's call it unicodetest.c:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unicode/utf.h>
#include <unicode/ustring.h>


int main(int argc, char **argv)
{
    const char s[] = "日本語";


    UChar32 c;
    int32_t k;
    int32_t len = strlen(s);


    for (k = 0; k < len;) {
        U8_NEXT(s, k, len, c);
        printf("%d - %x\n", k, c);
    }
    return 0;
}

To make this actually work, the packages libicu-dev and icu-devtools need to be installed (on Ubuntu 14.04). This command installs both of them:

Code:

sudo apt-get install libicu-dev

One way to compile the file:

Code:

gcc unicodetest.c -o unicodetest $(icu-config --ldflags --ldflags-icuio)

Run:

Code:

./unicodetest
3 - 65e5
6 - 672c
9 - 8a9e

Trying to modify my own program:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <stdint.h>
#include <unicode/utf.h>
#include <unicode/ustring.h>


int main(int argc, char *argv[])
{
    if(argc!=2) {
        fputs("Exactly one argument was expected.\n", stderr);
        return 1;
    }
    
    char *UserInput;
    UserInput=argv[1];

    printf("You wrote ”%s”!\n", UserInput);

    UChar32 c;
    int32_t len=strlen(UserInput);

    for(int32_t i=0; i<len;) {
        U8_NEXT(UserInput, i, len, c);
        printf("%d\t%X\n", i, c);
    }
    return 0;
}

Compile:

Code:

gcc CharTest.c -std=gnu99 -o CharTest $(icu-config --ldflags --ldflags-icuio)

Run:

Code:

./CharTest a𝄞b
You wrote ”a𝄞b”!
1    61
5    1D11E
6    62

The remaining question is how to print the actual character with printf. I guess I'll study this some more, but help is appreciated.

**Nominal Animal** · 02-21-2015

Here's a pure C99 implementation that requires no extra libraries, to explore the concepts of "multibyte strings" and "wide character strings":

Code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(int argc, char *argv[])
{
    wchar_t *wide_str = NULL;
    size_t   wide_max = 0;
    size_t   i, len;
    int      arg;

    /* Use user-specified locale settings. */
    setlocale(LC_ALL, "");

    /* Set standard output to wide character mode. */
    fwide(stdout, 1);

    /* Show help if invalid command-line parameters. */
    if (argc < 2 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s STRING ...\n", argv[0]);
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    for (arg = 1; arg < argc; arg++) {

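        /* Length of the multibyte string; (size_t)-1 means it is not valid in this locale. */
        len = mbstowcs(NULL, argv[arg], 0);
        if (len == (size_t)-1) {
            fprintf(stderr, "%s: Invalid multibyte string.\n", argv[arg]);
            return EXIT_FAILURE;
        }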

        /* Resize the wide string to large enough. */
        if (len >= wide_max) {
            size_t   temp_max = len + 1;
            wchar_t *temp_str;

            temp_str = realloc(wide_str, temp_max * sizeof wide_str[0]);
            if (temp_str == NULL) {
                fprintf(stderr, "Out of memory.\n");
                return EXIT_FAILURE;
            }

            wide_str = temp_str;
            wide_max = temp_max;
        }

        /* Convert the multibyte string to wide string (len + 1 leaves room for the terminator). */
        len = mbstowcs(wide_str, argv[arg], len + 1);

        /* Output the initial string, */
        wprintf(L"Input string '%s':\n", argv[arg]);

        /* and each of the characters. */
        for (i = 0; i < len; i++)
            wprintf(L"\t'%lc' = U+%04x = %u\n", wide_str[i], (unsigned int)wide_str[i], (unsigned int)wide_str[i]);        
    }

    return EXIT_SUCCESS;
}

The program does assume you are using UTF-8 (or some other Unicode encoding). If you configure your locale to use something else, then the code points are according to that locale, not Unicode -- but you should use UTF-8 anyway.

To compile and run the above example.c, I recommend using

Code:

gcc -Wall -Wextra -std=c99 example.c -o example
./example 'a𝄞b'

and it should output

Code:

Input string 'a𝄞b':
	'a' = U+0061 = 97
	'𝄞' = U+1d11e = 119070
	'b' = U+0062 = 98

For conversion of multibyte strings between character encodings, you should use the POSIX.1-2001 iconv() facilities. (Add #define _POSIX_C_SOURCE 200809L before any includes in your program, to avail yourself of all POSIX facilities up to POSIX.1-2008.)

You can use nl_langinfo(CODESET) (from langinfo.h) to obtain the character encoding. If you want, I can write a simple POSIX.1-2001 example using iconv and nl_langinfo, on how to properly handle all locales and character sets, while internally storing everything as UTF-8. (Nowadays I don't bother with that for my own programs, unless dealing with legacy data, since UTF-8 everywhere works without any hassles.)

If your program assumes all input strings are multibyte/narrow UTF-8 strings, and all output is multibyte/narrow UTF-8 -- that both input and output is always UTF-8 --, you do not need to do anything.

Basically, you treat all strings as opaque byte sequences terminated with zero. This is exactly how the Linux kernel handles file names et cetera: it does not care what the other bytes are, as long as there is a suitable number of them. (Aside from zero, the other "specials" are slash "/" (47) and dot "." (46).) (Pseudofilesystem interfaces like proc and sys filesystems are a bit more picky about what they accept, but the basic idea is still the same.)

The fact that some glyphs take more than one byte to print out correctly, does not really matter that much in practice.

If you need character-by-character (or more properly, glyph-by-glyph) output, use the wide facilities. If you want to control the terminal window, use wide curses -- ncursesw: see my recent example here.

Questions? Comments?

**Codeplug** · 02-21-2015

>> but here it looks like F09D849E.
Here is one of the codepoint reference sites I like to use: Unicode Character 'MUSICAL SYMBOL G CLEF' (U+1D11E)

You should also be aware of some things relating to using extended characters directly in your source (lots of caveats on Windows if you're interested): Non-English characters with cout

gg

**WoodSTokk** · 02-21-2015

Originally Posted by guraknugen

Code:

gcc -std=gnu99 CharTest.c -o CharTest

Run:

Code:

./CharTest a𝄞b
You wrote ”a𝄞b”!
a    61
�    F0
�    9D
�    84
�    9E
b    62

Obviously I don't fully understand the mechanism behind this, because '𝄞' is 1D11E, but here it looks like F09D849E.

This comes from the UTF-8 encoding.
Your input stream is byte oriented.
So if a character in a byte-oriented stream needs more than one byte, it must be encoded. UTF-8 uses the most significant bit as a flag.
This means that all bytes with a value from 0 (0000 0000) up to 127 (0111 1111) are treated as normal ASCII codes.
If a byte has the most significant bit set, it is part of a multibyte character. The first byte of a multibyte character also works as a counter for how many bytes the character has.

If we have one byte, it is a single character (0-127).
If a character needs 2 bytes, the UTF-8 encoding looks like this:
110x xxxx 10xx xxxx
Counting from the most significant bit down, the first byte has 2 bits set before the first 0.
This means the multibyte character is stored in 2 bytes, beginning with this byte. The following byte starts with the bits 10.
This marks it as a continuation byte of a multibyte character.
All bits shown as 'x' carry the code point of the character.
The coding range is from 128 [00 00 80] to 2047 [00 07 FF].

If a character needs 3 bytes, the UTF-8 encoding looks like this:
1110 xxxx 10xx xxxx 10xx xxxx
The coding range is from 2048 [00 08 00] to 65535 [00 FF FF].

And finally the character that needs 4 bytes:
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
The coding range is from 65536 [01 00 00] to 2097151 [1F FF FF], but Unicode only goes up to [10 FF FF].

Now, your glyph has the value [01 D1 1E], which as 21 bits is:
0 0001 1101 0001 0001 1110
Split into groups of 3 + 6 + 6 + 6 bits to fill the 'x' positions, that is:
000 011101 000100 011110

In UTF-8 encoding we need 4 bytes [F0 9D 84 9E]:
1111 0000 1001 1101 1000 0100 1001 1110

Voila, this is why you see these hex digits.

**guraknugen** · 02-21-2015

Originally Posted by WoodSTokk

This comes from UTF-8 coding.

That was actually one part of the whole thing that I didn't realise…! Thanks!

I'll reply to the other replies tomorrow, need some sleep now…

**WoodSTokk** · 02-21-2015

Now for the second problem. Nominal Animal has already mentioned that if you only read and write the multibyte string, you can use the str functions and the char type.
I also work with multibyte strings (MBS), and I always read them in as wide character strings (WCS).
For WCS there are the same functions as for byte strings, with slightly different names (for example 'strlen' -> 'wcslen').
Read this code and test it:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>


int main(int argc, char *argv[])
{
    if(argc != 2) {
        fwprintf(stderr, L"Exactly one argument was expected.\n");
        return 1;
    }


    // Great, we have exactly one parameter, let's start to work with it.
    // First, make sure we use the current locale:
    if (!setlocale(LC_CTYPE, "")) {
        fwprintf(stderr, L"Can't set the specified locale!\n"
            L"Check LANG, LC_CTYPE and LC_ALL.\n");
        return 1;
    }


    // Now, let's work.
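    wchar_t UserInput[1024];
    // swprintf returns a negative value if the conversion fails or the buffer is too small.
    if (swprintf(UserInput, 1024, L"%s", argv[1]) < 0) {
        fwprintf(stderr, L"Can't convert the argument to a wide string.\n");
        return 1;
    }


    wprintf(L"You wrote ”%ls”!\n", UserInput);
    size_t len = wcslen(UserInput);
    for (size_t i = 0; i < len; i++) {
        wprintf(L"%lc\t%.6X\n", UserInput[i], (unsigned int)UserInput[i]);
    }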
    return 0;
}

Important: a stream (including stdin, stdout and stderr) has no orientation as long as nothing has been read or written.
The first action (read/write) fixes the orientation. If you write to stdout with 'printf', the stream becomes byte-oriented.
If you later write to stdout with 'wprintf', you will see nothing because of the mismatch.
So if you write a program that works with wide characters, you should always use the wide functions.

Hint: the format specifier for MBS is '%s' and the format specifier for WCS is '%ls'.
Likewise, the format specifier for a character is '%c' and for a wide character it is '%lc'.
"xxx" is a string literal in MBS and L"xxx" is a string literal in wide format.

**Codeplug** · 02-21-2015

>> swprintf(UserInput, 1024, L"%s", argv[1]);
That should be "%hs".

>> the formatstring for MBS is '%s' ...
Well... "%s" expects that same type of string that is associated with the function. So "%s" to swprintf() expects a wide string.

gg

**guraknugen** · 02-22-2015

Originally Posted by Codeplug

>> but here it looks like F09D849E.
Here is one of the codepoint reference sites I like to use: Unicode Character 'MUSICAL SYMBOL G CLEF' (U+1D11E)

Thanks, that page gives me slightly more information than gnome-character-map.

Originally Posted by Codeplug

You should also be aware of some things relating to using extended characters directly in your source (lots of caveats on Windows if you're interested): Non-English characters with cout

gg

Well, I'm not very interested in Windows, I left it behind in 2008 I think, but I enjoyed the link anyway…

**WoodSTokk** · 02-22-2015

Originally Posted by Codeplug

>> the formatstring for MBS is '%s' ...
Well... "%s" expects that same type of string that is associated with the function. So "%s" to swprintf() expects a wide string.

Are you sure? If you are right this program should output "abc".

Code:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>


int main(int argc, char *argv[])
{
    if (!setlocale(LC_CTYPE, "")) {
        fwprintf(stderr, L"Can't set the specified locale!\n"
            L"Check LANG, LC_CTYPE and LC_ALL.\n");
        return 1;
    }

    wchar_t test[1024] = L"abc";
    wprintf(L"Test ”%s”!\n", test);
    return 0;
}

**Nominal Animal** · 02-22-2015

Originally Posted by Codeplug

That should be "%hs". "%s" expects that same type of string that is associated with the function.

Nope. %s is a narrow/multibyte string, and %ls is a wide string.

For references, see man swprintf(3), or C99 7.24.2.1p8.

Neither defines what %hs would do, so you must've picked that up from some nonstandard implementation.

**phantomotap** · 02-22-2015

Nope. %s is a narrow/multibyte string, and %ls is a wide string.

Neither defines what %hs would do, so you must've picked that up from some nonstandard implementation.

O_o

I thought you knew that Microsoft has a tendency to do things just wrong enough to cause confusion.

Soma

**Nominal Animal** · 02-22-2015

Originally Posted by phantomotap

I thought you knew that Microsoft has a tendency to do things just wrong enough to cause confusion.

I do, but I was trying to be non-aggravating. I'm getting worried that Codeplug thinks I'm sniping at his posts. I'm not. I only post a "Nope" if I feel you are otherwise/generally correct, but this detail needs fixing.

**guraknugen** · 02-22-2015

Originally Posted by Nominal Animal

Here's a pure C99 implementation that requires no extra libraries, to explore the concepts of "multibyte strings" and "wide character strings":

Thanks! Works perfectly, even when I compile it as GNU99. I'm not sure exactly what the difference is, but I'm sure some things are…
I will learn a lot from this, I'm sure.

**Codeplug** · 02-22-2015

I expect to be corrected when I'm wrong

My career has been MS oriented long enough for me to completely forget that that's how %s works.

gg
