-
UTF-8 string literals
What's the best way of embedding UTF-8 (specifically UTF-8, not Unicode) literals in strings in C? You can, of course, do them as byte (char) sequences, but that makes it impossible to write
Code:
char * str = "The first letter of the Greek alphabet is " ALPHA ".";
Basically, I want to be able to write
Code:
#define ALPHA "<something>"
-
I should probably add, since this is platform-dependent, that I'm using GCC on OS X and FreeBSD.
-
I can see two ways:
1. Declare the UTF-8 string as a byte array:
Code:
unsigned char utf8[] = { 'a', 'b', 0 };
2. Declare the string as a wide-character string, then convert it to multibyte.
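For a non-ASCII character, option 1 might look like the sketch below. It assumes the character wanted is U+03B1 (Greek small letter alpha), whose UTF-8 encoding is the two bytes 0xCE 0xB1. The helper name build_sentence is made up for illustration. Since an array can't be pasted into a string literal, the sentence has to be assembled at run time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* UTF-8 bytes for U+03B1 (Greek small letter alpha), plus the terminator. */
static const unsigned char alpha_utf8[] = { 0xCE, 0xB1, 0 };

/* Hypothetical helper: formats the sentence into out and returns the
   byte count reported by snprintf (terminator not included). */
int build_sentence(char *out, size_t out_size)
{
    assert(out != NULL);
    return snprintf(out, out_size,
                    "The first letter of the Greek alphabet is %s.",
                    (const char *)alpha_utf8);
}
```

Calling build_sentence(buf, sizeof buf) with a 64-byte buffer leaves the UTF-8 encoded sentence in buf.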
-
That doesn't work. As I said, representing them isn't the problem. Representing them as a string literal is. Can they be written as "<something>"?
-
Venturing into new territory myself, but isn't it something like "\u0139"?
-
Using '\U' or '\u' is C99's support for Universal Character Names (UCN).
The character set used is the Universal Character Set (UCS), as described by ISO/IEC 10646.
You can also use 'u' or 'U' to begin a string literal (much like 'L'); these prefixes are a newer addition, standardized in C11.
When you use 'u', you get a UTF-16 encoded character or string literal.
When you use 'U', you get a UTF-32 encoded character or string literal.
When you use 'L', you get a "wide" character or string. The character set used is implementation-defined.
The element types differ: char16_t for 'u', char32_t for 'U' (both declared in <uchar.h>), and wchar_t for 'L'.
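A small sketch of the three prefixes, assuming a C11 compiler and a C library that ships <uchar.h> (not every platform does). The array names are made up for illustration, and U+03B1 (Greek small letter alpha) is used as the sample character:

```c
#include <assert.h>   /* static_assert (C11) */
#include <uchar.h>    /* char16_t, char32_t (C11) */
#include <wchar.h>    /* wchar_t */

/* The same character with each prefix; only the element type changes. */
const char16_t alpha16[] = u"\u03B1";  /* UTF-16 code units on gcc */
const char32_t alpha32[] = U"\u03B1";  /* UTF-32 code units on gcc */
const wchar_t  alphaw[]  = L"\u03B1";  /* encoding is implementation-defined */

/* U+03B1 is in the Basic Multilingual Plane, so each array holds one
   code unit plus the terminating zero. */
static_assert(sizeof alpha16 / sizeof alpha16[0] == 2, "one UTF-16 unit");
static_assert(sizeof alpha32 / sizeof alpha32[0] == 2, "one UTF-32 unit");
```

None of these is a char string, so none of them can be concatenated with an ordinary "..." literal to get UTF-8.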
If you want a portable UTF-8 encoding in a string literal, you'll have to encode it yourself with hex escape sequences of the form "\xhh", one per byte.
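Applied to the ALPHA macro from the original question, that could look like the sketch below. It assumes the character wanted is U+03B1 (Greek small letter alpha), which is 0xCE 0xB1 in UTF-8. The check_alpha_bytes function is a made-up self-check, not something the thread asked for:

```c
#include <assert.h>
#include <string.h>

/* UTF-8 encoding of U+03B1, spelled out byte by byte. */
#define ALPHA "\xCE\xB1"

/* Adjacent literals are concatenated after escapes are processed, so the
   "." that follows can never be absorbed into the \xB1 escape. */
const char *str = "The first letter of the Greek alphabet is " ALPHA ".";

/* Hypothetical self-check: the macro expands to exactly two bytes. */
void check_alpha_bytes(void)
{
    assert(strlen(ALPHA) == 2);
    assert((unsigned char)ALPHA[0] == 0xCE);
    assert((unsigned char)ALPHA[1] == 0xB1);
}
```

Keeping the escapes in their own literal matters: written inline as "\xCE\xB1bc", the 'b' and 'c' would be read as more hex digits of the second escape. On a C11 compiler, a u8 prefix such as u8"\u03B1" is another way to ask for UTF-8 without working out the bytes by hand.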
http://www.cl.cam.ac.uk/~mgk25/unicode.html
Having said all that, here's gcc's documentation of its implementation-defined behavior:
http://gcc.gnu.org/onlinedocs/gcc-4....fined-behavior
If you look up "-fexec-charset", you'll find that UTF-8 is already the default. So if you have a Greek "Alpha" on your keyboard, make your source files UTF-8 and put the real character in there (gcc supports UTF-8 source files). If your source files are in ASCII, you'll have to use escape sequences or load your strings from an external source.
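As a rough sketch of the ASCII-source route on gcc: with the execution character set left at its UTF-8 default, a universal character name in an ordinary string literal comes out as UTF-8 bytes. This assumes U+03B1 (Greek small letter alpha) is the character wanted, and the two function names are made up for illustration. The check fails if the compiler is told to use a different -fexec-charset:

```c
#include <assert.h>
#include <string.h>

/* ASCII-only source; the compiler converts the UCN to the execution
   character set, which is assumed here to be UTF-8. */
#define ALPHA "\u03B1"

const char *greek_sentence(void)
{
    return "The first letter of the Greek alphabet is " ALPHA ".";
}

/* Hypothetical self-check: confirms the UCN became the bytes 0xCE 0xB1. */
void check_exec_charset_is_utf8(void)
{
    const unsigned char *p = (const unsigned char *)ALPHA;
    assert(strlen(ALPHA) == 2);
    assert(p[0] == 0xCE && p[1] == 0xB1);
}
```

Typing the actual character between the quotes in a UTF-8 source file should produce the same two bytes under the same default.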
gg
-
That's great, exactly what I was looking for, thanks!