Thread: UTF-8 string literals

  1. #1
    Registered User
    Join Date
    Jun 2005
    Posts
    28

    UTF-8 string literals

What's the best way of embedding UTF-8 (specifically UTF-8-encoded bytes, not wide/UTF-16 strings) literals in strings in C? You can, of course, do them as byte (char) array initializers, but then they aren't string literals, which makes it impossible to write
    Code:
    char * str = "The first letter of the Greek alphabet is " ALPHA ".";
    Basically, I want to be able to write
    Code:
    #define ALPHA "<something>"

  2. #2
    Registered User
    Join Date
    Jun 2005
    Posts
    28
    I should probably add, since this is platform-dependent, that I'm using GCC on OS X and FreeBSD.

  3. #3
    Registered User
    Join Date
    Jul 2005
    Location
    Transcarpathia
    Posts
    49
    I can see two ways:
1. declare the UTF-8 string as a byte array:
Code:
 unsigned char utf8[] = { 0xCE, 0x91, 0 };  /* U+0391, Greek capital Alpha, in UTF-8 */
    2. declare string as wide char string, then convert to multibyte.

  4. #4
    Registered User
    Join Date
    Jun 2005
    Posts
    28
That doesn't work. As I said, representing them isn't the problem. Representing them as a string literal is. Can they be written as "<something>"?

  5. #5
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Venturing into new territory myself, but isn't it something like "\u0139"?
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  6. #6
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
The '\u' and '\U' escape sequences are C99's support for Universal Character Names (UCNs).

    The character set used is the Universal Character Set (UCS), as described by ISO/IEC 10646.

You can also use 'u' or 'U' to begin a string literal (much like 'L'); these prefixes were standardized later, in C11.
When you use 'u', you get a UTF-16 encoded character or string literal (type char16_t).
When you use 'U', you get a UTF-32 encoded character or string literal (type char32_t).
When you use 'L', you get a "wide" character or string (type wchar_t). The character set used is implementation defined.

Note that only 'L' uses wchar_t; 'u' and 'U' have their own character types, declared in <uchar.h>.

If you want portable, UTF-8 encoding in a string literal, you'll have to encode the bytes yourself with hex escape sequences ("\xhh").

    http://www.cl.cam.ac.uk/~mgk25/unicode.html

    Having said all that, here's gcc's "defined implementation"
    http://gcc.gnu.org/onlinedocs/gcc-4....fined-behavior

If you look up "-fexec-charset", you'll find that UTF-8 is already the default. So if you have a Greek Alpha on your keyboard, save your source files as UTF-8 and put the real character in there (gcc supports UTF-8 source files). If your source files are ASCII, you'll have to use escape sequences or load your strings from an external source.

    gg

  7. #7
    Registered User
    Join Date
    Jun 2005
    Posts
    28
    That's great, exactly what I was looking for, thanks!
