Here's a pure C99 implementation that requires no extra libraries, to explore the concepts of "multibyte strings" and "wide character strings":
Code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
int main(int argc, char *argv[])
{
    wchar_t *wide_str = NULL;
    size_t wide_max = 0;
    size_t i, len;
    int arg;

    /* Use the user-specified locale settings. */
    setlocale(LC_ALL, "");

    /* Set standard output to wide character mode. */
    fwide(stdout, 1);

    /* Show help if invalid command-line parameters. */
    if (argc < 2 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s STRING ...\n", argv[0]);
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    for (arg = 1; arg < argc; arg++) {

        /* Length of the multibyte string, in characters. */
        len = mbstowcs(NULL, argv[arg], 0);
        if (len == (size_t)-1) {
            fprintf(stderr, "%s: Invalid multibyte sequence in the current locale.\n", argv[arg]);
            return EXIT_FAILURE;
        }

        /* Resize the wide string buffer to be large enough. */
        if (len >= wide_max) {
            size_t   temp_max = len + 1;
            wchar_t *temp_str;

            temp_str = realloc(wide_str, temp_max * sizeof wide_str[0]);
            if (temp_str == NULL) {
                fprintf(stderr, "Out of memory.\n");
                return EXIT_FAILURE;
            }

            wide_str = temp_str;
            wide_max = temp_max;
        }

        /* Convert the multibyte string to a wide string. */
        len = mbstowcs(wide_str, argv[arg], wide_max);

        /* Output the initial string, */
        wprintf(L"Input string '%s':\n", argv[arg]);

        /* and each of its characters. */
        for (i = 0; i < len; i++)
            wprintf(L"\t'%lc' = U+%04x = %u\n",
                    wide_str[i], (unsigned int)wide_str[i], (unsigned int)wide_str[i]);
    }

    free(wide_str);
    return EXIT_SUCCESS;
}
The program does assume you are using UTF-8 (or some other Unicode encoding). If you configure your locale to use something else, then the code points are those of that locale, not Unicode -- but you should be using UTF-8 anyway.
To compile and run the above example.c, I recommend using
Code:
gcc -Wall -Wextra -std=c99 example.c -o example
./example 'a𝄞b'
and it should output
Code:
Input string 'a𝄞b':
	'a' = U+0061 = 97
	'𝄞' = U+1d11e = 119070
	'b' = U+0062 = 98
For conversion of multibyte strings between character encodings, you should use the POSIX.1-2001 iconv() facilities. (Add #define _POSIX_C_SOURCE 200809L before any includes in your program, to avail yourself of all POSIX facilities up to and including POSIX.1-2008.)
You can use nl_langinfo(CODESET) (from langinfo.h) to obtain the current character encoding. If you want, I can write a simple POSIX.1-2001 example using iconv and nl_langinfo, showing how to properly handle all locales and character sets while internally storing everything as UTF-8. (Nowadays I don't bother with that for my own programs, unless dealing with legacy data, since UTF-8 everywhere works without any hassle.)
If your program assumes all input strings are multibyte/narrow UTF-8 strings, and all output is multibyte/narrow UTF-8 -- that is, both input and output are always UTF-8 -- then you do not need to do anything.
Basically, you treat all strings as opaque byte sequences terminated by zero. This is exactly how the Linux kernel handles file names et cetera: it does not care what the bytes are, as long as there is a suitable number of them. (Aside from zero, the only other "specials" are slash "/" (47) and dot "." (46).) (Pseudo-filesystem interfaces like the proc and sys filesystems are a bit pickier about what they accept, but the basic idea is still the same.)
The fact that some glyphs take more than one byte to print correctly does not really matter much in practice.
If you need character-by-character (or, more properly, glyph-by-glyph) output, use the wide facilities. If you want to control the terminal window, use wide curses -- ncursesw: see my recent example here.
Questions? Comments?