Unicode/UTF-8 -- Going Beyond ASCII

**~~CommonTater~~** · 01-22-2011

Originally Posted by msh

I might be wrong, but I think that UNICODE defines are MSVC-specific.

#define _UNICODE ... C and C++, works on nearly every compiler I've used. It's a switch for certain defines and macros in the libraries.

#define UNICODE ... Windows, automatically switches between the ANSI ("A") and Unicode ("W") versions of almost every function call documented in the SDK.

I did not realize we were on a non-Windows setup... sorry about that.
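For anyone who has not seen how such a switch works, here is a minimal, portable sketch of the idea. The names DEMO_UNICODE, demo_char, DEMO_T and demo_strlen are made up for illustration; they only imitate the pattern the Microsoft headers use (TCHAR, _T(), and so on) and are not the real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical switch, modelled on the _UNICODE / UNICODE pattern.
   Build with -DDEMO_UNICODE to get the wide-character variants. */
#ifdef DEMO_UNICODE
typedef wchar_t demo_char;          /* character type becomes wchar_t   */
#define DEMO_T(s)   L##s            /* string literals become L"..."    */
#define demo_strlen wcslen          /* generic name maps to wide call   */
#else
typedef char demo_char;             /* character type stays char        */
#define DEMO_T(s)   s               /* string literals are left alone   */
#define demo_strlen strlen          /* generic name maps to narrow call */
#endif

/* Code written against the generic names compiles either way. */
size_t demo_length(const demo_char *s)
{
    assert(s != NULL);
    return demo_strlen(s);
}
```

Calling demo_length(DEMO_T("abc")) gives 3 in both builds; only the underlying types and functions change.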

**Codeplug** · 01-22-2011

Code:

#include <stdio.h>
#include <wchar.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>
#include <limits.h>

/* taken from libc manual */
size_t mbslen (const char *s)
{
    mbstate_t state;
    size_t result = 0;
    size_t nbytes;
    memset(&state, '\0', sizeof(state));
    while ((nbytes = mbrlen(s, MB_LEN_MAX, &state)) > 0)
    {
        if (nbytes >= (size_t)-2)
            return (size_t)-1;
        s += nbytes;
        ++result;
    }

    return result;
}

int main(void)
{
    char input[80] = {0};
    wchar_t output[80] = {0};

#if 1
    const char *l = setlocale(LC_CTYPE, "");
    if (!l)
        perror("setlocale");
    else
        printf("locale = %s\n", l);
#else
    puts("Using C locale");
#endif

    printf("Give me something: ");
    /* the width limit keeps a long word from overflowing input[80] */
    if (scanf("%79s", input) != 1)
        return 1;

    int n1 = mbstowcs(output, input, 80);
    if (n1 == -1)
        perror("mbstowcs");
    int n2 = wcslen(output);
    int n3 = strlen(input);
    int n4 = mbslen(input);
    if (n4 == -1)
        perror("mbslen");

    printf("mbstowcs(output,input)=%d\n"
           "wcslen(output)=%d\n"
           "strlen(input)=%d\n"
           "mbslen(input)=%d\n", n1, n2, n3, n4);

    if (*input)
    {
        printf("input = ");
        const char *p = input;
        for (; *p; ++p)
            printf("0x%02X,", (unsigned char)*p);
        puts("\b ");
    }

    if (*output)
    {
        printf("output = ");
        const wchar_t *p = output;
        for (; *p; ++p)
            printf("0x%X,", (unsigned)*p);
        puts("\b ");
    }
    return 0;
}

Code:

locale = en_US.utf8
Give me something: œ∑Ω
mbstowcs(output,input)=3
wcslen(output)=3
strlen(input)=7
mbslen(input)=3
input = 0xC5,0x93,0xE2,0x88,0x91,0xCE,0xA9 
output = 0x153,0x2211,0x3A9

Code:

Using C locale
Give me something: œ∑Ω
mbstowcs: Invalid or incomplete multibyte or wide character
mbslen: Invalid or incomplete multibyte or wide character
mbstowcs(output,input)=-1
wcslen(output)=0
strlen(input)=7
mbslen(input)=-1
input = 0xC5,0x93,0xE2,0x88,0x91,0xCE,0xA9

On my Fedora 14 box, not calling setlocale causes mbstowcs to fail, because the input contains bytes that are not valid characters in the C locale.
On tabstop's box, mbstowcs seems to just copy each char to a wchar_t piecemeal under the C locale.
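To make the byte-to-code-point mapping in the first run concrete without depending on any locale, here is a rough hand-rolled decoder. This is not what mbstowcs does internally; utf8_decode and utf8_count are hypothetical helpers written for this post, and the sketch is simplified (it does not reject overlong encodings or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
   Stores the code point in *cp and returns the number of bytes used,
   or 0 if the sequence is malformed. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    size_t len;
    uint32_t value;

    if (s[0] < 0x80)                { len = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; value = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    for (size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)  /* must be 10xxxxxx; also stops at NUL */
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return len;
}

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string.
   Returns (size_t)-1 on malformed input, like mbslen above. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;
    uint32_t cp;

    while (*p) {
        size_t n = utf8_decode(p, &cp);
        if (n == 0)
            return (size_t)-1;
        p += n;
        ++count;
    }
    return count;
}
```

Fed the seven bytes from the run above, this yields the same three values mbstowcs produced: 0xC5,0x93 gives 0x153, 0xE2,0x88,0x91 gives 0x2211, and 0xCE,0xA9 gives 0x3A9.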

gg

**trievideo** · 01-23-2011

Hey, a big thanks to all of you for taking the time to test and post code or otherwise offer suggestions! And, yes, I did get what I was after.

Here's the final version, tested and working, exactly as I intended:

Code:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

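int main (int argc, const char * argv[]) {

    setlocale(LC_ALL, "");      /* pick up the environment's locale (e.g. UTF-8) */
    wchar_t UnicodeChar = 0;
    int c2 = 0;

    wprintf(L"Type any Unicode character here:");
    wscanf(L"%lc", &UnicodeChar);

    /* Stay with wide I/O on stdin; %d skips the newline left by the first read. */
    wprintf(L"Type any Unicode numeric value:");
    wscanf(L"%d", &c2);

    /* %lc takes a wint_t argument and %d an int, so cast explicitly. */
    wprintf(L"\nThe character '%lc' has a Unicode numeric value of '%d.'\nThe numeric value '%d' represents the Unicode character '%lc.'\n",
            (wint_t)UnicodeChar, (int)UnicodeChar, c2, (wint_t)c2);

    return 0;
}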

It works for any Unicode character -- tested on some characters from non-Latin scripts far outside the ASCII 7-bit range. It returns, as I intended, the Unicode code point (in decimal form) of any Unicode character, and the Unicode character corresponding to any Unicode code point typed in.
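For the curious, here is a rough sketch of what has to happen on a UTF-8 terminal when a code point is printed as a character. My program leaves this to %lc and the locale; the function below, utf8_encode, is a hypothetical helper that does the conversion by hand and is not part of the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out, which must
   have room for 5 bytes (4 data bytes plus the terminating NUL).
   Returns the number of data bytes written, or 0 if cp is a surrogate
   or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, char out[5])
{
    assert(out != NULL);
    unsigned char *o = (unsigned char *)out;
    size_t n;

    if (cp < 0x80) {                          /* 0xxxxxxx */
        o[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {                  /* 110xxxxx 10xxxxxx */
        o[0] = (unsigned char)(0xC0 | (cp >> 6));
        o[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {                /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* surrogates are not characters */
        o[0] = (unsigned char)(0xE0 | (cp >> 12));
        o[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        o[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {              /* 11110xxx + three 10xxxxxx */
        o[0] = (unsigned char)(0xF0 | (cp >> 18));
        o[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        o[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        o[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;
    }
    o[n] = '\0';
    return n;
}
```

For example, typing 937 (0x3A9) corresponds to the two bytes 0xCE,0xA9, which the terminal shows as Ω; 8721 (0x2211) becomes 0xE2,0x88,0x91, shown as ∑.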

Originally Posted by Codeplug

Is it that the newline is left in the input stream so that the second wscanf() pulls it out and displays 10?

Yes, Codeplug, you are right. Thanks for pointing out that issue, corrected by cas's modification to the second wscanf:

Originally Posted by cas

Your second call to scanf() won't do what you expect. You probably just want to read an int (keep using wscanf(), but switch to %d and make c2 an int); when writing out with %lc, cast the int to a wchar_t, since that's what %lc expects.

Thanks for that, cas!
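The newline problem is easy to reproduce without typing anything, by running the same conversions against a fixed wide string with swscanf(). This is only a sketch: the two function names are made up for the demonstration, and the string stands in for what the keyboard would deliver (a character, Enter, a number, Enter).

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical demo: two %lc conversions in a row. %lc does not skip
   whitespace, so the second one picks up the newline (value 10). */
int parse_two_chars_naive(const wchar_t *text, wchar_t *first, wchar_t *second)
{
    assert(text != NULL && first != NULL && second != NULL);
    return swscanf(text, L"%lc%lc", first, second) == 2;
}

/* Hypothetical demo of the fix: %d skips leading whitespace, including
   the newline left behind by the first read, and then parses the number. */
int parse_char_then_int(const wchar_t *text, wchar_t *ch, int *value)
{
    assert(text != NULL && ch != NULL && value != NULL);
    return swscanf(text, L"%lc%d", ch, value) == 2;
}
```

With the text L"A\n66\n", the naive version stores 'A' and then the newline, which is the stray 10 Codeplug spotted; the second version stores 'A' and 66. If a second character really were wanted instead of a number, a leading space in the format (L" %lc") would skip the newline as well.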

Originally Posted by tabstop

Examples from my Mac terminal:...

Thanks, tabstop, for reminding me to run code in the Mac terminal, not just in Xcode: I was getting nothing but failures building and running my code in Xcode! But compiling and running my code in Terminal worked. Of course, that raises the question: why can't Xcode reproduce the success in Terminal? Probably an Xcode configuration issue, I guess.
