Thread: Unicode/UTF-8 -- Going Beyond ASCII

  1. #16
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by msh
    I might be wrong, but I think that UNICODE defines are MSVC-specific.
    #define _UNICODE ... C and C++; works on nearly every compiler I've used. It's a switch for certain defines and macros in the libraries.

    #define UNICODE ... Windows; automatically switches between the ANSI (A) and Unicode (W) versions of almost every function call documented in the SDK.

    I did not realize we were on a non-Windows setup... sorry about that.

  2. #17
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <string.h>
    #include <stdlib.h>
    #include <locale.h>
    #include <limits.h>
    
    /* taken from libc manual */
    size_t mbslen (const char *s)
    {
        mbstate_t state;
        size_t result = 0;
        size_t nbytes;
        memset(&state, '\0', sizeof(state));
        while ((nbytes = mbrlen(s, MB_LEN_MAX, &state)) > 0)
        {
            if (nbytes >= (size_t)-2)
                return (size_t)-1;
            s += nbytes;
            ++result;
        }
    
        return result;
    }
    
    int main(void)
    {
        char input[80] = {0};
        wchar_t output[80] = {0};
    
    #if 1
        const char *l = setlocale(LC_CTYPE, "");
        if (!l)
            perror("setlocale");
        else
            printf("locale = %s\n", l);
    #else
        puts("Using C locale");
    #endif
    
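        printf("Give me something: ");
        /* limit the read to 79 bytes so input[] cannot overflow */
        if (scanf("%79s", input) != 1)
            return 1;
    
        /* convert at most 79 wide chars so output[] stays null-terminated */
        int n1 = mbstowcs(output, input, 79);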
        if (n1 == -1)
            perror("mbstowcs");
        int n2 = wcslen(output);
        int n3 = strlen(input);
        int n4 = mbslen(input);
        if (n4 == -1)
            perror("mbslen");
    
        printf("mbstowcs(output,input)=%d\n"
               "wcslen(output)=%d\n"
               "strlen(input)=%d\n"
               "mbslen(input)=%d\n", n1, n2, n3, n4);
    
        if (*input)
        {
            printf("input = ");
            const char *p = input;
            for (; *p; ++p)
                printf("0x%02X,", (unsigned char)*p);
            puts("\b ");
        }
    
        if (*output)
        {
            printf("output = ");
            const wchar_t *p = output;
            for (; *p; ++p)
                printf("0x%X,", (unsigned)*p);
            puts("\b ");
        }
        return 0;
    }
    Code:
    locale = en_US.utf8
    Give me something: œ∑Ω
    mbstowcs(output,input)=3
    wcslen(output)=3
    strlen(input)=7
    mbslen(input)=3
    input = 0xC5,0x93,0xE2,0x88,0x91,0xCE,0xA9 
    output = 0x153,0x2211,0x3A9
    Code:
    Using C locale
    Give me something: œ∑Ω
    mbstowcs: Invalid or incomplete multibyte or wide character
    mbslen: Invalid or incomplete multibyte or wide character
    mbstowcs(output,input)=-1
    wcslen(output)=0
    strlen(input)=7
    mbslen(input)=-1
    input = 0xC5,0x93,0xE2,0x88,0x91,0xCE,0xA9
    On my Fedora 14 box, not calling setlocale causes mbstowcs to fail - due to bytes that don't belong to the C locale.
    On tabstop's box, mbstowcs seems to just copy each char to a wchar_t piecemeal under the C locale.

    gg

  3. #18
    Registered User
    Join Date
    Jan 2011
    Posts
    11

    Problem Solved

    Hey, a big thanks to all of you for taking the time to test and post code or otherwise offer suggestions! And, yes, I did get what I was after. Here's the final version, tested and working, exactly as I intended:
    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    
    int main(void)
    {
        wchar_t UnicodeChar = 0;
        int c2 = 0;
    
        setlocale(LC_ALL, "");
    
        wprintf(L"Type any Unicode character here:");
        wscanf(L"%lc", &UnicodeChar);
    
        wprintf(L"Type any Unicode numeric value:");
        /* stay with wide I/O on stdin; %d skips the newline left by the first read */
        wscanf(L"%d", &c2);
    
        /* %lc expects a wint_t and %d an int, so cast each argument to match */
        wprintf(L"\nThe character '%lc' has a Unicode numeric value of '%d.'\nThe numeric value '%d' represents the Unicode character '%lc.'", (wint_t)UnicodeChar, (int)UnicodeChar, c2, (wint_t)c2);
    
        return 0;
    }
    It works for any Unicode character -- tested on some characters from non-Latin scripts far outside the ASCII 7-bit range. It returns, as I intended, the Unicode code point (in decimal form) of any Unicode character, and the Unicode character corresponding to any Unicode code point typed in.
    Quote Originally Posted by Codeplug
    Is it that the newline is left in the input stream so that the second wscanf() pulls it out and displays 10?
    Yes, Codeplug, you are right. Thanks for pointing out that issue, corrected by cas's modification to the second wscanf:
    Quote Originally Posted by cas
    Your second call to scanf() won't do what you expect. You probably just want to read an int (keep using wscanf(), but switch to %d and make c2 an int); when writing out with %lc, cast the int to a wchar_t, since that's what %lc expects.
    Thanks for that, cas!
    Quote Originally Posted by tabstop
    Examples from my Mac terminal:...
    Thanks, tabstop, for reminding me to run code in the Mac terminal, not just in Xcode: I was getting nothing but failures building and running my code in Xcode! But compiling and running my code in Terminal worked. Of course, that raises the question: why can't Xcode reproduce the success in Terminal? Probably an Xcode configuration issue, I guess.
