Thread: Unicode/UTF-8 -- Going Beyond ASCII

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    11

    Unicode/UTF-8 -- Going Beyond ASCII

    Here's a simple program that prints (on any ASCII or Unicode machine):

    1) the numeric value of any ASCII character;

    2) the ASCII character corresponding to any numeric value in the ASCII character set.

    Code:
    #include <stdio.h>
    
    int main (int argc, const char * argv[]) {
        
        char c1=0;  /* %c stores one char, so c1 must be a char, not an int */
        int c2=0;
        
        printf("Type any ASCII character here:");
        scanf("%c",&c1);
        
        printf("Type any ASCII numeric value:");
        scanf("%d",&c2);
        
        printf("\nThe character '%c' has an ASCII numeric value of '%d.'\nThe numeric value '%d' represents the ASCII character '%c.'",c1,c1,c2,c2);
        
        return 0;
    }

    The question is: How can I modify this code to handle Unicode/UTF-8 characters beyond ASCII?

    I've spent many hours googling and reading and experimenting, but my code continues to fail.

    I added wide character header files and changed the format specifier, for example:
    Code:
    #include <wchar.h>
    ….
    ….
    printf("%ls...")
    but that failed.

    I realize that many experienced programmers don't know how to implement Unicode/UTF-8, but I'm a noob,
    so it's particularly difficult for me! I'd greatly appreciate any help with actual code snippets.

  2. #2
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Did you not change scanf() to the wide version?
    Mainframe assembler programmer by trade. C coder when I can.

  3. #3
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Within the confines of the standard, the best thing you can do is call setlocale(LC_ALL, ""); at the beginning of main(). This tells the runtime to use the user's default locale settings. Here is a thread that explores some of the issues: http://cboard.cprogramming.com/c-pro...E-console.html
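
    A minimal sketch of that call, assuming a POSIX system where nl_langinfo(CODESET) is available (it is not part of standard C). The helper names is_utf8_codeset and init_locale_is_utf8 are made up for illustration; the idea is only to check what setlocale() actually selected instead of assuming UTF-8.

```c
#include <ctype.h>
#include <langinfo.h>   /* POSIX, not ISO C */
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if a codeset name means UTF-8.
   Accepts spellings such as "UTF-8", "utf8" and "UTF8". */
int is_utf8_codeset(const char *codeset)
{
    static const char want[] = "utf8";
    size_t i = 0;

    if (codeset == NULL)
        return 0;
    for (; *codeset != '\0'; ++codeset) {
        if (*codeset == '-')
            continue;               /* ignore the optional hyphen */
        if (want[i] == '\0' ||
            tolower((unsigned char)*codeset) != want[i])
            return 0;
        ++i;
    }
    return want[i] == '\0';
}

/* Hypothetical helper: adopt the user's locale, then report whether
   its character encoding is UTF-8. Call once at the start of main(). */
int init_locale_is_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return 0;                   /* environment names an unknown locale */
    return is_utf8_codeset(nl_langinfo(CODESET));
}
```

    If init_locale_is_utf8() returns 0, the program can at least warn that non-ASCII input is unlikely to be decoded as expected.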

    gg

  4. #4
    Registered User
    Join Date
    Jan 2011
    Posts
    11
    Dino and Codeplug, thanks for your responses.
    Dino, yes, I've written the program with the wide-character equivalent of scanf, wscanf, but that failed, in part because I'm a noob
    and certainly made syntax errors, etc.

    Codeplug, I've included "setlocale(LC_ALL,"");" in the program as well, but that, too, fails. I read through the linked thread you posted, but,
    like many other Unicode threads and articles I've read, it provides a few insights but doesn't quite enable me to get this simple program running.

    Here's a recent FAILED attempt to get my simple program to work:
    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    #include <string.h>
    
    
    int main (int argc, const char * argv[]) {
    	
            setlocale(LC_ALL, "");
    	const wchar_t UnicodeChar=0;
    	int  c2=0;//To hold Unicode codepoint (decimal)
    	
        fwprintf(stderr,L"Type any Unicode character here:");
        wscanf(L"%ls",&UnicodeChar);
        
        fwprintf(stderr,L"Type any Unicode numeric value:");
        scanf("%d",&c2);
    	
    	fwprintf(stderr,L"\nThe character '%ls' has an Unicode numeric value of '%d.'\nThe numeric value '%d' represents the Unicode character '%ls.'",UnicodeChar,UnicodeChar,c2,c2);
    	
    	return 0;
    }
    This was tested under a GNU/Mac system, which encodes source code as UTF-8.

    As a beginner, I know my code is filled with syntax and other basic errors, omissions, etc.
    If anyone has any ideas about how to correct these errors and modify this code to work
    (as the ASCII program does) that would be great.

  5. #5
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    1) Don't define your input variable as const.
    2) c2 should also be wchar_t.
    3) on lines 1 and 2 of the file insert...
    Code:
    #define UNICODE
    #define _UNICODE

  6. #6
    Registered User
    Join Date
    Jan 2011
    Posts
    11
    CommonTater, thanks for your input. Though you've probably brought me closer to success, the program still fails with “EXC_BAD_ACCESS”

    Here's the program as it looks now:
    Code:
    #define UNICODE
    #define _UNICODE
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    #include <string.h>
    
    
    int main (int argc, const char * argv[]) {
            setlocale(LC_ALL, "");
    	wchar_t UnicodeChar=0,c2=0;
    	
        fwprintf(stderr,L"Type any Unicode character here:");
        wscanf(L"%ls",&UnicodeChar);
        
        fwprintf(stderr,L"Type any Unicode numeric value:");
        scanf("%d",&c2);
    	
    	fwprintf(stderr,L"\nThe character '%ls' has an Unicode numeric value of '%d.'\nThe numeric value '%d' represents the Unicode character '%ls.'",UnicodeChar,UnicodeChar,c2,c2);
    	
    	return 0;
    }
    Any more ideas anyone?

  7. #7
    Novice
    Join Date
    Jul 2009
    Posts
    568
    Quote Originally Posted by CommonTater View Post
    1) Don't define your input variable as const.
    2) c2 should also be wchar_t.
    3) on lines 1 and 2 of the file insert...
    Code:
    #define UNICODE
    #define _UNICODE
    I might be wrong, but I think that UNICODE defines are MSVC-specific.
    Disclaimer: This post shows my ignorance at the time of its making. I claim ownership of but not responsibility for all errors in it. Reference at your own peril.

  8. #8
    Novice
    Join Date
    Jul 2009
    Posts
    568
    I also believe that they should come after all includes.

  9. #9
    Registered User
    Join Date
    Sep 2007
    Posts
    1,012
    You don't want to define UNICODE or _UNICODE. They are not part of the C standard, and should only be used on systems that actually require them (you're not even allowed, as far as the C standard is concerned, to create something called _UNICODE).

    %ls is analogous to %s: you can't just store the result in a single wchar_t. Either read a single character with %lc or make UnicodeChar an array. Adjust the fwprintf() accordingly.

    Also, you're trying to read into c2 as though it's an int, which it is not. And again, when printing, %ls is wrong for a single character.

    None of this guarantees Unicode. The setlocale() call will, hopefully, if your environment is set up properly, get you UTF-8, but you can't be sure of that.
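
    To make the %ls versus %lc distinction concrete, here is a small sketch that uses swscanf() on an in-memory wide string, so it does not depend on the terminal or the locale. The function names read_one and read_word are invented for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* %lc stores exactly one wide character, so a single wchar_t is enough.
   Returns 1 on success, 0 if nothing could be read. */
int read_one(const wchar_t *text, wchar_t *out)
{
    return swscanf(text, L"%lc", out) == 1;
}

/* %ls stores a whole word plus a terminating L'\0', so it needs an array.
   The width 15 leaves room for the terminator in a 16-element buffer. */
int read_word(const wchar_t *text, wchar_t out[16])
{
    return swscanf(text, L"%15ls", out) == 1;
}
```

    With %ls even a one-letter word writes two elements (the letter and the terminator), which overruns a lone wchar_t.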

  10. #10
    Registered User
    Join Date
    Jan 2011
    Posts
    11
    Thanks very much cas (and thanks to msh, too). Yes, replacing %ls with %lc has now gotten rid of my error messages, and the program runs half-way. It even responds correctly to the ASCII subset of characters, but still fails with non-ASCII Unicode character input.

    Here's what the code looks like after the cas modifications:

    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    #include <string.h>
    
    
    int main (int argc, const char * argv[]) {
    	 setlocale(LC_ALL, "");
    	wchar_t UnicodeChar=0,c2=0;
    	
        fwprintf(stderr,L"Type any Unicode character here:");
        wscanf(L"%lc",&UnicodeChar);
        
        fwprintf(stderr,L"Type any Unicode numeric value:");
        wscanf(L"%lc",&c2);
    	
    	fwprintf(stderr,L"\nThe character '%lc' has an Unicode numeric value of '%d.'\nThe numeric value '%d' represents the Unicode character '%lc.'",UnicodeChar,UnicodeChar,c2,c2);
    	
    	return 0;
    }
    Cas, I understand what you're saying about the environment/platform specific nature of running Unicode/UTF-8 programs, so if anyone would like to test this program on a GNU/POSIX system (like mine), that would be great. As it is, though, it's likely to fail, as it does for me.

    Any more suggestions anyone?

  11. #11
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> but I think that UNICODE defines are MSVC-specific.
    It's more Windows specific. There are many compilers that target Windows and use its conventions from the PSDK. They would need to be before all includes - preferably at makefile/project settings level. These settings don't affect standard C code.

    >> wscanf(L"%ls",&UnicodeChar);
    You asked for a string, but only provided storage for a single wchar_t.

    >> scanf("%d",&c2);
    This is against the rules. stdin has been oriented as a wide stream and can no longer be used for narrow (char) input. Once a stream orientation has been set, it can't be changed until it's closed and re-opened. The first I/O operation on a stream sets its orientation.

    (ah, well, cas covered most of this already....)
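
    The stream orientation rule can be observed with fwide(), which is standard C. This is only a sketch: orientation_after_first_write is a hypothetical helper, and it uses tmpfile() so that stdin and stdout are left alone.

```c
#include <stdio.h>
#include <wchar.h>

/* Returns the orientation of a fresh temporary stream after its first
   I/O call: > 0 means wide, < 0 means narrow, 0 means tmpfile() failed
   or the stream is still unoriented. use_wide selects which kind of
   first call is made. */
int orientation_after_first_write(int use_wide)
{
    FILE *fp = tmpfile();
    int orientation;

    if (fp == NULL)
        return 0;
    if (use_wide)
        fwprintf(fp, L"x");     /* first call is wide: stream becomes wide */
    else
        fprintf(fp, "x");       /* first call is narrow: stream becomes byte-oriented */
    orientation = fwide(fp, 0); /* mode 0 only queries, it never changes anything */
    fclose(fp);
    return orientation;
}
```

    Calling fwide(stdin, 0) right before a scanf() or wscanf() call is a quick way to find out which family of functions the stream has already been committed to.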

    >> As it is, though, it's likely to fail, as it does for me.
    What is the problem? What are you typing, and what is the output? How is it different from what you expected?

    Is it that the newline is left in the input stream so that the second wscanf() pulls it out and displays 10?
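
    A small sketch of that newline effect, using swscanf() on a fixed wide string so the result does not depend on the terminal. second_char_value is a name made up for this example; the only difference between the two branches is the space in the format string.

```c
#include <wchar.h>

/* Reads two characters from text and returns the numeric value of the
   second one, or -1 if two characters could not be read. When skip_space
   is nonzero, the space in the format makes the scan skip any whitespace
   (including a newline) before the second %lc. */
int second_char_value(const wchar_t *text, int skip_space)
{
    wchar_t first = 0, second = 0;
    int got;

    if (skip_space)
        got = swscanf(text, L"%lc %lc", &first, &second);
    else
        got = swscanf(text, L"%lc%lc", &first, &second);
    return got == 2 ? (int)second : -1;
}
```

    For the input L"A\nB" the unspaced format hands back the newline (10), while the spaced format hands back 'B'. The same applies to wscanf() reading from the keyboard, where Enter leaves a newline behind after the first character.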

    gg

  12. #12
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    You have correctly noted above that you are using UTF-8. That means that you are not going to be able to use wide strings for input, since wchar_t strings require every character to be the same size (rather than some one-byte and some two-byte characters). (EDIT: Or at least I should add, I've not made it work yet. I fully admit that doesn't mean it can't be done, but.) Examples from my Mac terminal:
    Code:
    mini-genius:helping andrewf$ cat uc.c
    #include <stdio.h>
    #include <wchar.h>
    #include <string.h>
    #include <stdlib.h>
    
    int main(void) {
    
        char input[80];
        wchar_t output[80];
    
        printf("Give me something: ");
        scanf("%s", input);
        printf("I think you gave me %d characters.\n", strlen(input));
        int foo = mbstowcs(output, input, 80);
        printf("I converted %d characters, and here they are: %ls.\n", foo, output);
        printf("The length of the wide string is %d.\n", wcslen(output));
        return 0;
    }
    
    mini-genius:helping andrewf$ ./uc
    Give me something: œ∑Ω
    I think you gave me 7 characters.
    I converted 7 characters, and here they are: œ∑Ω.
    The length of the wide string is 7.
    FURTHER EDIT: Okay so I can make it work with wide input. I would have sworn I had tried this earlier, but I must have missed a w or an l or something.
    Code:
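    #include <locale.h>
    #include <wchar.h>
    
    int main(void) {
        wchar_t c;
        /* Select a UTF-8 locale so the library decodes multibyte input. */
        setlocale(LC_CTYPE, "en_US.utf8");
        /* Read one wide character at a time until EOF or a decoding error. */
        while (1==wscanf(L"%lc", &c)) {
            /* %x expects an unsigned int and %lc a wint_t, hence the casts. */
            wprintf(L"%x=%lc\n", (unsigned)c, (wint_t)c);
        }
        return 0;
    }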
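
    The 7 in the first run looks like a byte count: that program never calls setlocale(), so mbstowcs() runs in the "C" locale and apparently converted each byte to its own wide character. A locale-independent way to see the gap between bytes and characters is to skip UTF-8 continuation bytes by hand. utf8_count is a hypothetical helper and assumes its input is valid UTF-8.

```c
#include <stddef.h>

/* Counts code points in a UTF-8 string by skipping continuation bytes,
   which always have the bit pattern 10xxxxxx. Assumes valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

    For the three characters U+0153, U+2211 and U+03A9 (2 + 3 + 2 bytes in UTF-8), strlen() reports 7 while utf8_count() reports 3.
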
    Last edited by tabstop; 01-22-2011 at 01:06 PM.

  13. #13
    Registered User
    Join Date
    Sep 2007
    Posts
    1,012
    >> Cas, I understand what you're saying about the environment/platform specific nature of running Unicode/UTF-8 programs, so if anyone would like to test this program on a GNU/POSIX system (like mine), that would be great. As it is, though, it's likely to fail, as it does for me.
    It works fine for me with characters whose values are greater than 255. You have to have the $LANG environment variable set properly, though: mine is set to "en_US.UTF-8". This is what setlocale() will read from, and if it's not set (or set to something like en_US.iso88591) then you won't get Unicode.

    Your second call to wscanf() won't do what you expect. You probably just want to read an int (keep using wscanf(), but switch to %d and make c2 an int); when writing out with %lc, cast the int to wint_t, since that's the type %lc expects in the printf family.
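
    Here is a sketch of that read-then-print flow, written against an in-memory string with swscanf() and swprintf() so it can be tried without a terminal. The function name describe and the report wording are invented for this example; the interactive version would use wscanf() and wprintf() with the same conversion specifiers after calling setlocale(LC_ALL, "").

```c
#include <stddef.h>
#include <wchar.h>

/* Parses "<character> <decimal code point>" from line and writes a report
   into out (capacity n wide characters). Returns 1 on success, 0 on
   malformed input or if the report does not fit. */
int describe(const wchar_t *line, wchar_t *out, size_t n)
{
    wchar_t ch = 0;
    int code = 0;   /* an int, read with %d */

    /* Leading space skips whitespace; %lc reads one character, %d a number. */
    if (swscanf(line, L" %lc %d", &ch, &code) != 2)
        return 0;

    /* %lc in the printf family takes a wint_t argument, hence the casts. */
    return swprintf(out, n, L"'%lc' = %d, %d = '%lc'",
                    (wint_t)ch, (int)ch, code, (wint_t)code) >= 0;
}
```

    Given L"A 66" this fills out with the text 'A' = 65, 66 = 'B'. Whether a non-ASCII character reaches the terminal intact still depends on the locale that the output stream is using.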

  14. #14
    Registered User
    Join Date
    Sep 2007
    Posts
    1,012
    Quote Originally Posted by tabstop
    That means that you are not going to be able to use wide strings for input, since wchar_t strings require every character to be the same size (rather than some one-byte and some two-byte characters).
    If I understand correctly, the following program should (depending on $LANG setting) do what you want:
    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    
    int main(void)
    {
      wchar_t input[80];
    
      setlocale(LC_ALL, "");
    
      wprintf(L"Give me something: ");
      wscanf(L"%79ls", input);  /* width limit keeps the input within the 80-element array */
      wprintf(L"I think you gave me %zu characters, and here they are: %ls\n", wcslen(input), input);
    
      return 0;
    }
    
    $ LANG=en_US.UTF-8 ./a.out
    Give me something: œ∑Ω
    I think you gave me 3 characters, and here they are: œ∑Ω
    
    $ LANG= ./a.out
    Give me something: œ∑Ω                                                     
    I think you gave me 6 characters, and here they are: ?????
    By using setlocale() (with a proper $LANG) you're telling the library to read in UTF-8 data; it'll have no problem reading different-sized characters and stuffing each one into a wchar_t.

    Of course, there's no guarantee that UTF-8 support will exist in the library... and most of these wchar_t functions were introduced in C99 (well, C95 if you count that one).
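
    For anyone curious what "reading different-sized characters" involves, here is a hand-written sketch of decoding one UTF-8 sequence into a code point. It is an illustration only, not how any particular C library implements it, and utf8_decode is a hypothetical name. It checks the lead and continuation byte patterns but does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Decodes one UTF-8 sequence starting at s (at most len bytes).
   Stores the code point in *cp and returns the number of bytes used,
   or 0 if the bytes are truncated or do not follow the lead and
   continuation byte patterns. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    size_t need, i;
    unsigned long value;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { need = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; value = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid byte */

    if (need > len)
        return 0;                   /* sequence is cut short */
    for (i = 1; i < need; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* continuation bytes look like 10xxxxxx */
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return need;
}
```

    The bytes C5 93 decode to U+0153 in two bytes and E2 88 91 decode to U+2211 in three, which is why a byte count and a character count disagree for non-ASCII input.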

  15. #15
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by cas View Post
    If I understand correctly, the following program should (depending on $LANG setting) do what you want:
    Yeah, I managed to get that fixed. I have no idea how I managed to mess it up the first <large number> times I tried it, but I've edited above.
