Thread: Character Encoding

  1. #1
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268

    Character Encoding

    Let's say I have the following:
    Code:
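    #include <windows.h>
    #include <string.h>   /* strlen */
    
    int main(void) 
    {
    	wchar_t wcmsg1[100];
    	wchar_t wcmsg2[100];
    	char msg[] = "Hello \253 Hello";
    	int n;
    
    	/* Convert using the system ANSI code page */
    	n = MultiByteToWideChar(CP_ACP,0,msg, (int)strlen(msg),wcmsg1,100);
    	wcmsg1[n] = 0;
    	/* Convert the same bytes again, this time treating them as UTF-8 */
    	n = MultiByteToWideChar(CP_UTF8,0,msg, (int)strlen(msg),wcmsg2,100);
    	wcmsg2[n] = 0;
    
    	/* Show both results for comparison */
    	MessageBoxW(NULL,wcmsg1,L"wcmsg1",0);
    	MessageBoxW(NULL,wcmsg2,L"wcmsg2",0);
      
    	return 0;
    }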
    Why does MultiByteToWideChar() drop the \253 character when you convert from UTF-8? I was under the impression that UTF-8 was completely backwards compatible with char strings, but this doesn't seem to be the case. Do any Unicode gurus know what is going on here?

  2. #2
    anonytmouse
    Yes, my avatar is stolen
    Join Date
    Dec 2002
    Posts
    2,544
    Why does MultiByteToWideChar() drop the \253 character when you convert from UTF-8? I was under the impression that UTF-8 was completely backwards compatible with char strings, but this doesn't seem to be the case.
    Any 7-bit ASCII character (0 - 127) has the same value in UTF-8. UTF-8 cannot also share values with the 8-bit character sets: it needs the byte values above 127 to encode Unicode characters outside the ASCII range, and there are many different 8-bit character sets that disagree about what those values mean.

    Typically a UTF-8 byte with the high bit set is part of a multibyte sequence. However, according to the Wikipedia article on UTF-8, 0xFD is an invalid UTF-8 byte.
    Code:
    wchar_t wcmsg1[100];
    n = MultiByteToWideChar(CP_ACP,0,msg, strlen(msg),wcmsg1,100);
    wcmsg1[n] = 0;
    Since MultiByteToWideChar can return cchWideChar (100 in this example), this is a potential off-by-one overflow.
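
    The usual way around that kind of off-by-one is to hand the converter one slot fewer than the buffer holds, so the returned count is always a valid index for the terminator. With the code above that would mean passing 99 rather than 100 as cchWideChar. The sketch below shows the same pattern using the standard C mbstowcs() as a stand-in (an assumption, chosen only so the example compiles outside Windows); to_wide_terminated is a hypothetical helper name, not part of any API.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert src into dst, which has room for dst_len
   wide characters, always leaving one slot free for the terminator.
   Returns the number of wide characters written (terminator excluded),
   or (size_t)-1 if dst_len is 0 or src holds an invalid sequence. */
size_t to_wide_terminated(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;

    /* Reserve the last slot: at most dst_len - 1 characters are written. */
    n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1) {
        dst[0] = L'\0';
        return n;
    }

    dst[n] = L'\0';   /* n <= dst_len - 1, so this index is in bounds */
    return n;
}
```

    Called with a 4-element buffer and "Hello", this writes three characters plus the terminator instead of running past the end.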

  3. #3
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Hmm, now that you mention it, I do remember that the backwards compatibility was only for 7-bit ASCII. Oh well, I guess I just need to allow the user to switch between UTF-8 and ASCII encodings. Thanks for the help, mouse.
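
    If the input is known to be in a single-byte encoding, another option is to re-encode it as UTF-8 before converting with CP_UTF8. A minimal sketch, assuming the bytes are ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point. Windows-1252, a common CP_ACP, differs from Latin-1 in the 0x80-0x9F range, so this is not a general replacement for CP_ACP. latin1_to_utf8 is a hypothetical helper name.

```c
#include <stddef.h>

/* Hypothetical helper: re-encode a NUL-terminated ISO-8859-1 string as UTF-8.
   Returns the number of bytes written (terminator excluded), or (size_t)-1
   if dst is too small to hold the result plus the terminator. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dst_len)
{
    size_t out = 0;

    if (dst_len == 0)
        return (size_t)-1;

    for (; *src != '\0'; ++src) {
        unsigned char c = (unsigned char)*src;

        if (c < 0x80) {
            /* 7-bit ASCII is copied unchanged. */
            if (out + 1 >= dst_len)
                return (size_t)-1;
            dst[out++] = (char)c;
        } else {
            /* U+0080..U+00FF need two bytes: 110xxxxx 10xxxxxx. */
            if (out + 2 >= dst_len)
                return (size_t)-1;
            dst[out++] = (char)(0xC0 | (c >> 6));
            dst[out++] = (char)(0x80 | (c & 0x3F));
        }
    }

    dst[out] = '\0';
    return out;
}
```

    For "Hello \253 Hello" the single byte 0xAB becomes the two bytes 0xC2 0xAB, which is the UTF-8 form of U+00AB.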

    Since MultiByteToWideChar can return cchWideChar (100 in this example), this is a potential off-by-one overflow.
    lol, I know that. It was just a quick example.

    However, according to the Wikipedia article on UTF-8, 0xFD is an invalid UTF-8 byte.
    \253 is 0xAB, not 0xFD
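
    That still accounts for the dropped character: 0xAB is 10101011 in binary, which is the bit pattern of a UTF-8 continuation byte, and a continuation byte with no lead byte in front of it is not a valid sequence. A rough sketch of how single bytes can be classified from their high bits, following the UTF-8 layout in RFC 3629; the enum and function names are made up for illustration. As far as I know, passing MB_ERR_INVALID_CHARS in dwFlags makes MultiByteToWideChar fail on such input rather than quietly skipping or replacing it.

```c
/* Byte classes in UTF-8 (names are hypothetical, for illustration only). */
enum utf8_class {
    UTF8_ASCII,         /* 0xxxxxxx: a complete character           */
    UTF8_CONTINUATION,  /* 10xxxxxx: only valid after a lead byte   */
    UTF8_LEAD2,         /* 110xxxxx: starts a 2-byte sequence       */
    UTF8_LEAD3,         /* 1110xxxx: starts a 3-byte sequence       */
    UTF8_LEAD4,         /* 11110xxx: starts a 4-byte sequence       */
    UTF8_INVALID        /* never appears in well-formed UTF-8       */
};

enum utf8_class classify_utf8_byte(unsigned char b)
{
    if (b < 0x80) return UTF8_ASCII;
    if (b < 0xC0) return UTF8_CONTINUATION;  /* 0x80..0xBF */
    if (b < 0xC2) return UTF8_INVALID;       /* 0xC0, 0xC1: overlong forms */
    if (b < 0xE0) return UTF8_LEAD2;         /* 0xC2..0xDF */
    if (b < 0xF0) return UTF8_LEAD3;         /* 0xE0..0xEF */
    if (b < 0xF5) return UTF8_LEAD4;         /* 0xF0..0xF4 */
    return UTF8_INVALID;                     /* 0xF5..0xFF, includes 0xFD */
}
```

    So 0xAB classifies as a continuation byte and 0xFD as invalid; either one, standing alone between two spaces, is malformed UTF-8.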

  4. #4
    anonytmouse
    Yes, my avatar is stolen
    Join Date
    Dec 2002
    Posts
    2,544
    \253 is 0xAB, not 0xFD
    Ah, escape sequences are in octal (unless preceded by x) rather than decimal. Learn something new (nearly) every day.
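
    A quick way to check the arithmetic, as a small sketch (the variable and function names are just for illustration): \253 in octal is 2*64 + 5*8 + 3 = 171, which is 0xAB, while decimal 253 is 0xFD and would be written \375 or \xFD.

```c
#include <stdio.h>

/* Octal escape: 253 octal = 2*64 + 5*8 + 3 = 171 = 0xAB. */
static const char octal_escape[] = "\253";
/* Hex escape for the same byte value. */
static const char hex_escape[] = "\xAB";
/* Decimal 253 is 0xFD, which is 375 in octal. */
static const char decimal_253[] = "\375";

/* Print the numeric value of the single character in each string. */
void show_escapes(void)
{
    printf("\\253 -> %d\n", (unsigned char)octal_escape[0]);
    printf("\\xAB -> %d\n", (unsigned char)hex_escape[0]);
    printf("\\375 -> %d\n", (unsigned char)decimal_253[0]);
}
```

    The casts to unsigned char matter on platforms where plain char is signed, since the raw values would otherwise print as negative numbers.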
