Character Encoding

**bithub** · 01-06-2006

Let's say I have the following:

Code:

#include <windows.h>
#include <string.h>

int main(void) 
{
	wchar_t wcmsg1[100];
	wchar_t wcmsg2[100];
	char msg[] = "Hello \253 Hello";
	int n;

	n = MultiByteToWideChar(CP_ACP,0,msg, strlen(msg),wcmsg1,100);
	wcmsg1[n] = 0;
	n = MultiByteToWideChar(CP_UTF8,0,msg, strlen(msg),wcmsg2,100);
	wcmsg2[n] = 0;

	MessageBoxW(NULL,wcmsg1,L"wcmsg1",0);
	MessageBoxW(NULL,wcmsg2,L"wcmsg2",0);
  
	return 0;
}

Why does MultiByteToWideChar() drop the \253 character when you encode from UTF-8? I was under the impression that UTF-8 was completely backwards compatible with char strings, but this doesn't seem to be the case. Do any Unicode gurus know what is going on here?

**anonytmouse** · 01-07-2006

Why does MultiByteToWideChar() drop the \253 character when you encode from UTF-8? I was under the impression that UTF-8 was completely backwards compatible with char strings, but this doesn't seem to be the case.

Any 7-bit ASCII character (0 - 127) has the same value in UTF-8. Logically, UTF-8 cannot share values with the 8-bit character sets above that range: it needs those byte values to encode Unicode characters beyond ASCII, and there are many different 8-bit character sets that assign them differently.

Typically a UTF-8 byte with the high bit set is part of a multibyte sequence. However, according to the Wikipedia article on UTF-8, 0xFD is an invalid UTF-8 byte.
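As a rough illustration of the byte roles described above, here is a minimal sketch that classifies a single byte according to the ranges in the current UTF-8 definition (RFC 3629). The enum and the function name `utf8_classify` are made up for this example and are not part of any Windows or standard C API.

```c
/* Role a single byte can play in well-formed UTF-8 (RFC 3629 ranges). */
enum utf8_byte_kind {
    UTF8_ASCII,        /* 0x00-0x7F: a complete one-byte character     */
    UTF8_CONTINUATION, /* 0x80-0xBF: only valid after a lead byte      */
    UTF8_LEAD2,        /* 0xC2-0xDF: starts a two-byte sequence        */
    UTF8_LEAD3,        /* 0xE0-0xEF: starts a three-byte sequence      */
    UTF8_LEAD4,        /* 0xF0-0xF4: starts a four-byte sequence       */
    UTF8_INVALID       /* 0xC0, 0xC1, 0xF5-0xFF: never appear in UTF-8 */
};

/* Hypothetical helper: classify one byte by value alone. */
enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if (b <= 0x7F) return UTF8_ASCII;
    if (b <= 0xBF) return UTF8_CONTINUATION;
    if (b <= 0xC1) return UTF8_INVALID;
    if (b <= 0xDF) return UTF8_LEAD2;
    if (b <= 0xEF) return UTF8_LEAD3;
    if (b <= 0xF4) return UTF8_LEAD4;
    return UTF8_INVALID;
}
```

With these ranges, 0xFD classifies as invalid, while any byte in 0x80-0xBF is a continuation byte. A continuation byte sitting between two ASCII characters, with no lead byte in front of it, is not well-formed UTF-8, so a decoder has to drop or replace it.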

Code:

wchar_t wcmsg1[100];
n = MultiByteToWideChar(CP_ACP,0,msg, strlen(msg),wcmsg1,100);
wcmsg1[n] = 0;

Since MultiByteToWideChar can return cchWideChar (100 in this example), this is a potential off-by-one overflow.
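One common way to avoid this is to reserve the last element for the terminator and tell the conversion function about one element fewer. The sketch below shows the pattern with the standard C function `mbstowcs` as a portable stand-in, since it has the same property of returning the full count without writing a terminator when the output fills up. The helper name `convert_terminated` is hypothetical. With MultiByteToWideChar the equivalent change would be passing 99 instead of 100 as the last argument.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert src into dst, which holds dst_len wide
 * characters, and always leave dst null-terminated.
 * Returns the number of wide characters written, excluding the terminator.
 */
size_t convert_terminated(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;

    if (dst_len == 0)
        return 0;                         /* no room even for a terminator */

    /* Offer only dst_len - 1 slots so index n below is always in range. */
    n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1)
        n = 0;                            /* invalid input: empty result */

    dst[n] = L'\0';
    return n;
}
```

Because the conversion never sees the final slot, the largest value `n` can take is `dst_len - 1`, which is still a valid index for the terminator.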

**bithub** · 01-07-2006

Hmm, now that you mention it, I do remember that the backwards compatibility was only for 7-bit ASCII. Oh well, I guess I just need to allow the user to switch between UTF-8 and ASCII encodings. Thanks for the help, mouse.

Since MultiByteToWideChar can return cchWideChar (100 in this example), this is a potential off-by-one overflow.

lol, I know that. It was just a quick example.

However, according to the Wikipedia article on UTF-8, 0xFD is an invalid UTF-8 byte.

\253 is 0xAB, not 0xFD

**anonytmouse** · 01-07-2006

\253 is 0xAB, not 0xFD

Ah, escape sequences are in octal (unless preceded by x) rather than decimal. Learn something new (nearly) every day.
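For anyone else tripped up by this, a small sketch of the difference between the two escape forms. The helper names `first_byte` and `show_escapes` are invented for the example.

```c
#include <stdio.h>

/* Value of the first byte of s as an unsigned number (0-255). */
unsigned first_byte(const char *s)
{
    return (unsigned char)s[0];
}

/* Print the byte produced by an octal escape and by a hex escape. */
void show_escapes(void)
{
    /* \253 is octal: 2*64 + 5*8 + 3 = 171 = 0xAB */
    printf("\\253 -> 0x%02X (%u)\n", first_byte("\253"), first_byte("\253"));
    /* \xFD is hexadecimal: 253 decimal */
    printf("\\xFD -> 0x%02X (%u)\n", first_byte("\xFD"), first_byte("\xFD"));
}
```

Calling `show_escapes()` prints `\253 -> 0xAB (171)` followed by `\xFD -> 0xFD (253)`, so to get decimal 253 into a string literal you would write `\xFD` or `\375`.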
