Originally Posted by
new_in_c++
I'm programming Win32, and I noticed that every function has an ANSI version (for example CreateWindowA) and a wide-character version (CreateWindowW in this case). Can I use the ANSI versions freely, without any data loss?
Not if you're processing Unicode text, you can't.
The Unicode (wide) versions of the Windows API calls are selected automatically when you use ...
... at the top of your source files. It has to come before the first #include.
From there on you use the WCHAR types (PWCHAR, LPCWCHAR, etc.) instead of the CHAR types.
Alternatively, you can use the TCHAR types throughout; they switch automatically with the UNICODE define as well. In C or C++ you also need to use the wcs* versions of the string functions (wcslen, wcscpy, and so on) instead of the str* ones.
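In case it helps, the usual setup looks like this. This is a sketch of a build-setup fragment, not a complete program: UNICODE drives the Windows headers and _UNICODE drives tchar.h in the C runtime, and both must be defined before any #include (or set project-wide in the compiler options).

```c
/* Sketch (Windows-only fragment): define these before any include so
   the headers pick the W entry points for you. */
#define UNICODE      /* Windows headers: CreateWindow -> CreateWindowW */
#define _UNICODE     /* CRT / tchar.h:   _tcscpy      -> wcscpy        */
#include <windows.h>
#include <tchar.h>
```

With both defined, TCHAR becomes WCHAR and the _T("...") literal macro produces wide strings, so the same source can still be built single-byte by dropping the defines.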
But Unicode is far from simple; in fact it's a royal pain in the backside...
Unicode is not one thing... there are at least 5 major variations: UTF8, UTF16le, UTF16be, UTF32le and UTF32be. The numbers signify the size in bits of each code unit; "le" means "Little Endian", which is most Windows systems, and "be" means "Big Endian", which pretty much means "everyone else". So not only do you have to convert between code-unit sizes, you can end up re-ordering the bytes inside each unit as well.
Lots and lots to read up on... HERE
Theoretically, Unicode text files are supposed to have a Byte Order Mark (BOM) at the beginning to make identifying the file content easier. But it's not always there, so you are stuck having to "discover" the encoding yourself.
Windows (up to and including 7, as of this writing) is internally UTF16le.
You will need to manually reverse the byte order of each code unit for UTF16be.
Windows provides the APIs you need for converting to and from UTF8 (the current internet standard): MultiByteToWideChar() and WideCharToMultiByte().
UTF32 is not directly supported, but conversion libraries are becoming available.
I read somewhere that Unicode works faster. Is it true?
Given that this is what it takes to open a "plain text" playlist file in a Unicode world, I'll let you decide...
Code:
// open and translate playlist
BOOL M3UOpen(PWCHAR FileName)
{ PBYTE rf;                              // raw file data
  DWORD br;                              // bytes read
  // load the raw file
  { HANDLE pl;                           // playlist file handle
    DWORD  fs;                           // file size
    // get path to file (FilePath is a module-scope WCHAR buffer, not shown here)
    wcsncpy(FilePath, FileName, MAX_PATH - 1);
    FilePath[MAX_PATH - 1] = L'\0';      // wcsncpy does not guarantee termination
    PathRemoveFileSpec(FilePath);
    wcscat(FilePath, L"\\");
    // open the file
    pl = CreateFile(FileName, GENERIC_READ, 0, NULL, OPEN_EXISTING,
                    FILE_ATTRIBUTE_NORMAL, NULL);
    if (pl == INVALID_HANDLE_VALUE)
      Exception(GetLastError());
    fs = GetFileSize(pl, NULL);
    rf = calloc(fs + 2, sizeof(BYTE));   // +2 zero bytes double as a wide terminator
    if (! ReadFile(pl, rf, fs, &br, NULL))
      Exception(GetLastError());
    CloseHandle(pl);
    if (br != fs)
      Exception(0xE0640007); }
  __try                                  // MSVC SEH; plain C has no try/finally
  { DWORD bom = *(DWORD*) rf;
    if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000)) // utf32le or utf32be bom
      Exception(0xE0640002);             // utf32 not supported
    else if ((bom & 0xFFFF) == 0xFFFE)   // utf16be bom
    { FlipEndian(rf, br);
      CopyWchar((PWCHAR) rf + 1); }      // +1 skips the bom
    else if ((bom & 0xFFFF) == 0xFEFF)   // utf16le bom
      CopyWchar((PWCHAR) rf + 1);
    else if ((bom & 0xFFFFFF) == 0xBFBBEF) // utf8 bom (EF BB BF)
      CopyMByte(rf + 3, br - 3);
    else                                 // no known bom, probe the file
    { if (! memchr(rf, 0x00, br))        // 8 bit text has no nulls
        CopyMByte(rf, br);               // ansi / utf8 without bom
      else
      { PBYTE lf = memchr(rf, 0x0A, br); // lf is always present as 1 byte
        if (! lf)
          Exception(0xE0640003);
        if ((lf - rf >= 3) &&            // don't read before the buffer
            ((! (*(DWORD*)(lf - 3) & 0x00FFFFFF)) || // utf32be, no bom
             (! (*(DWORD*) lf & 0xFFFFFF00))))       // utf32le, no bom
          Exception(0xE0640002);
        if ((lf - rf) & 1)               // big endian? (lf at odd offset)
          FlipEndian(rf, br);            // utf16be, no bom
        CopyWchar((PWCHAR) rf); } } }    // utf16le, no bom
  __finally
  { free(rf); }
  return 1; }
The real annoyance is that unless you are writing strictly for personal use in English, you are almost forced to use Unicode.