Originally Posted by
new_in_c++
I'm programming Win32, and I noticed that every function has an ANSI version (for example CreateWindowA) and a wide-character version (CreateWindowW in this case). Can I use the ANSI versions freely, without any data loss?
Not if you're processing Unicode text, you can't.
The Unicode (wide) versions of the Windows API calls are selected automatically when you use ...
... at the top of your source files. It has to come before the first #include.
From there on you use the WCHAR types (PWCHAR, LPCWCHAR, etc.) instead of the CHAR types.
Alternatively, you can use the TCHAR types throughout; they switch automatically with the UNICODE define as well. In C or C++ you also need to use the wcs* versions of the string functions (wcslen, wcscpy, and so on) instead of the str* ones.
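In case it helps, the usual setup looks like this. This is a sketch of a build-setup fragment, not a complete program: UNICODE drives the Windows headers and _UNICODE drives tchar.h in the C runtime, and both must be defined before any #include (or set project-wide in the compiler options).

```c
/* Sketch (Windows-only fragment): define these before any include so
   the headers pick the W entry points for you. */
#define UNICODE      /* Windows headers: CreateWindow -> CreateWindowW */
#define _UNICODE     /* CRT / tchar.h:   _tcscpy      -> wcscpy        */
#include <windows.h>
#include <tchar.h>
```

With both defined, TCHAR becomes WCHAR and the _T("...") literal macro produces wide strings, so the same source can still be built single-byte by dropping the defines.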
But Unicode is far from simple; in fact it's a royal pain in the backside...
Unicode is not one thing... there are at least 5 major variations: UTF8, UTF16le, UTF16be, UTF32le and UTF32be. The numbers signify the size in bits of each code unit; "le" means "Little Endian", which is most Windows systems, and "be" means "Big Endian", which pretty much means "everyone else". So not only do you have to convert between code-unit sizes, you can end up re-ordering the bytes inside each unit as well.
Lots and lots to read up on... HERE
Theoretically, Unicode text files are supposed to have a Byte Order Mark (BOM) at the beginning to make identifying the file content easier. But it's not always there, so you are stuck having to "discover" the encoding yourself.
Windows (up to and including 7, as of this writing) is internally UTF16le.
You will need to manually reverse the byte order of each code unit for UTF16be.
Windows provides the APIs you need for converting to and from UTF8 (the current internet standard): MultiByteToWideChar() and WideCharToMultiByte().
UTF32 is not directly supported, but conversion libraries are becoming available.
I read somewhere that Unicode works faster. Is it true?
Given that this is what it takes to open a "plain text" playlist file in a Unicode world, I'll let you decide...
Code:
// open and translate playlist
BOOL M3UOpen(PWCHAR FileName)
{ PBYTE rf;                              // raw file data
  DWORD br;                              // bytes read
  // load the raw file
  { HANDLE pl;                           // playlist file handle
    DWORD  fs;                           // file size
    // get path to file (FilePath is a module-scope WCHAR buffer, not shown here)
    wcsncpy(FilePath, FileName, MAX_PATH - 1);
    FilePath[MAX_PATH - 1] = L'\0';      // wcsncpy does not guarantee termination
    PathRemoveFileSpec(FilePath);
    wcscat(FilePath, L"\\");
    // open the file
    pl = CreateFile(FileName, GENERIC_READ, 0, NULL, OPEN_EXISTING,
                    FILE_ATTRIBUTE_NORMAL, NULL);
    if (pl == INVALID_HANDLE_VALUE)
      Exception(GetLastError());
    fs = GetFileSize(pl, NULL);
    rf = calloc(fs + 2, sizeof(BYTE));   // +2 zero bytes double as a wide terminator
    if (! ReadFile(pl, rf, fs, &br, NULL))
      Exception(GetLastError());
    CloseHandle(pl);
    if (br != fs)
      Exception(0xE0640007); }
  __try                                  // MSVC SEH; plain C has no try/finally
  { DWORD bom = *(DWORD*) rf;
    if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000)) // utf32le or utf32be bom
      Exception(0xE0640002);             // utf32 not supported
    else if ((bom & 0xFFFF) == 0xFFFE)   // utf16be bom
    { FlipEndian(rf, br);
      CopyWchar((PWCHAR) rf + 1); }      // +1 skips the bom
    else if ((bom & 0xFFFF) == 0xFEFF)   // utf16le bom
      CopyWchar((PWCHAR) rf + 1);
    else if ((bom & 0xFFFFFF) == 0xBFBBEF) // utf8 bom (EF BB BF)
      CopyMByte(rf + 3, br - 3);
    else                                 // no known bom, probe the file
    { if (! memchr(rf, 0x00, br))        // 8 bit text has no nulls
        CopyMByte(rf, br);               // ansi / utf8 without bom
      else
      { PBYTE lf = memchr(rf, 0x0A, br); // lf is always present as 1 byte
        if (! lf)
          Exception(0xE0640003);
        if ((lf - rf >= 3) &&            // don't read before the buffer
            ((! (*(DWORD*)(lf - 3) & 0x00FFFFFF)) || // utf32be, no bom
             (! (*(DWORD*) lf & 0xFFFFFF00))))       // utf32le, no bom
          Exception(0xE0640002);
        if ((lf - rf) & 1)               // big endian? (lf at odd offset)
          FlipEndian(rf, br);            // utf16be, no bom
        CopyWchar((PWCHAR) rf); } } }    // utf16le, no bom
  __finally
  { free(rf); }
  return 1; }
The real annoyance is that unless you are writing strictly for personal use in English, you are almost forced to use Unicode.