Thread: Ansi vs. Wide (std::string/std::wstring)

  1. #1
    Registered User
    Join Date
    Aug 2010
    Location
    Poland
    Posts
    733

    Ansi vs. Wide (std::string/std::wstring)

    Hi there,

    I have no specific question here, but I would like to know what others prefer/think/do. I hold the view that everyone on Earth should learn English, and whenever I create something I do it in English.

    Since C++ does not care much about encodings, I have always been using std::string, but I started considering wide chars recently and immediately ran into some trouble, as usual.

    Because I think that support for ANSI is mandatory, only conditional compilation can be taken into consideration. No "std::wstring everywhere" solution.

    The following are the major annoyances:

    1. The first one is the lack of generics. I would like to have tstring (and tchar) typedef'ed, which would be std::wstring in a UNICODE build and std::string in an ANSI one. Because I sometimes need to use C strings, I am forced to create conditional macros for strcpy/wcscpy, strlen/wcslen, etc. There are no predefined ones (see the sketch after this list).

    2. Lack of wchar_t support. Stream opening functions do not take const wchar_t*. So, if I want to open a Windows file with Unicode characters in the path, I still need to use CreateFileW (somehow...?). The same applies to std::exception.

    3. UTF madness. One compiler might use UTF-16 (by default) for wchar_t while another uses UTF-32 or something else entirely. For chars they will all use regular ASCII codes. I do not know how this affects English-only characters (whether they end up with the same code units), but I assume that if I decide to use wide chars I can expect anything. This is especially a problem when writing portable file formats.

    4. I have no idea whether file streams will have problems reading Unicode/ANSI text files. Will they convert automatically? (For example: reading 8-bit characters and putting them into a 16-bit string.)
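
    Something along these lines is what I mean for point 1 (the tstring/tchar/TSTR names are just my own working names, nothing predefined):

    Code:
        #include <string>
        #include <string.h>
        #include <wchar.h>

        #ifdef UNICODE
            typedef wchar_t      tchar;
            typedef std::wstring tstring;
            #define TSTR(s) L##s
            #define tcslen  wcslen
            #define tcscpy  wcscpy
        #else
            typedef char         tchar;
            typedef std::string  tstring;
            #define TSTR(s) s
            #define tcslen  strlen
            #define tcscpy  strcpy
        #endif

        int main()
        {
            tstring greeting = TSTR("hello");
            return tcslen(greeting.c_str()) == 5 ? 0 : 1;   // compiles the same way in both builds
        }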

    I am a bit new to UNICODE and I think I will get back to std::string.

  2. #2
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    For most real-world applications, I think people actually use system-specific APIs such as the WinAPI when they want to support an encoding (through regional settings on the system) or use a Unicode scheme. For instance, I know that DrawText() will work for any encoding, as long as there is a font available for it.

  3. #3
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    I know that DrawText() will work for any encoding
    No you don't, because it doesn't.

    The `DrawText' API will work for Windows' incomplete Unicode (as UTF-16), the user's specified locale code page, or the Windows variation of extended ASCII.

    Anyway, I suggest everyone trying to deal with internationalization stuff look at ICU, Pango, or something like them. It may seem like using a sledgehammer to swat a fly at first, but dealing with those issues solely within the standard library is impossible. You'll wind up spending way more time trying to provide a decent implementation of those same facilities than actually getting any real work done.

    Soma

  4. #4
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    Quote Originally Posted by kmdv View Post
    1. The first one is the lack of generics. I would like to have tstring (and tchar) typedef'ed, which would be std::wstring in a UNICODE build and std::string in an ANSI one. Because I sometimes need to use C strings, I am forced to create conditional macros for strcpy/wcscpy, strlen/wcslen, etc. There are no predefined ones.
    I usually create my own Tstring typedef, which is basically just std::basic_string<TCHAR>.
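
    Roughly like this (TCHAR and the _T() macro come from the Windows headers; it expands to std::wstring under UNICODE and std::string otherwise):

    Code:
        #include <windows.h>   // TCHAR
        #include <tchar.h>     // _T()
        #include <string>

        typedef std::basic_string<TCHAR> Tstring;

        int main()
        {
            // std::wstring with /DUNICODE /D_UNICODE, std::string otherwise.
            Tstring s = _T("hello");
            return (int)s.size();
        }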
    "I am probably the laziest programmer on the planet, a fact with which anyone who has ever seen my code will agree." - esbo, 11/15/2008

    "the internet is a scary place to be thats why i dont use it much." - billet, 03/17/2010

  5. #5
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by kmdv View Post
    I am a bit new to UNICODE and I think I will get back to std::string.
    If you are working in Windows you can #define UNICODE and all the API calls switch automatically to the W versions... If you use TCHAR and PTCHAR instead of CHAR and PCHAR, they will switch as well.

    I'm attaching a tchar.h file which responds to #define _UNICODE and sets up t-style aliases for most of the standard C99 library... (Depending on your compiler, you may need to make some minor changes...)

    That just leaves C++ to mess with....

    Of course the easy way is to #define UNICODE for Windows and do everything in wide strings. Windows' internal Unicode is UTF-16LE and line ends are CR/LF pairs, so the beginning of any Unicode text file should be tagged with the byte order mark (BOM) U+FEFF, which is stored as the bytes FF FE in little-endian.
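
    For instance, something along these lines (Microsoft's own <tchar.h> here; the file name is just for illustration):

    Code:
        #define UNICODE     // Windows headers pick the ...W entry points (CreateFileW, etc.)
        #define _UNICODE    // <tchar.h> picks the wide C runtime (_tcslen -> wcslen, etc.)
        #include <windows.h>
        #include <tchar.h>

        int main()
        {
            const TCHAR* path = _T("C:\\temp\\test.txt");   // _T() becomes L"..." in a Unicode build

            // With UNICODE defined, CreateFile resolves to CreateFileW and takes a wide path.
            HANDLE h = CreateFile(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
            if (h == INVALID_HANDLE_VALUE)
                return 1;

            size_t len = _tcslen(path);   // maps to wcslen (wide) or strlen (ANSI)
            CloseHandle(h);
            return len > 0 ? 0 : 1;
        }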
    Last edited by CommonTater; 03-06-2011 at 12:44 AM.

  6. #6
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by kmdv View Post
    1. The first one is the lack of generics. I would like to have tstring (and tchar) typedef'ed, which would be std::wstring in a UNICODE build and std::string in an ANSI one. Because I sometimes need to use C strings, I am forced to create conditional macros for strcpy/wcscpy, strlen/wcslen, etc. There are no predefined ones.
    There are predefined ones, but I don't think they're standard. Microsoft compilers have them in <tchar.h>, for example: _tcslen, _tcscpy, and so on.

    2. Lack of wchar_t support. Stream opening functions do not take const wchar_t*. So, if I want to open a Windows file with Unicode characters in the path, I still need to use CreateFileW (somehow...?). The same applies to std::exception.
    Incorrect. What you use daily, such as cout, fstream, etc., are all typedefs of template classes such as basic_ostream and basic_ifstream, and those templates support other character types.
    What you are looking for are the wide versions, such as wifstream and wcout.
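
    For example (note that opening a stream by a wchar_t* path, as opposed to reading wide characters from it, is a Microsoft extension; standard streams take only narrow paths until C++17's std::filesystem::path):

    Code:
        #include <fstream>
        #include <iostream>
        #include <string>

        int main()
        {
            std::wifstream in("input.txt");       // narrow path, wide characters read from it
        #ifdef _MSC_VER
            std::wifstream in2(L"wide_path.txt"); // wide path: Microsoft extension
        #endif

            // By default the stream still reads narrow bytes and widens them through
            // its locale; it does not decode UTF-8 or UTF-16 on its own.
            std::wstring line;
            while (std::getline(in, line))
                std::wcout << line << L'\n';
            return 0;
        }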

    4. I have no idea whether file streams will have problems reading Unicode/ANSI text files. Will they convert automatically? (For example: reading 8-bit characters and putting them into a 16-bit string.)
    Believe me when I say that will be a pain.
    Use an external library to read/write the files.
    For example, this: UTF8-CPP: UTF-8 with C++ in a Portable Way
    I used it, and it seems to be rather good. It could read UTF-8 files and then convert them to UTF-16 (for Windows) so I could use them with the Windows GUI functions.
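
    The usage is roughly like this (going by the library's utf8::utf8to16 helper; check its docs for the exact details):

    Code:
        #include <iterator>
        #include <string>
        #include "utf8.h"   // UTF8-CPP, header-only

        // Convert UTF-8 text (e.g. the raw bytes of a file) to UTF-16 so it can
        // be handed to the wide Windows API (wchar_t is 16 bits wide there).
        std::wstring to_utf16(const std::string& utf8)
        {
            std::wstring utf16;
            utf8::utf8to16(utf8.begin(), utf8.end(), std::back_inserter(utf16));
            return utf16;
        }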

    I am a bit new to UNICODE and I think I will get back to std::string.
    That will leave you wanting quite a bit. Should you use non-English characters, you'll end up severely annoyed at having to deal with locales. Use an external library to do all your internal processing, then convert to the appropriate Unicode format when displaying it on the screen.

    Good luck. You're in the same boat as me, now. I need Unicode in my code, as well. We might as well try to benefit from each other's experiences.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  7. #7
    Registered User
    Join Date
    Aug 2010
    Location
    Poland
    Posts
    733
    Thanks for the replies.

    If you are working in Windows you can #define UNICODE and all the API calls switch automatically to the W versions... If you use TCHAR and PTCHAR instead of CHAR and PCHAR, they will switch as well.
    I know about the WinAPI stuff and it is not a problem. I stay away from its typedefs because they make code non-portable.

    Incorrect. What you use daily, such as cout, fstream, etc., are all typedefs of template classes such as basic_ostream and basic_ifstream, and those templates support other character types.
    What you are looking for are the wide versions, such as wifstream and wcout.
    You're right, I missed it (the open() method). Exceptions are still char-based, but that does not actually matter.

    Use an external library to read/write the files.
    For example, this: UTF8-CPP: UTF-8 with C++ in a Portable Way
    I'm trying to stay away from external libraries, especially utility ones, but if the alternative were C++ locales, I would rather move to such an external library.

    I separate the GUI from the core (making two independent modules), so I'm not talking about display at all. The problem is the internals, especially file formats. When it comes to ASCII, I can open a stream in binary mode and read it as raw memory without caring about anything.
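
    For example, something as simple as this (the helper name is just for illustration):

    Code:
        #include <fstream>
        #include <iterator>
        #include <string>

        // Read a file byte-for-byte. For plain ASCII the bytes are the text, so no
        // encoding, locale, or conversion questions arise.
        std::string read_raw(const char* path)
        {
            std::ifstream in(path, std::ios::binary);
            return std::string((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
        }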

    In fact, I do not need UNICODE support at the moment. I wanted to reserve 2 bytes per char.

    I am just thinking about the future, since it may go one of two ways:
    people will learn English, or they will not.

    Who needs an IDE in his/her native language? No one, because it is confusing and unnecessary. I dream of this happening to every other kind of software.
    Last edited by kmdv; 03-06-2011 at 09:59 AM.

  8. #8
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    There is nothing wrong with using an external library written in pure C++ (i.e., portable). Such an attitude is not becoming of a programmer; it makes for more work and less gain.
    If it required you to bundle a DLL or similar, I could understand, but in this case it does not. Merge it into your code and you're done.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

