Thread: Unicode Support for std::getline()

  1. #16
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by Codeplug View Post
    >> but beware - even though it expects UTF16LE, it actually expects a fixed-length string, so any character that takes more than two bytes to store won't work!
    I don't believe that is true. Windows supports UTF16 in all its glory.
    Perhaps my information is outdated. I was under the impression that it did not support it.
    A quick test shows that it does work on W10, though.

    Code:
    #include <Windows.h>
    #include <string>

    int main()
    {
        // The board mangled the original character; U+1D11E (musical G clef) is a stand-in.
        // Yes, this takes two UTF16 words.
        std::wstring w = L"\U0001D11E";
        MessageBoxW(nullptr, w.c_str(), L"", 0);
    }
    So yeah, sorry about that. It seems that my information may not be reliable.
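
    For anyone who wants to check the "two UTF16 words" part without a message box, here is a minimal portable sketch. The helper names utf16_units and check_surrogate_pair are made up for illustration, and U+1D11E is just an example character outside the BMP.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of UTF-16 code units needed to encode one Unicode code point.
// Code points above U+FFFF need a surrogate pair (two 16-bit units).
constexpr std::size_t utf16_units(char32_t code_point)
{
    return code_point > 0xFFFF ? 2 : 1;
}

// Hypothetical self-check: compares the rule above with what the compiler
// actually stores in a UTF-16 string literal.
inline void check_surrogate_pair()
{
    const std::u16string clef = u"\U0001D11E"; // MUSICAL SYMBOL G CLEF, outside the BMP
    assert(clef.size() == utf16_units(U'\U0001D11E')); // two code units
    assert(clef[0] == 0xD834 && clef[1] == 0xDD1E);    // high and low surrogate
}
```

    Calling check_surrogate_pair() from main() should pass silently. Whether an API then renders the pair correctly is a separate question, which is what the MessageBoxW test above was about.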
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  2. #17
    Registered User
    Join Date
    Aug 2010
    Location
    Poland
    Posts
    733
    Quote Originally Posted by Elysia View Post
    The easiest way to do Unicode is to use UTF8. UTF8 just "works" with your typical strings, algorithms and whatnot that expects all characters to be one byte. Save your files in UTF8, read your files in UTF8 (using the narrow versions of the API) and you should have 90% of the issues down.
    I would be very cautious saying it just "works". As long as you deal with in-memory strings only it might be fine. However, if you try to use Unicode with any other standard C++ stuff, you are out of luck.

    Streams: as of now there is no standard way at all to open a file stream using a Unicode path. The only standard constructors take const char* and std::string, which of course expect MBCS on Windows. There is no way (no trivial way?) to read a UTF-16 stream either. Detecting BOM? Please...
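
    To illustrate what "detecting BOM" by hand looks like, here is a rough sketch. The Bom enum and detect_bom are hypothetical names, not anything from the standard library, and the sketch assumes the first bytes were already read through a narrow stream opened in binary mode. UTF-32 BOMs are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a buffer and reports which byte order mark,
// if any, it starts with. UTF-32 is not handled, so a UTF-32LE file
// (FF FE 00 00) would be reported as UTF-16LE by this sketch.
inline Bom detect_bom(const std::string& bytes)
{
    auto byte_at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && byte_at(0) == 0xEF && byte_at(1) == 0xBB && byte_at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && byte_at(0) == 0xFF && byte_at(1) == 0xFE)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && byte_at(0) == 0xFE && byte_at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

    A file without a BOM comes back as Bom::None, and then the encoding still has to be guessed or agreed on by other means.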

    Locales: the locale support in C++ is also broken. Functions like isalpha() or toupper() work on single chars, which means that they ignore the whole idea of variable-length encodings.
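
    A small sketch of the problem, assuming the default "C" locale and UTF-8 data in a std::string. The helper names toupper_bytewise and count_code_points are invented for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Applies std::toupper to every byte, which is what a naive
// "uppercase this std::string" loop does.
inline std::string toupper_bytewise(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}

// Counts code points in a UTF-8 string by skipping continuation bytes (10xxxxxx).
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

    With the Polish word "żółw" stored as UTF-8, size() reports 7 bytes for 4 characters, and the bytewise uppercase only touches the trailing ASCII 'w', because each of the other letters is spread over two bytes that toupper sees one at a time.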

    Regexes: standard regular expressions use locales under the hood, so they cannot be any better.

    The C++ support of Unicode is terrible. I find the overall design of code converters and locales overengineered, yet broken at their fundamental level. Last time I tried to use them to perform a simple Unicode-specific operation, I simply gave up and wrote my own solution from scratch.
    Last edited by kmdv; 04-05-2016 at 07:34 AM.

  3. #18
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by kmdv View Post
    I would be very cautious saying it just "works". As long as you deal with in-memory strings only it might be fine. However, if you try to use Unicode with any other standard C++ stuff, you are out of luck.
    Eh, I mean, I'm oversimplifying. It "mostly works" might be a better term.

    Streams: as of now there is no standard way at all to open a file stream using a Unicode path. The only standard constructors take const char* and std::string, which of course expect MBCS on Windows. There is no way (no trivial way?) to read a UTF-16 stream either. Detecting BOM? Please...
    To deal with Unicode paths, you have to use wide streams. Use a UTF locale to deal with wide -> narrow conversion (see Boost.Locale).
    Reading a UTF16 file is trivial if you stick to English filenames. Just use a narrow stream and read into a string, then convert it however you want.
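
    As a rough idea of the "convert it however you want" step, here is a hand-rolled sketch that turns raw UTF-16LE bytes (as read through a narrow binary stream, BOM already stripped) into UTF-8. The function names are hypothetical, and replacing malformed input with U+FFFD is a choice made for this example, not something the thread prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one code point to a UTF-8 string.
inline void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts raw UTF-16LE bytes (no BOM) to UTF-8.
// Unpaired surrogates and a trailing odd byte are replaced with U+FFFD.
inline std::string utf16le_to_utf8(const std::string& bytes)
{
    std::string out;
    const std::size_t n = bytes.size();
    // Reads the little-endian 16-bit code unit starting at byte i.
    auto unit_at = [&](std::size_t i) -> std::uint32_t {
        const std::uint32_t lo = static_cast<unsigned char>(bytes[i]);
        const std::uint32_t hi = static_cast<unsigned char>(bytes[i + 1]);
        return lo | (hi << 8);
    };
    std::size_t i = 0;
    while (i + 1 < n) {
        std::uint32_t cp = unit_at(i);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            std::uint32_t low = (i + 1 < n) ? unit_at(i) : 0;
            if (low >= 0xDC00 && low <= 0xDFFF) {      // valid pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;                           // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                               // stray low surrogate
        }
        append_utf8(out, cp);
    }
    if (i < n)
        append_utf8(out, 0xFFFD);                      // odd trailing byte
    return out;
}
```

    In real code a library is likely the better option, but the sketch shows that the conversion itself is mostly bit shuffling plus the surrogate pair case.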

    The C++ support of Unicode is terrible. I find the overall design of code converters and locales overengineered, yet broken at their fundamental level. Last time I tried to use them to perform a simple Unicode-specific operation, I simply gave up and wrote my own solution from scratch.
    I couldn't agree more, unfortunately. The easiest way to do Unicode string processing is to either use a library (best) or convert to UTF16/32 using wchar_t and use wide character library functions.

  4. #19
    Registered User
    Join Date
    Aug 2010
    Location
    Poland
    Posts
    733
    Quote Originally Posted by Elysia View Post
    To deal with Unicode paths, you have to use wide streams.
    I'm not sure we understand each other. I meant that there is no standard way to open a stream given a Unicode path. MSVC comes to the rescue and provides additional overloads of the open() function which take const wchar_t* (see https://msdn.microsoft.com/en-us/library/70bb7saf.aspx). These overloads, however, do not exist in the standard basic_ifstream/basic_ofstream templates, which is a shame (it does not matter whether ifstream/ofstream or wifstream/wofstream is used, as the filename parameter types do not depend on the stream's character type).
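
    For readers with a newer toolchain: C++17, which postdates this thread, added file stream constructors and open() overloads that take a std::filesystem::path, and a path can be built from wide or UTF-16 strings. A minimal sketch, assuming a C++17 compiler and standard library; round_trip is a made-up helper name.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

// Writes one line to a file and reads it back, naming the file with a
// std::filesystem::path instead of a const char* or std::string.
inline std::string round_trip(const std::filesystem::path& file, const std::string& line)
{
    {
        std::ofstream out(file);   // C++17 overload taking a filesystem::path
        out << line << '\n';
    }                              // stream closed here, data flushed
    std::ifstream in(file);
    std::string read_back;
    std::getline(in, read_back);
    return read_back;
}
```

    The path can be constructed from a char16_t literal such as u"\u017C\u00F3\u0142w.txt", and the library converts it to whatever the native path encoding is, so the narrow MBCS path never enters the picture.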

  5. #20
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    I did not actually know there was no const wchar_t* overload (I figured there was one for the wide streams, but alas). I know Microsoft provides a const wchar_t* overload for narrow streams.
    Yeah, that makes it a little harder.

  6. #21
    Registered User
    Join Date
    Jul 2015
    Posts
    64
    Wow, this all went over my head really quite fast.
    Cheers for the info! Whilst I am learning, I shall try to keep all of the points highlighted in mind.

    So when we tell Visual Studio to use a "Multi-Byte Character Set", which encoding does that refer to?

  7. #22
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by Abyssion View Post
    So when we tell Visual Studio to use a "Multi-Byte Character Set", which encoding does that refer to?
    Depends on your locale, so avoid it. Use Unicode only.

  8. #23
    Registered User
    Join Date
    Jul 2015
    Posts
    64
    Excellent, thank you!
