Thread: Unicode Support for std::getline()

  1. #1
    Registered User
    Join Date
    Jul 2015
    Posts
    64

    Unicode Support for std::getline()

    Hey guys, just a quick one.

    I've searched around a little and found nothing useful. When I'm coding, I always try to be Unicode-aware, whilst not actually enforcing a particular character set. (Perhaps overcomplicating things this way is not ideal, but I'm trying to get myself into good habits with regards to portability etc.)

    Anyway, I'm trying to receive a line of input from the user, into a 'String' typedef (either std::string or std::wstring, depending on an "#ifndef UNICODE" directive), using the std::getline function. The issue lies in the encoding that the standard input is expecting.

    Code:
    String demostring;
    std::getline(std::cin, demostring);
    Of course, this won't compile if Unicode is enabled, because std::cin is a narrow (char-based) stream and can't read into a std::wstring. Whereas:

    Code:
    String demostring;
    std::getline(std::wcin, demostring);
    will not compile for ANSI builds. (Interestingly enough, I note that up-to-date versions of Visual Studio (Pro) do not support ANSI-only builds. Anyone know why this is? Do Microsoft simply deem such support obsolete now?)

    Anyway, what I'm actually trying to achieve can be interpolated from:

    Code:
    #ifndef UNICODE
    typedef std::string String;
    typedef std::ifstream Ifstream;
    typedef std::cin Cin;
    #else
    typedef std::wstring String;
    typedef std::wifstream Ifstream;
    typedef std::wcin Cin;
    #endif
    Of course, this won't compile, but I think it demonstrates what I'm wanting to do. With the ifstream and wifstream typedefs, I'm wondering if it would be appropriate to initialise an 'Ifstream' and then redirect the standard input to this stream?

    This seems like a convoluted way to go about it though; is there perhaps a better solution? I'm aware that I could use an "#ifndef UNICODE" directive to selectively call "std::getline(std::cin, demostring);" or "std::getline(std::wcin, demostring);" within main() itself. However, I'm just wondering if it's possible to accomplish what I'm after without throwing arbitrary preprocessor directives into the body of the code.

    And there was me thinking this was going to be a "quick one"
    Many thanks for your time, good people!
    Abyssion

  2. #2
    Registered User
    Join Date
    Jun 2015
    Posts
    1,640
    I suppose you could do this:
    Code:
    #ifndef UNICODE
    typedef std::string    String;
    typedef std::ifstream  Ifstream;
    std::istream& Cin = std::cin;
    std::ostream& Cout = std::cout;
    #else
    typedef std::wstring   String;
    typedef std::wifstream Ifstream;
    std::basic_istream<wchar_t>& Cin = std::wcin;
    std::basic_ostream<wchar_t>& Cout = std::wcout;
    #endif
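
    Just to connect that back to the original getline call, here's a minimal usage sketch, assuming the block above sits in the same source file (if the Cin/Cout reference definitions go into a header shared by several .cpp files, they'd need to be extern declarations instead to avoid multiple-definition errors):
    Code:
    #include <iostream>
    #include <string>

    // ... the #ifndef UNICODE block from above goes here ...

    int main()
    {
        String demostring;
        std::getline(Cin, demostring);   // compiles in both builds:
        Cout << demostring << '\n';      // Cin/Cout match String's character type
        return 0;
    }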

  3. #3
    Registered User
    Join Date
    Aug 2010
    Location
    Poland
    Posts
    733
    In standard C++ Unicode support is almost non-existent compared to other languages. If you need Unicode and don't want to reinvent the wheel, you should find some 3rd party library.

    Different typedefs for xxx and wxxx will work, but you also need to take into account the target platform. For example, on Linux a char* is intended to contain a UTF-8 string. Also, in your example the condition should be reversed:
    Code:
    #ifndef UNICODE
    typedef std::wstring String;
    typedef std::wifstream Ifstream;
    typedef std::wcin Cin;
    #else
    typedef std::string String;
    typedef std::ifstream Ifstream;
    typedef std::cin Cin;
    #endif

  4. #4
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    You could abandon ANSI builds altogether if you want. I think you should if you're going to support Unicode anyway, but that's just me. UTF-8 dedicates the first 128 code points to ASCII AND they only require one byte per code point, so plain ASCII text is already valid UTF-8. You could use that instead; the conversions to and from other formats are fast and lossless. So you could use UTF-8 for much of the program and use, say, UTF-16 if you need to call a function expecting that format, like much of the WinAPI.
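
    To make that last point concrete, here's a Windows-only sketch (untested here) of keeping UTF-8 internally and converting only at the WinAPI boundary; Utf8ToUtf16 is just a name made up for the example:
    Code:
    #include <windows.h>
    #include <string>

    std::wstring Utf8ToUtf16(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        // First call asks how many wchar_t's are needed (including the null).
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
        std::wstring utf16(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &utf16[0], len);
        utf16.resize(len - 1);   // drop the extra null terminator
        return utf16;
    }

    // Usage at the API boundary, e.g.:
    //   MessageBoxW(nullptr, Utf8ToUtf16(text).c_str(), L"demo", MB_OK);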
    Last edited by whiteflags; 04-04-2016 at 01:45 PM. Reason: ooops

  5. #5
    Registered User
    Join Date
    Jul 2015
    Posts
    64
    Thank you both very much for the replies!

    I'm quite surprised by the lack of Unicode support in the standard to be honest; cheers for the information, kmdv. Although I'm not sure about the typedef condition reversal; note that I was using "ifndef UNICODE" as opposed to "ifdef UNICODE".

    I think, for now, I'll go with algorism's suggestion, as I am currently in need of a "quick fix". For the future though, I shall definitely have a gander around for a third-party C++ Unicode library, as you suggested kmdv. If any spring to mind, please do share!

    Many thanks guys, 'tis always appreciated

    EDIT: Oh, thank you too whiteflags! To be honest, all of these various encodings confuse me terribly, although it does seem like ANSI is pretty much obsolete now. Sounds like a good idea about using UTF-8 primarily; I had no idea about the similarities that it shared with ANSI. Usually, as soon as I see the words "variable length character set", I run away screaming; I find it difficult enough to work with fixed length encodings. I guess it's something that I'll have to get used to though; it certainly seems to be the way the industry is heading.
    Last edited by Abyssion; 04-04-2016 at 01:53 PM. Reason: To address whiteflags suggestions

  6. #6
    Registered User
    Join Date
    Aug 2010
    Location
    Poland
    Posts
    733
    Quote Originally Posted by Abyssion View Post
    Although I'm not sure about the typedef condition reversal; note that I was using "ifndef UNICODE" as opposed to "ifdef UNICODE".
    Right, I misread it.

  7. #7
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    Quote Originally Posted by Abyssion View Post
    EDIT: Oh, thank you too whiteflags! To be honest, all of these various encodings confuse me terribly, although it does seem like ANSI is pretty much obsolete now. Sounds like a good idea about using UTF-8 primarily; I had no idea about the similarities that it shared with ANSI. Usually, as soon as I see the words "variable length character set", I run away screaming; I find it difficult enough to work with fixed length encodings. I guess it's something that I'll have to get used to though; it certainly seems to be the way the industry is heading.
    No problem! I completely understand being confused by Unicode at first, but there is a lot of help out there. I wrote a pretty good post explaining how UTF-8 is encoded, so I want to recommend that. Give the forum a good search too. If that helps you out, let us know.
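
    If you don't want to chase the link straight away, the gist of the byte layout can be shown with a tiny (hypothetical, unvalidated) encoder sketch:
    Code:
    #include <string>

    // The lead byte's bit pattern says how long the sequence is;
    // every following byte is a 10xxxxxx continuation byte.
    std::string encode_utf8(char32_t cp)
    {
        std::string out;
        if (cp < 0x80) {                 // 0xxxxxxx (plain ASCII, 1 byte)
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 11110xxx + three continuation bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }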

  8. #8
    Registered User
    Join Date
    Jul 2015
    Posts
    64
    Actually, that whole thread is a pretty good source of information; many thanks for the link white!

    I shall pore over it properly tonight and try to get some sort of grasp on these encoding schemes. It has never actually occurred to me to go and read the standards for the various encodings before; that's why I highly appreciate a compact overview like the one you provided

    One thing that I don't understand, though, is why operating system developers and other high-level wizards would opt to support a character set, by default, that is smaller than any other given set? I could understand it a few years back when memory was cripplingly limited, but why not just support the largest character set (UTF-16?) by default and have done with it?

    Many thanks for your time and effort guys!

  9. #9
    Registered User
    Join Date
    Aug 2010
    Location
    Poland
    Posts
    733
    Quote Originally Posted by Abyssion View Post
    One thing that I don't understand, though, is why operating system developers and other high-level wizards would opt to support a character set, by default, that is smaller than any other given set? I could understand it a few years back when memory was cripplingly limited, but why not just support the largest character set (UTF-16?) by default and have done with it?
    Firstly, "the largest character set supported", as you call it (the term isn't really valid, since all the UTF encodings cover all Unicode characters), would be UTF-32, which uses 4 bytes per code unit. Secondly, fitting every character into a single fixed-size value is what older encodings were supposed to do (UCS-2). Unfortunately, it quickly turned out that 16 bits are not enough to store all possible characters, as 16 bits give only 65536 different combinations. UTF-16 goes beyond 16 bits because it is a variable-length encoding (some people seem to forget that important fact).

    Space is the problem. It isn't a problem for a few short strings; however, if you count all the characters in all the strings stored in your memory and on your disk, UTF-32 would introduce a significant overhead. Pick any HTML page's source and multiply its size by 4 - that's roughly the size it would occupy in UTF-32, since over 99% of the characters in an HTML page are from the Latin alphabet (the ones the ANSI encoding covers).
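
    A quick way to see both points - UTF-16 being variable-length and the UTF-32 overhead - is with C++11 Unicode string literals (a rough demo, not a benchmark):
    Code:
    #include <iostream>
    #include <string>

    int main()
    {
        // U+1F600 lies outside the 16-bit range, so UTF-16 needs two code units:
        std::u16string smiley16 = u"\U0001F600";
        std::u32string smiley32 = U"\U0001F600";
        std::cout << smiley16.size() << '\n';   // 2 -- a surrogate pair
        std::cout << smiley32.size() << '\n';   // 1 -- one 32-bit code unit

        // A plain ASCII snippet: 1 byte per character as UTF-8,
        // 4 bytes per character if stored as UTF-32.
        std::string    ascii8  = "<html><body>hello</body></html>";
        std::u32string ascii32 = U"<html><body>hello</body></html>";
        std::cout << ascii8.size() << " bytes as UTF-8\n";
        std::cout << ascii32.size() * sizeof(char32_t) << " bytes as UTF-32\n";
    }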
    Last edited by kmdv; 04-04-2016 at 02:54 PM.

  10. #10
    Registered User
    Join Date
    Jul 2015
    Posts
    64
    Quote Originally Posted by kmdv View Post
    Space is the problem. It isn't a problem for a few short strings; however, if you count all the characters in all the strings stored in your memory and on your disk, UTF-32 would introduce a significant overhead. Pick any HTML page's source and multiply its size by 4 - that's roughly the size it would occupy in UTF-32, since over 99% of the characters in an HTML page are from the Latin alphabet (the ones the ANSI encoding covers).
    Ahh, righto. Yeah, when you put it like that, I can see what you mean; the overhead would indeed be massive, especially considering that certain values could be denoted by one or two bytes.

    Just for clarification... Unicode is an encoding scheme in its own right, and not an umbrella term for UTF-8 or 16, correct?

    EDIT: I ask this because it seems strange that Visual Studio would support Unicode, specifically, as well as 'a' Multibyte Character Set (which one?) when Windows' own API is largely UTF-16-based.
    Last edited by Abyssion; 04-04-2016 at 03:20 PM. Reason: Clarification

  11. #11
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> Unicode is an encoding scheme in its own right ...
    I would say that "how" it's encoded is just part of it. I think of it as a set of numeric values (code points, currently up to U+10FFFF) that are assigned to glyphs - mostly from languages, but also things like math symbols, and even emoticons.

    The standard left things mostly "implementation defined". But compiler vendors have defined what they do - and library implementers have also made decisions - e.g. wxWidgets / wxString using wchar_t on all platforms, and Microsoft settling on UTF-16LE represented by a 2-byte wchar_t.
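
    If you're curious what your own toolchain decided, a one-liner makes it visible (the output is implementation defined, which is rather the point):
    Code:
    #include <iostream>

    int main()
    {
        // Typically 2 with MSVC (a UTF-16 code unit) and 4 on Linux/macOS (UTF-32).
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    }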

    Here is another post for you to digest that shows some of the implementation defined behavior: Non-English characters with cout

    [I see that link is in the other thread too...]

    gg

  12. #12
    Registered User
    Join Date
    Jul 2015
    Posts
    64
    Codeplug, that's a great explanation; thank you!
    Things are starting to become a bit clearer now; hopefully once I've digested the material that you and whiteflags provided, I'll be in a better position to tackle issues surrounding character encoding and platform targeting.

    Many thanks again guys; hope you all have a splendid day!

  13. #13
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    The easiest way to do Unicode is to use UTF8. UTF8 just "works" with your typical strings, algorithms and whatnot that expect all characters to be one byte. Save your files in UTF8, read your files in UTF8 (using the narrow versions of the API) and you should have 90% of the issues down. Internally, I'd always use UTF8, and that means std::string means "UTF8", not "ANSI". The best thing to do is use the type system to create, say, a UTF8 type and use that type in your interface. That makes it super clear that you're expecting UTF8 and not ANSI or whatever else. It also protects you against silly mistakes and makes you think twice before passing in a std::string.
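
    A rough sketch of what such a UTF8 type might look like (the name Utf8String and the details are made up for illustration, not taken from any library):
    Code:
    #include <string>
    #include <utility>

    class Utf8String
    {
    public:
        // The explicit constructor is the "think twice" step: a plain
        // std::string can't slip through an interface silently.
        explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

        const std::string& bytes() const { return bytes_; }

    private:
        std::string bytes_;   // invariant: holds well-formed UTF-8
    };

    // Interfaces can then say what they mean:
    void SaveUserName(const Utf8String& name);   // clearly expects UTF-8, not ANSI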

    When calling APIs, be very careful to check what encoding they expect. Windows narrow APIs always expect MBCS, which is a fancy way of saying ANSI + some extra characters that depend on the country you're living in (but it's different for each country!). Windows wide APIs expect UTF16LE (but beware - even though it expects UTF16LE, it actually expects a fixed-length string, so any character that takes more than two bytes to store won't work!). Boost narrow APIs usually pass the information along to Windows narrow APIs, so they expect MBCS. The standard library can do this too, with for example file streams. The data it reads is encoding agnostic, but the filename itself is either MBCS or Unicode encoded and will be passed to the appropriate API.

    For easy Unicode, remember these tips:
    - Use UTF8 internally. That means you save all external files in UTF8 and read them as UTF8.
    - Be sure to check the documentation for external libraries to see what encodings they expect. If they don't say, assume narrow chars are MBCS and wide chars are UTF16/UTF32, depending on the OS.
    - Avoid international filenames/paths.

  14. #14
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    The easiest way to do Unicode is to use UTF8. UTF8 just "works" with your typical strings, algorithms and whatnot that expect all characters to be one byte.
    Well, they may work. Multi-byte code points might be embedded in your string, which would need special handling if you want to match them or loop over them smoothly. I mean, one thing I personally wouldn't enjoy is taking a UTF-8 string iterator, executing ++i, and then being in the middle of a code point.
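
    For example, advancing by whole code points means skipping the 10xxxxxx continuation bytes rather than just incrementing the iterator (next_code_point is a made-up helper, not a standard function):
    Code:
    #include <string>

    // Advance past one whole code point by skipping continuation bytes.
    std::string::const_iterator next_code_point(std::string::const_iterator it,
                                                std::string::const_iterator end)
    {
        if (it == end) return it;
        ++it;   // step off the lead byte...
        while (it != end && (static_cast<unsigned char>(*it) & 0xC0) == 0x80)
            ++it;   // ...then over any continuation bytes
        return it;
    }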

  15. #15
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> but beware - even though it expects UTF16LE, it actually expects a fixed-length string, so any character that takes more than two bytes to store won't work!
    I don't believe that is true. Windows supports UTF16 in all its glory.

    It's MBCS, a multi-byte character set - but for Windows, it's always DBCS (a double-byte character set) - where the "multi" is never more than 2 bytes. In other words, there is no Windows code page that requires more than 2 bytes to represent a character.

    gg
