Thread: Non-English characters with cout

  1. #16
    spurious conceit MK27
    Join Date: Jul 2008 | Location: segmentation fault | Posts: 8,300
    Quote Originally Posted by Elysia View Post
    (then what the hell is the point of unicode!?!?!?!)
    On your system, nothing. If you get Unicode data from somewhere else (or need to transmit it), then worry about it.

    Otherwise, you do not need to make use of Unicode or wide character types unless you are dealing with characters that your native encoding cannot display and you want to use Unicode notation for them (e.g., \u2605 if you don't see a star below).
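    For instance, a minimal sketch of that notation (my illustration; it assumes a UTF-8 narrow execution character set, GCC's default -- MSVC would warn and substitute '?' if the ACP can't hold the character):
    Code:
    #include <iostream>

    int main() {
        // \u2605 (BLACK STAR) written as a universal-character-name,
        // so the source file itself stays pure ASCII.
        std::cout << "\u2605\n";
        return 0;
    }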

    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  2. #17
    C++まいる!Cをこわせ! Elysia
    Join Date: Oct 2007 | Location: Inside my computer | Posts: 24,654
    No, the C++ standard is broken in this aspect. Unicode is the holy grail to get away from all this codepage crap.
    It works perfectly when you simply use UTF-16 with Windows (avoiding the C++ standard library). I have done it before, just not writing it out via std::wcout.
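    For instance, a minimal sketch of that approach (my illustration, not the poster's actual code; note WriteConsoleW only works when output goes to a real console, not a redirected file):
    Code:
    #include <windows.h>

    int main()
    {
        // Hand the UTF-16 string straight to the console via the wide
        // Win32 API -- no C++ standard library streams involved.
        const wchar_t msg[] = L"\u00e7 \u2605\n";
        DWORD written;
        WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), msg,
                      (DWORD)(sizeof(msg) / sizeof(msg[0]) - 1), &written, NULL);
        return 0;
    }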
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  3. #18
    spurious conceit MK27
    Join Date: Jul 2008 | Location: segmentation fault | Posts: 8,300
    Quote Originally Posted by manasij7479 View Post
    So, as long as everything is, say, UTF-8 (as it is in my case), can I make a program language-independent simply by maintaining a resource 'dictionary' for all the literals being used?
    You don't even need to do that. The source file has a specific encoding. If you want to compile on a non-UTF-8 system, then you'd convert it to whatever encoding the compiler uses. If the compiler's encoding does not support some of your characters, then you'd need to use \u notation and wstrings, but I can't imagine there are many compilers with that issue. Of course, the standard does not guarantee that...

    But I can't understand where the encoding of the compiled executable factors into this.
    That determines the encoding of the output (not the encoding of the source file).

    Why does it matter when the other encoding is simply another data type?
    The encoding is not another data type. The "other data types" are for handling other encodings.

    Consider why this works in C:

    Code:
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char string[] = "মনসিজ";
        /* strlen() counts bytes, not characters: each of these
           Bengali characters takes 3 bytes in UTF-8 */
        printf("%s %zu\n", string, strlen(string));

        return 0;
    }
    I'm on a UTF-8 system, so the length of the string is 15: Bengali is above \u0800 (11 bits is the maximum a 2-byte UTF-8 character can hold), so each of the 5 characters takes 3 bytes.

    There is an issue using string functions, etc., on non-ASCII characters in C/C++. However, wstring doesn't really solve this if the system isn't UTF-16.

    I don't know what the solution is for UTF-8, but I would look for a third-party library or write functions using a char array to do what you want, such as getting the number of characters. WRT searching, just treat non-ASCII characters as strings (not single elements); UTF-8 is designed in such a way that there can be no "coincidences" where a string of 3 normal chars might be mistaken for a single 3-byte character.
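    For example, a minimal sketch of that byte-wise searching (assumes the literals and the system are both UTF-8):
    Code:
    #include <iostream>
    #include <string>

    int main() {
        // Continuation bytes (10xxxxxx) never match lead bytes (11xxxxxx)
        // or ASCII bytes, so find() cannot land mid-character by accident.
        std::string text("привет мир");
        std::string needle("мир");
        std::cout << text.find(needle) << '\n';  // byte offset: 13
        return 0;
    }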

  4. #19
    spurious conceit MK27
    Join Date: Jul 2008 | Location: segmentation fault | Posts: 8,300
    Quote Originally Posted by Elysia View Post
    No, the C++ standard is broken in this aspect.
    It's definitely very ugly.

    Unicode is the holy grail to get away from all this codepage crap.
    It works perfectly when you simply use UTF-16 with Windows (avoiding the C++ standard library).
    If the native encoding is UTF-16 (I did not realize Windows can be now, sorry). In that case, you will just have a different set of problems when you receive UTF-8 data or need to send it (the web is predominantly UTF-8).

    Unless all the computers in the world are forced to switch to UTF-16 (or UTF-8), no encoding is a transparent panacea.

    Quote Originally Posted by MK27 View Post
    write functions using a char array to do what you want, such as getting the number of characters.
    For example:

    Code:
    #include <iostream>
    #include <string>
    
    using namespace std;
    
    size_t UTF8_strlen (const string &data) {
        const char test = 1 << 6;  // mask for the 2nd most significant bit
        size_t len = data.size(), total = 0;
        for (size_t i = 0; i < len; i++) {
            if (data[i] < 0) {  // high bit set: part of a multibyte character (char is signed here)
                if (data[i] & test) total++;  // lead byte (11xxxxxx), not a continuation (10xxxxxx)
            } else total++;  // plain ASCII byte
        }
        return total;
    }
    
    int main (void) {
    	string russian("привет мир");
    	string bengali("মনসিজ");
    
    	cout << "Length of '" << russian << "' " << UTF8_strlen(russian) << endl;
    	cout << "Length of '" << bengali << "' " << UTF8_strlen(bengali) << endl;
    
    	return 0;
    }
    Since those are literals, this only works on a UTF-8 system:

    Length of 'привет мир' 10
    Length of 'মনসিজ' 5

    The (< 0) and 1 << 6 tests work because all bytes in a multibyte UTF-8 character have the high bit set (so, as signed chars, they are negative), but only the first byte also has the 2nd most significant bit set (the continuation bytes start with 10).

    You could use stuff like that to create a class for iterating UTF-8 strings by character.
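    A bare-bones sketch of that idea (my illustration; it assumes well-formed UTF-8 input and a UTF-8 terminal):
    Code:
    #include <iostream>
    #include <string>

    using namespace std;

    // Walks a UTF-8 string one character at a time, using the same
    // lead-byte/continuation-byte distinction as UTF8_strlen above.
    class UTF8_iterator {
        const string &data;
        size_t pos;
    public:
        UTF8_iterator(const string &s) : data(s), pos(0) {}
        bool done() const { return pos >= data.size(); }
        // Returns the next character as a 1-4 byte substring.
        string next() {
            size_t start = pos++;
            // Skip continuation bytes (those that start with 10).
            while (pos < data.size() &&
                   (static_cast<unsigned char>(data[pos]) & 0xC0) == 0x80)
                pos++;
            return data.substr(start, pos - start);
        }
    };

    int main() {
        string bengali("মনসিজ");
        for (UTF8_iterator it(bengali); !it.done(); )
            cout << it.next() << '\n';  // one character per line
        return 0;
    }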

  5. #20
    C++まいる!Cをこわせ! Elysia
    Join Date: Oct 2007 | Location: Inside my computer | Posts: 24,654
    Quote Originally Posted by MK27 View Post
    If the native encoding is UTF-16 (I did not realize Windows can be now, sorry).
    Windows uses UTF-16 internally. Anything else is just converted to that in the end.

    In that case, you will just have a different set of problems when you receive UTF-8 data or need to send it (the web is predominantly UTF-8).

    Unless all the computers in the world are forced to switch to UTF-16 (or UTF-8), no encoding is a transparent panacea.
    Definitely. But there are libraries to convert between different types of Unicode, so that's not as much a pain in the ass as std::wcout is.
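    (On Windows you don't even need a library for the UTF-8/UTF-16 leg of that; a minimal sketch using the Win32 API -- utf8_to_utf16 is just my illustrative helper name:)
    Code:
    #include <windows.h>
    #include <string>

    // UTF-8 -> UTF-16 via the OS itself; no third-party library needed.
    std::wstring utf8_to_utf16(const std::string &s)
    {
        // First call computes the required length (includes the L'\0'
        // because we pass -1 as the input length).
        int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
        std::wstring w(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
        w.resize(n - 1);  // drop the trailing L'\0'
        return w;
    }

    int main()
    {
        std::wstring w = utf8_to_utf16("\xC3\xA7");  // the UTF-8 bytes for U+00E7
        return 0;
    }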

  6. #21
    Do you C what I C? jamesallen4u
    Join Date: Oct 2011 | Posts: 43
    Thanks guys for all of your replies. Post #14 definitely increased my understanding about how Unicode works.
    Linux Distro: Ubuntu 12.04
    Browser: Chromium

  7. #22
    Registered User Codeplug
    Join Date: Mar 2003 | Posts: 4,981
    I decided to gather up some of my posts on extended characters in source code, Unicode, and console I/O. I've posted this information before, but not all in one thread. With a little cleanup, this might make a nice FAQ entry one day.

    Extended characters in your source files
    First, we need to know exactly what we have in memory after compiling code with extended character literals. In the C++ standards, there are 3 types of "character sets" to consider (quoting C++03):

    1) The Basic Source Character Set
    Quote Originally Posted by ISO/IEC 14882:2003(E)
    Character Sets 2.2.1
    The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
    Code:
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    0 1 2 3 4 5 6 7 8 9
    _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
    2) Physical Source File Characters
    Quote Originally Posted by ISO/IEC 14882:2003(E)
    Phases of translation 2.1.1.1
    Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set... Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. ...
    3) The Execution Character Set
    Quote Originally Posted by ISO/IEC 14882:2003(E)
    Phases of translation 2.1.1.5
    Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (2.13.2, 2.13.4).

    Character Sets 2.2.3
    The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character ... The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.
    This hasn't really changed in C++11. The new string literals in C++11 do give us control of the in-memory encoding of strings, but there still has to be a translation from the "source file character set" to the "execution character set", which is implementation-defined. A better way to say this is that there is a conversion from one encoding to another.
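    For example (C++11 only; a brief sketch of those new prefixes):
    Code:
    int main()
    {
        // Each C++11 prefix pins down the literal's in-memory encoding,
        // independent of the implementation-defined execution character set:
        const char     *a = u8"\u00e7";  // UTF8 code units
        const char16_t *b = u"\u00e7";   // UTF16 code units
        const char32_t *c = U"\u00e7";   // UTF32 code units
        (void)a; (void)b; (void)c;       // silence unused-variable warnings
        return 0;
    }//main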

    How GCC and MSVC map "physical source file characters"
    Next I'd like to talk about GCC's and MSVC's implementation-defined behavior for C++03 string literals. For MSVC, I'm referring to VS2008 or higher. The first thing to look at is how they interpret the "physical source character set". If the source file has a Unicode encoding with a BOM, then the source file encoding is known. If the compiler doesn't know what the source file encoding is, then:
    - GCC assumes the file is encoded as specified by the Posix-locale, or if it can't get this information it assumes UTF8. Outside of a Posix emulation layer like MSYS or Cygwin, MinGW seems to assume UTF8 always (based on experimentation). This can be overridden with the command line parameter: -finput-charset.
    - MSVC assumes the file is ACP-encoded, or in other words, encoded using the codepage which GetACP() returns. This is the ANSI codepage associated with the system locale in Windows.

    Converting to the "execution character sets"
    Now that we know how source characters are interpreted, we can look at their conversion to the "execution character sets". There are 2 execution character sets: narrow and wide. Narrow strings are stored using the char type, and wide strings are stored using the wchar_t type. Here is how each compiler performs the conversion for:
    Narrow literals/strings:
    - GCC defaults to UTF8, unless overridden via -fexec-charset.
    - MSVC always encodes using the ACP. So if you use a narrow literal that the ACP doesn't support, you'll just get a warning and the compiler will change your character into a '?'.
    Wide literals/strings:
    - GCC supports both a 2 byte wchar_t (like in Windows) and 4 byte wchar_t (like on most *nix's). For 2 byte wchar_t systems, the default is UTF16. For 4 byte wchar_t systems, the default is UTF32. Both will use the system's native byte-order. This can be overridden with -fwide-exec-charset (and -fshort-wchar for forcing a 2 byte wchar_t).
    - MSVC uses UTF16LE since Windows always uses a 2 byte wchar_t and is always little endian. MSVC also supports a "#pragma setlocale", which is useful if the source file is codepage encoded and contains extended characters within wide-string literals. For example, consider this statement: "const wchar_t w = L'ç';". That character is encoded as 0x87 in some codepages, and 0xE7 in others. Remember that MSVC assumes the file is ACP encoded (if there is no BOM), which may be the wrong assumption. By using "#pragma setlocale(".852")", MSVC will know that the 0x87 byte in the source file is really the Unicode character U+00E7 and will generate the proper wchar_t value (see the sketch below).
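    For example, a minimal sketch of that scenario (it assumes the source file really is saved in codepage 852, so the 'ç' below is stored as the byte 0x87):
    Code:
    #pragma setlocale(".852")   // tell MSVC the source bytes are CP852, not the ACP
    #include <stdio.h>
    int main()
    {
        const wchar_t w = L'ç';         // byte 0x87 in the file -> U+00E7 in memory
        printf("%04x\n", (unsigned)w);  // prints 00e7
        return 0;
    }//main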

    So now we know what's in memory for our narrow and wide source strings. Here's what you should take away from this knowledge:
    - As soon as you put extended character literals in your source code, you are in implementation defined territory.
    - If you must put extended character literals in your source code: 1) save the source file with a Unicode encoding, preferably with a BOM. 2) If using MSVC, extended character literals should always be wide. 3) Hope that no one ever mangles your source code by saving it incorrectly.
    - For the best compatibility with editors and compilers, use "universal character names" to represent extended characters in wide literals only. For example, use L"\u00e7" instead of L"ç".

    Some code!
    So now that we have something meaningful in memory for our string literals, chances are you'll want to use it with standard I/O facilities. This is where C/C++ locales become important. Consider the following [invalid] code:
    Code:
    #include <stdio.h>
    #include <wchar.h>
    int main()
    {
        fputws(L"\u00e7\n", stdout);
        return 0;
    }//main
    Even though the recommendations above have been followed, this code still doesn't work. This is because C/C++ programs start up with the "C" locale in effect by default, and the "C" locale does not support conversions of any characters outside the "basic character set". This code is much more likely to succeed:
    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    int main()
    {
        setlocale(LC_ALL, "");
        fputws(L"\u00e7\n", stdout);
        return 0;
    }//main
    The call to setlocale() says "I want to use the user's default narrow string encoding". This encoding is based on the Posix locale for Posix environments. In Windows, this encoding is the ACP, which is based on the system locale. However, the success of this code depends on two things: 1) the narrow encoding must support the wide character being converted; 2) the font/GUI must support the rendering of that character. In Windows, #2 is often solved by setting cmd.exe's font to Lucida Console.
    Here's the corresponding C++ sample:
    Code:
    #include <iostream>
    #include <locale>
    using namespace std;
    int main()
    {
        wcout.imbue(locale(""));
        wcout << L"\u00e7" << endl;
        return 0;
    }//main
    Sadly, there are bugs in both MSVC and my Linux VM that prevent this from working properly (even though the prior C sample works fine). The bug will be fixed in VC11. My Linux VM is Mint 11, eglibc 2.13, libstdc++ 20110331, gcc 4.5.2-8ubuntu4 - I'm not sure if it's a known issue or not. The workaround for both is to call C's setlocale() instead.

    Yet another conversion for Windows console I/O
    Console I/O on Windows is further complicated by the existence of a "console codepage", which is distinct and separate from the standard locale's narrow encoding. The above samples under Windows will perform the following conversions:
    - L"\u00e7" (as UTF16LE) is first converted to a multi-byte (char) string using the locale's narrow encoding, the ACP. The ACP is 1252 for me, so the result is "\xe7".
    - "\xe7" (as CP1252) is then converted to the Windows console codepage. For me that's CP437 by default, so the result of the conversion is "\x87".
    At this point, the "\x87" either goes through WriteFile() or WriteConsoleA(). The OS will recognize that the handle is the stdout handle and will use the console codepage to interpret the bytes. Then cmd.exe just needs to be using a font that supports that character.

    This extra conversion under Windows can be avoided by setting the console codepage to be equal to the ACP:
    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    #include <windows.h>
    int main()
    {
        SetConsoleOutputCP(GetACP());
        SetConsoleCP(GetACP());
        setlocale(LC_ALL, "");
        fputws(L"\u00e7\n", stdout);
        return 0;
    }//main
    You can also change the console codepage directly on the command line via the "chcp" command (e.g., "chcp 1252").

    Direct Unicode I/O on the console
    Ideally there wouldn't be any conversions involving a non-Unicode encoding. On *nix this can be done if the locale is UTF8, making the compiler's narrow "execution character set" UTF8 - which is the common case. Any conversions would then be wide to narrow, or UTF32/16 to UTF8. The nice thing here is that the conversion is lossless.
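    For example, a minimal sketch assuming a UTF8 Posix locale (en_US.UTF-8 or similar):
    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    int main()
    {
        setlocale(LC_ALL, "");       // the narrow encoding becomes UTF8
        fputws(L"\u2605\n", stdout); // UTF32 wchar_t -> UTF8, losslessly;
                                     // no single-byte codepage holds this star
        return 0;
    }//main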

    On Windows, the only way to achieve direct Unicode output is via WriteConsoleW(). The MS CRT (2008 and newer) provides a way to use C/C++ I/O facilities for direct Unicode output:
    Code:
    #include <fcntl.h>
    #include <io.h>
    #include <cstdio>
    #include <cwchar>
    #include <iostream>
    using namespace std;
    int main()
    {
        _setmode(_fileno(stdout), _O_U16TEXT);
        fputws(L"\u00e7\n", stdout);
        wcout << L"\u00e7" << endl;
        return 0;
    }//main
    This will send the UTF16LE string directly to WriteConsoleW(), unless output is redirected to a file, in which case the UTF16LE string is written as a stream of bytes via WriteFile().

    Questions, comments, corrections, omissions welcome.

    gg
