Thread: Chinese Unicode in console and getline won't work

  1. #16
    Registered User
    Join Date
    Apr 2011
    Posts
    308
I checked the text the console printed from the output file and compared it against the characters in the cn_font file, and they were wrong again.

So I tried a bunch of different things and found that, since I'm using codepage 950 with the regional language set to Chinese (Taiwan) and the keyboard now set to Chinese Traditional (before, the keyboard was set to Chinese Simplified), I could save ANSI files containing Chinese characters.

The editor warned me that some characters wouldn't be saved correctly, so after saving, closing, and reopening I expected to see gibberish, but instead I saw a lot of the Chinese characters; not all of them, but a lot of them. Remember, some of my Chinese characters are Simplified.

So I saved both files used by my code as ANSI, ran the program, and the Chinese characters from the file appeared in the console. No bad characters.

Also, since my code is in codepage 950 (and when I had a Japanese Windows 7 I could just use cout), I tried it without wcout and it all worked just like English code. No fancy Unicode, UTF-8 files, or wide-character code: regular English-style code with ANSI files works the same in English territories and in Asian territories like Japan.

Here is my new code that works:

    Code:
    #include <iostream>
    #include <fstream>
    #include <string>  // std::string, std::getline
    #include <locale>
    #include <cstdlib> // exit()
      
    using namespace std;
    int main()
    {
        // empty files
        ofstream fout("file_out.txt");
        fout.close();
       
        // get user input.
        ofstream holdInputSentence("file_out.txt");
        if (!holdInputSentence) { cerr<<"file error\n"; exit(1); }
       
        char inputSentence[4096];
       
        cout << "Enter your sentence, end it with a period: ";
      
        cin.getline(inputSentence, 4096);
        holdInputSentence << inputSentence;
        holdInputSentence.close();
       
        // display user input
        ifstream read_the_counter_file("file_out.txt");
        string read_text;
       
        cout <<  "\n";
       
        while(read_the_counter_file >> read_text)
        {
            cout << read_text << " ";
        }
       
        read_the_counter_file.close();
       
        cout <<  "\n\ncopying file to new file then display new file contents below\n\n";
     
        // empty files
        ofstream fout4("file_out.txt");
        fout4.close();
     
        // copy file
     
        ofstream fout3("file_out.txt", fstream::app);
        fstream read_the_file("cn_font.txt" , ios::in);
        string read_text2;
      
        while(read_the_file >> read_text2)
        {
            fout3 << read_text2;
        }
      
        read_the_file.close();
        fout3.close();
     
        // display the copy file
     
        ifstream fin("file_out.txt");
        string read_text3;
      
        while(getline(fin, read_text3))
        {
            cout << read_text3;
        }
      
        fin.close();
     
        cout <<  "\n\n";
       
        // I pause the console window so you can read the screen results
        system ("pause");
    }
And here are my test results; the bad results (the UTF-8 file) are on top, the good results (ANSI) are on the bottom:

    Code:
    ---------------------
    utf-8 file below
    ---------------------
    
    Enter your sentence, end it with a period: 
    绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    
    ?湛?蘸???章彰璋漳樟蟑鄣獐嫜??粻??騿掌?仉鞝?障丈?仗
    
    copying file to new file then display new file contents below
    
    嚜輻遢皝??貉?憸文?蝡蔑?撲璅????憳丹暻祥?梯?擉踵?????隅??撣?
    
    Press any key to continue . . .
    
    ----------------------
    ansi file below
    ----------------------
    Enter your sentence, end it with a period: 绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞
    粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    
    ?湛?蘸???章彰璋漳樟蟑鄣獐嫜??粻??騿掌?仉鞝?障丈?仗
    
    copying file to new file then display new file contents below
    
    ?湛?蘸???章彰璋漳樟蟑鄣獐嫜??粻??騿掌?仉鞝?障丈?仗
    
    Press any key to continue . . .
    Now I just have to use the correct traditional Chinese characters and there will be no errors like question marks in the console.

    So it's all good now.

  2. #17
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
If you really want to write "normal" code that works in any locale, then you just have to set the locale to the user's locale. By default, the standard library starts up with the "C" locale, which only covers the standard ASCII characters (0x01-0x7F).

    If you want to use Unicode, then NO codepage should ever be involved anywhere. If you know your file is UTF8 encoded, then you read in those UTF8 bytes and convert them to UTF16 via MultiByteToWideChar(). You can then send that to the console via WriteConsoleW(). No codepage is ever involved.

    gg

  3. #18
    Registered User
    Join Date
    Apr 2011
    Posts
    308
Here is my VS 2010 Express C++ code. My codepage isn't changed; my language is English and my region is the US.

    Code:
    #include <fcntl.h>
    #include <io.h>
    #include <iostream>
       
    using namespace std;
    int main()
    {
        _setmode(_fileno(stdout), _O_U16TEXT);

        wcout << L"绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗\n\n";
    
        // I pause the console window so you can read the screen results
        system ("pause");
    }
    This is the console results:

    Code:
    绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    
    Press any key to continue . . .
If you compare the characters, they are the same, so these are good results.

But I can't open the file with the Chinese characters, put them in either a string or a wstring, and display them properly unless I 1) set the codepage using the method I described (regional settings in Control Panel), and 2) use ANSI text. And if I save ANSI text while my language and region are US English, the file is all question marks when I close and reopen it, so the codepage needs to be 950 (Asian) for the ANSI file to hold the characters instead of question marks.

I tried a bunch of different approaches with my US region and English language settings to read the Chinese characters in the UTF-8 file and could not do it. So if I'm going to read the file with Chinese characters, I have to change my region and language to Chinese (Taiwan) with a US keyboard.

Maybe there's a better way where I don't have to change my codepage and can still read the Chinese characters from the UTF-8 file and cout them correctly, but I don't know how.

    Here's how to change the codepage to 950 in Windows 7:
    control panel >> clock, language, region >> change keyboard or other input methods >> keyboard and languages >> add >> Chinese traditional Taiwan, US keyboard >> default input language, Chinese Traditional Taiwan >> OK >> administrative >> change system locale >> Chinese traditional Taiwan >> reboot.

edit:
I just changed my codepage to 950 (Chinese Taiwan) and ran this code in VS 2010 against an ANSI text file:

    Code:
    #include <iostream>
    #include <fstream>
    #include <string>  // std::string, std::getline
    #include <locale>
    #include <cstdlib> // exit()
       
    using namespace std;
    int main()
    {
        // display the copy file
      
        ifstream fin("cn_font.txt");
        string read_text3;
       
        while(getline(fin, read_text3))
        {
            cout << read_text3 << endl;
        }
       
        fin.close();
    
        // I pause the console window so you can read the screen results
        system ("pause");
    }
    my console output when running the program:

    Code:
    湛蘸章彰璋漳樟蟑鄣獐嫜粻騿掌仉鞝障丈仗
    Press any key to continue . . .
It's the same text as in the file, so it's good.

    That's all I can do right now. What do you think Codeplug?

edit: I just tested the file-reading code I posted above in Code::Blocks 16.01 and it works fine.
    Last edited by jeremy duncan; 06-02-2016 at 09:22 PM.

  4. #19
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Here is how you could do it while avoiding any conversions through a codepage:

    Code:
    #include <SDKDDKVer.h>
    #include <Windows.h>
    
    #include <sstream>
    #include <fstream>
    #include <vector>
    #include <algorithm>
    #include <iostream>
    
    #include <fcntl.h>
    #include <io.h>
    
    //------------------------------------------------------------------------------
    
    // NOTE - This source file must be saved as Unicode with BOM!
    const wchar_t UTF16_MSG[] = L"绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗";
    
    //------------------------------------------------------------------------------
    
    bool read_utf8_file(const char *pathname, 
                        std::wstring &contents,
                        bool b_rn2n = true) // convert "\r\n"->"\n"
    {
        std::ifstream fin(pathname, std::ios::binary);
        if (!fin)
            return false;
    
        BYTE bom[4];
        bool bUTF8BOM = false;
        if (!fin.read((char*)bom, 4))
        {
            // if we hit eof then file is less than 4 bytes
            if (!fin.eof())
                return false;
            fin.clear();
        }//if
        else
        {
            bUTF8BOM = (bom[0] == 0xEF) && (bom[1] == 0xBB) && (bom[2] == 0xBF);
            // TODO - check for non-UTF8 BOM's
            // UTF8 BOM is 3 bytes, go back 1 byte if needed
            if (bUTF8BOM)
                fin.seekg(-1, std::ios::cur);
        }//else
    
        if (!bUTF8BOM)
            fin.seekg(0, std::ios::beg); // setup to re-read first 3 bytes
    
        std::stringstream ss;
        if (!(ss << fin.rdbuf())) // note: will fail if file is empty
            return false;
        fin.close(); // release resources
    
        const std::string &utf8 = ss.str();
        ss.str(""); // release resources
    
        // NOTE: we are assuming that the number of bytes in a UTF8 string is 
        //       always >= to the number of wchar_t's required to represent that 
        //       string in UTF16LE - which should hold true
        size_t len = utf8.length();
        std::vector<wchar_t> wbuff(len);
        int wlen = MultiByteToWideChar(CP_UTF8, 0,
                                       utf8.c_str(), len, 
                                       &wbuff[0], len);
        if (!wlen)
            return false;
        wbuff.resize(wlen); // wlen may be < len
    
        if (b_rn2n)
            contents.assign(wbuff.begin(), 
                            std::remove(wbuff.begin(), wbuff.end(), '\r'));
        else
            contents.assign(wbuff.begin(), wbuff.end());
        return true;
    }//read_utf8_file
    
    //------------------------------------------------------------------------------
    
    std::wstring str_to_wstr(const std::string &str, UINT cp = CP_ACP)
    {
        int len = MultiByteToWideChar(cp, 0, str.c_str(), str.length(), 0, 0);
        if (!len)
            return L"ErrorA2W";
    
        std::vector<wchar_t> wbuff(len + 1);
        // NOTE: this does not NULL terminate the string in wbuff, but this is ok
        //       since it was zero-initialized in the vector constructor
        if (!MultiByteToWideChar(cp, 0, str.c_str(), str.length(), &wbuff[0], len))
            return L"ErrorA2W";
    
        return &wbuff[0];
    }//str_to_wstr
    
    //------------------------------------------------------------------------------
    
    std::string wstr_to_str(const std::wstring &wstr, UINT cp = CP_ACP)
    {
        int len = WideCharToMultiByte(cp, 0, wstr.c_str(), wstr.length(),
            0, 0, 0, 0);
        if (!len)
            return "ErrorW2A";
    
        std::vector<char> abuff(len + 1);
    
        // NOTE: this does not NULL terminate the string in abuff, but this is ok
        //       since it was zero-initialized in the vector constructor
        if (!WideCharToMultiByte(cp, 0, wstr.c_str(), wstr.length(),
            &abuff[0], len, 0, 0))
        {
            return "ErrorW2A";
        }//if
    
        return &abuff[0];
    }//wstr_to_str
    
    //------------------------------------------------------------------------------
    
    using namespace std;
    
    int main()
    {
        _setmode(_fileno(stdout), _O_U16TEXT);
        _setmode(_fileno(stdin), _O_U16TEXT);
    
        ofstream fout("file_out.txt", ios::binary | ios::out);
        if (!fout)
            return 1;
    
        // write the UTF8 BOM
        if (!fout.write("\xEF\xBB\xBF", 3))
            return 1;
    
        // convert our message to UTF8
        string utf8_msg = wstr_to_str(UTF16_MSG, CP_UTF8);
    
        if (!fout.write(utf8_msg.c_str(), utf8_msg.length()))
            return 1;
    
        fout.close();
    
        wstring file_msg;
        if (!read_utf8_file("file_out.txt", file_msg))
            return 1;
    
        wcout << file_msg << endl;
    
        wcout << L"Enter Msg:";
        
        wstring con_msg;
        getline(wcin, con_msg);
    
        wcout << L"You entered: " << con_msg << endl;
        return 0;
    }//main
    Output:
    Code:
    绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    Enter Msg:绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    You entered: 绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    gg

  5. #20
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    Quote Originally Posted by Salem View Post
    Does your terminal font even have the glyphs present to display Chinese characters?

    Does this actually work?
    Code:
    wcout << "嗄" << endl;
Codeplug, that code won't display Chinese (codepage 950) characters if my computer isn't set to the Chinese (Taiwan) keyboard and region, as I showed before in the Control Panel steps.

Once I set my keyboard and region to codepage 950, your program works: I can read from the UTF-8 text file, which lets me use more characters than I have now, which is nice. Thank you; now I can use UTF-8 instead of ANSI.

And your program doesn't print "You entered"; it just keeps taking text after I press Enter.

edit: I wasn't happy that I couldn't see the "You entered" message, so I tweaked the code a bit and now I see it, but the Chinese characters act like they did before, where some couldn't be seen.

Here is the code I'm testing:

    Code:
    #include <SDKDDKVer.h>
    #include <Windows.h>
    #include <cstdlib> // exit()
    #include <sstream>
    #include <fstream>
    #include <vector>
    #include <algorithm>
    #include <iostream>
     
    #include <fcntl.h>
    #include <io.h>
     
    //------------------------------------------------------------------------------
     
    // NOTE - This source file must be saved as Unicode with BOM!
    const wchar_t UTF16_MSG[] = L"绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗";
     
    //------------------------------------------------------------------------------
     
    bool read_utf8_file(const char *pathname, 
                        std::wstring &contents,
                        bool b_rn2n = true) // convert "\r\n"->"\n"
    {
        std::ifstream fin(pathname, std::ios::binary);
        if (!fin)
            return false;
     
        BYTE bom[4];
        bool bUTF8BOM = false;
        if (!fin.read((char*)bom, 4))
        {
            // if we hit eof then file is less than 4 bytes
            if (!fin.eof())
                return false;
            fin.clear();
        }//if
        else
        {
            bUTF8BOM = (bom[0] == 0xEF) && (bom[1] == 0xBB) && (bom[2] == 0xBF);
            // TODO - check for non-UTF8 BOM's
            // UTF8 BOM is 3 bytes, go back 1 byte if needed
            if (bUTF8BOM)
                fin.seekg(-1, std::ios::cur);
        }//else
     
        if (!bUTF8BOM)
            fin.seekg(0, std::ios::beg); // setup to re-read first 3 bytes
     
        std::stringstream ss;
        if (!(ss << fin.rdbuf())) // note: will fail if file is empty
            return false;
        fin.close(); // release resources
     
        const std::string &utf8 = ss.str();
        ss.str(""); // release resources
     
        // NOTE: we are assuming that the number of bytes in a UTF8 string is 
        //       always >= to the number of wchar_t's required to represent that 
        //       string in UTF16LE - which should hold true
        size_t len = utf8.length();
        std::vector<wchar_t> wbuff(len);
        int wlen = MultiByteToWideChar(CP_UTF8, 0,
                                       utf8.c_str(), len, 
                                       &wbuff[0], len);
        if (!wlen)
            return false;
        wbuff.resize(wlen); // wlen may be < len
     
        if (b_rn2n)
            contents.assign(wbuff.begin(), 
                            std::remove(wbuff.begin(), wbuff.end(), '\r'));
        else
            contents.assign(wbuff.begin(), wbuff.end());
        return true;
    }//read_utf8_file
     
    //------------------------------------------------------------------------------
     
    std::wstring str_to_wstr(const std::string &str, UINT cp = CP_ACP)
    {
        int len = MultiByteToWideChar(cp, 0, str.c_str(), str.length(), 0, 0);
        if (!len)
            return L"ErrorA2W";
     
        std::vector<wchar_t> wbuff(len + 1);
        // NOTE: this does not NULL terminate the string in wbuff, but this is ok
        //       since it was zero-initialized in the vector constructor
        if (!MultiByteToWideChar(cp, 0, str.c_str(), str.length(), &wbuff[0], len))
            return L"ErrorA2W";
     
        return &wbuff[0];
    }//str_to_wstr
     
    //------------------------------------------------------------------------------
     
    std::string wstr_to_str(const std::wstring &wstr, UINT cp = CP_ACP)
    {
        int len = WideCharToMultiByte(cp, 0, wstr.c_str(), wstr.length(),
            0, 0, 0, 0);
        if (!len)
            return "ErrorW2A";
     
        std::vector<char> abuff(len + 1);
     
        // NOTE: this does not NULL terminate the string in abuff, but this is ok
        //       since it was zero-initialized in the vector constructor
        if (!WideCharToMultiByte(cp, 0, wstr.c_str(), wstr.length(),
            &abuff[0], len, 0, 0))
        {
            return "ErrorW2A";
        }//if
     
        return &abuff[0];
    }//wstr_to_str
     
    //------------------------------------------------------------------------------
     
    using namespace std;
     
    int main()
    {
        _setmode(_fileno(stdout), _O_U16TEXT);
        _setmode(_fileno(stdin), _O_U16TEXT);
     
        ofstream fout("file_out.txt", ios::binary | ios::out);
        if (!fout)
            return 1;
     
        // write the UTF8 BOM
        if (!fout.write("\xEF\xBB\xBF", 3))
            return 1;
     
        // convert our message to UTF8
        string utf8_msg = wstr_to_str(UTF16_MSG, CP_UTF8);
     
        if (!fout.write(utf8_msg.c_str(), utf8_msg.length()))
            return 1;
     
        fout.close();
     
        wstring file_msg;
        if (!read_utf8_file("file_out.txt", file_msg))
            return 1;
     
        wcout << file_msg << endl;
     
        wcout << L"Enter Msg:";
         
        wstring con_msg;
        getline(wcin, con_msg);
     
        wcout << con_msg << endl;
        system ("pause");
    }//main
    Code:
    // test copy and paste Chinese characters below
    // 璋漳樟蟑鄣獐
    console test results:

    Code:
    绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    Enter Msg:璋漳樟蟑鄣獐.
    ﶼ玺첼귁棤벺മ
    Press any key to continue . . .
edit: In VS 2010 Express, when I went to save the .cpp file, it said I had to save it as Unicode and gave me a list to choose from. I saved main.cpp as "Unicode (UTF-8 with signature) - Codepage 65001". I couldn't see a "Unicode with BOM" option in the list, but a Google search suggests codepage 65001 "with signature" does include the BOM.
    Last edited by jeremy duncan; 06-05-2016 at 07:43 PM.

  6. #21
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >>... if my computer isn't set to use the keyboard and region of China Taiwan ...
    Yes. I installed the Chinese Simplified language pack and changed my system locale to Chinese Simplified. Then the console worked as expected. I'm running Win 10 x64 and Visual Studio 15 Preview 2.

    Changing my locale back to English was a pain in the butt (since everything was in Chinese). So I verified the new input message with this code instead:
    Code:
        wstring::const_iterator it = con_msg.begin(), it_end = con_msg.end();
        for (; it != it_end; ++it)
            wcout << L"0x" << hex << (int)*it << L' ';
        wcout << endl;
    Entering "璋漳樟蟑鄣獐" gave this output:
    Code:
    0x748b 0x6f33 0x6a1f 0x87d1 0x9123 0x7350
    I verified that the first 2 codepoints do match the input glyphs:
    Unicode Han Character 'jade plaything; jade ornament' (U+748B)
    Unicode Han Character 'name of a river in Henan' (U+6F33)

    All you can do is ensure you are sending the correct UTF16 codepoints to the OS. You can also rule out any bug that may be in the CRT by calling WriteConsoleW() directly to see if you get a different result than wcout<<.

    gg
