Thread: Chinese Unicode in console and getline won't work

  1. #1
    Registered User
    Join Date
    Apr 2011
    Posts
    308

    Chinese Unicode in console and getline won't work

    Here's my github:

    GitHub - jeremyduncanartificialintelligence/musical-artificial-intelligence

    When you run the program it works fine for English, but for Chinese there's one big problem and one small problem.

    The big problem is reading the input sentence with getline and, if it contains a Chinese character, writing it to file_out.txt.

    Once I get that in there, I can test whether my hard-coded Chinese characters work in my program the way English does.

    The smaller problem is that wcout might not work with both English and Chinese, but I have to get the text into file_out.txt first to run the program, and take it from there if wcout is messing up somehow.

    I tried a bunch of different things to get getline to take wchar_t Chinese Unicode, but it doesn't work. Can you help me, please?

  2. #2
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    I suggest you start with presenting us with a 10 line program and a single line data file which demonstrates the problem.

    Nobody is going to wade through a mass of code like that just to identify a single issue.

    When you understand your small test case fully, THEN you can apply that to your main program.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    Here is my test code:

    Code:
    #include <iostream>
    #include <fstream>
    #include <string>  // string, wstring
    #include <cstdlib> // exit()
    
    using namespace std;
    
    int main()
    {
        // empty files
        ofstream fout("file_out.txt");
        fout.close();
    
        // get user input.
        ofstream holdInputSentence("file_out.txt");
        if (!holdInputSentence) { cerr<<"file error\n"; exit(1); }
    
        char inputSentence[4096];
    
        cout << "Enter your sentence, end it with a period: ";
        cin.getline (inputSentence,4096);
        holdInputSentence << inputSentence;
        holdInputSentence.close();
    
        ifstream read_the_counter_file("file_out.txt");
        string read_text = "";
    
        cout <<  "\n";
    
        while(read_the_counter_file >> read_text)
        {
            wcout << wstring(read_text.begin(), read_text.end()) << " ";
        }
    
        read_the_counter_file.close();
    
        cout <<  "\n\n";
    
        // I pause the console window so you can read the screen results
        system ("pause");
    }
    Text to copy and paste into the program when testing it:

    Code:
    嗄 1 x #
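
A side note on the `wstring(read_text.begin(), read_text.end())` line in the test code above: that constructor widens each byte on its own and does not decode a multi-byte encoding. So even if the bytes of 嗄 did arrive intact, wcout would be handed several unrelated values instead of one character. A minimal sketch of the difference, assuming the bytes are UTF-8 (the helper names `widen_bytewise` and `utf8_code_point_count` are made up for illustration and are not part of the posted program):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the same conversion the test program uses.
// Every char becomes one wchar_t; nothing is decoded.
std::wstring widen_bytewise(const std::string& bytes)
{
    return std::wstring(bytes.begin(), bytes.end());
}

// Hypothetical helper: counts code points in UTF-8 text by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed input.
std::size_t utf8_code_point_count(const std::string& bytes)
{
    std::size_t count = 0;
    for (unsigned char c : bytes)
    {
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the three UTF-8 bytes of 嗄 (E5 97 84), `widen_bytewise` returns three wide characters while `utf8_code_point_count` reports one code point. This is separate from whatever the console code page does to the input before the program sees it.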

  4. #4
    Registered User
    Join Date
    May 2010
    Posts
    4,632
    Have you tried using only the wide versions of all of those streams, ie not mixing things like cin and wcin?

    Jim

  5. #5
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    Yes, I tried. With only the Chinese character, all I see in the console is a question mark from wcout and a square box in the input.

    I typed in: 嗄

    Code:
    Enter your sentence, end it with a period: ? 
    
    ?
    
    Press any key to continue . . .
    The 1 x # works fine:

    Code:
    Enter your sentence, end it with a period: 1 x #
    
    1 x #
    
    Press any key to continue . . .
    Here's with both the Chinese character and ASCII:

    Code:
    Enter your sentence, end it with a period: ? 1 x #
    
    ? 1 x #
    
    Press any key to continue . . .

  6. #6
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    I got some unicode to work:

    Code:
    #include <Windows.h>
    #include <iostream>
    #include <fstream>
    #include <cstdlib> // exit()
    
    using namespace std;
    
    int main()
    {
        SetConsoleOutputCP(1252);
        SetConsoleCP(1252);
        // empty files
        ofstream fout("file_out.txt");
        fout.close();
    
        // get user input.
        wofstream holdInputSentence("file_out.txt");
        if (!holdInputSentence) { cerr<<"file error\n"; exit(1); }
    
        wchar_t inputSentence[4096];
    
        cout << "Enter your sentence, end it with a period: ";
        wcin.getline (inputSentence,4096);
        holdInputSentence << inputSentence;
        holdInputSentence.close();
    
        ifstream read_the_counter_file("file_out.txt");
        string read_text = "";
    
        cout <<  "\n";
    
        while(read_the_counter_file >> read_text)
        {
            wcout << wstring(read_text.begin(), read_text.end()) << "\n ";
            cout << read_text;
        }
    
        read_the_counter_file.close();
    
        cout <<  "\n\n";
    
        // I pause the console window so you can read the screen results
        system ("pause");
    }
    test word:
    Code:
    ção
    program results:
    Code:
    Enter your sentence, end it with a period: ção
    
    ção
    
    Press any key to continue . . .
    And this is what is written in my file_out.txt file:
    Code:
    ção
    But it doesn't work for Chinese characters. I used wcin like you suggested, along with wofstream, but the console prints ção through cout, not wcout. wcout doesn't print anything.

    So I have a bit of an idea of what I need to do, but I want Chinese characters and still need help with that.

    Edit: my first code does the exact same thing, only simpler, so no progress was made. I tried the new code with just the Chinese character and it didn't work; I just got a question mark. wcin doesn't do anything to help me with Chinese characters.
    Last edited by jeremy duncan; 05-29-2016 at 11:30 PM.

  7. #7
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Does your terminal font even have the glyphs present to display Chinese characters?

    Does this actually work?
    Code:
    wcout << "嗄" << endl;

  8. #8
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    Quote Originally Posted by Salem View Post
    Does your terminal font even have the glyphs present to display Chinese characters?

    Does this actually work?
    Code:
    wcout << "嗄" << endl;
    This is what I get when I test that code:

    Code:
    Õ
    Process returned 0 (0x0)   execution time : 0.016 s
    Press any key to continue.
    I'm using Code::Blocks on Windows 7 64-bit SP1. I don't think I have Chinese characters in Code::Blocks. I also tested in cmd: I changed the code page with chcp 65001 and pasted the Chinese character from your test code into cmd, and it showed a square. So I don't think I have the Chinese characters on my PC for cmd and Code::Blocks to use. How do I fix that?

    Edit:
    I googled and found that I have to go to Windows Update to install a language pack. I will try that and post what happened tomorrow:
    Install language packs using Windows Update - Windows Help
    Last edited by jeremy duncan; 05-30-2016 at 03:30 AM.

  9. #9
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    I couldn't install the language packs on the English version, so I reinstalled Windows in Japanese. It's the only Asian language with Chinese-like characters (they call them kanji) that I could see when choosing an install language. And sure enough, that fixed the problem.

    Edit: the code text gets messed up when I copy and paste it. The working code is the first code I posted: not the one with wcin, but the one with cout and no wchar_t conversion, just cout printing the string.

    Here is the input
    Code:
    嗄 1 x #
    And here is the console results
    Code:
    Enter your sentence, end it with a period: 嗄 1 x #
    
    嗄 1 x #
    
    続行するには何かキーを押してください . . .
    
    Process returned 0 (0x0)   execution time : 6.212 s
    Press any key to continue.
    It writes the Chinese character to the file_out.txt file.

    Now to reinstall the English version of Windows 7 and figure out how I can have both sets of characters available at once on my English console.
    Last edited by jeremy duncan; 05-30-2016 at 06:27 AM.

  10. #10

  11. #11
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    I'm using XP Pro SP3 with all updates. I read somewhere that I needed the Pro edition of Windows 7 for this, and the only Pro edition I had was XP Pro SP3, so I installed that today.

    First I followed these instructions:
    Microsoft Corporation

    Second I installed this SW:
    Download Simplified Chinese ClearType fonts for Windows XP from Official Microsoft Download Center

    Then I figured out how to get the cmd console to accept pasted Chinese characters and display them properly.
    Step 1: go to Control Panel, Regional and Language Options, Languages tab, Details.
    Click Add.
    Input language = Chinese (Taiwan)
    Keyboard input = Chinese (Simplified) - US Keyboard. (This one lets me type in English but still get Chinese characters in the cmd console.)
    Click Apply, OK.
    Step 2: go to Control Panel, Regional and Language Options, Advanced tab.
    In the box labeled "Select a language to match the language version of the non-Unicode programs you want to use:"
    choose Chinese (Taiwan). (It's the same language you chose in the Languages tab in step 1; otherwise you won't see the Chinese character in the cmd console screen.)
    Click Apply, reboot when it says to, and then you can enter Chinese text into the console and have it work in Code::Blocks.

    Here is the code I used to read a file with Chinese characters, and write them to a different file:

    Code:
    #include <iostream>
    #include <fstream>
    #include <cstdlib> // exit()
    
    using namespace std;
    
    int main()
    {
        // empty files
        ofstream fout("file_out.txt");
        fout.close();
    
        // get user input.
        ofstream holdInputSentence("file_out.txt");
        if (!holdInputSentence) { cerr<<"file error\n"; exit(1); }
    
        ofstream fout2("file_out.txt", fstream::app);
        ifstream read_the_counter_file("cn_font.txt");
        string read_text = "";
    
        cout <<  "\n";
    
        while(read_the_counter_file >> read_text)
        {
            fout2 << read_text;
            cout << read_text << " ";
        }
    
        read_the_counter_file.close();
        fout2.close();
    
        cout <<  "\n\n";
    
    
        // I pause the console window so you can read the screen results
        system ("pause");
    }
    When I try to use wcout I get a compile error.

    I counted the characters in both the input file and the output file, and they have the same number of characters: 9867.

    But when I cout the text I get a bunch of question marks mixed in with Chinese characters, and some empty spaces in between characters.

    Here's a small bit of what it looks like:

    Code:
    ??梢韐株?隡急蝞貊?縞蝥餉盔頝
    ???憿???頧誥頧祆蝭??剛?
    ?寥???韏蝻??????隢?蝬湧
    航??X瞈舀?霂潮?隢??韏憪
    斢餉攳?頞西???蝘嗉垣?眷蝳
    衣?摰遞頦芣?擛?攳株馱?ㄚ
    霂寥盒撽粹?祈?朣箇?霂粥攳??
    ?蝻萇?蝜斤??仿鞈箇??湔?蝵芷????
    批??Y??€??
    
    Press any key to continue . . .
    So I can read the characters and write them to a file with no question marks, but with cout I get question marks, so something is going wrong with cout.
    I looked at what Codeplug wrote, but it didn't seem to help me, so maybe I'm using the code in his examples wrong.

    Here is my test font:
    Code:
    绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    And here is my program cout results:

    Code:
    嚜輻遢皝??貉?憸文?蝡蔑?撲璅????憳丹暻祥?梯?擉踵?????隅??撣?
    
    Press any key to continue . . .
    So I almost got it working. Can you help me fix the console output? I'm thinking I might be using the wrong language and need something other than Chinese (Taiwan).
    I tried a few other Chinese options but got the same result.

    Edit: my files are saved as UTF-8, and the Code::Blocks console font is the Asian character one, which I assume is Lucida Console.

    Edit: even the cout characters aren't the same ones I had in my test file, so something's wrong here.
    Last edited by jeremy duncan; 05-31-2016 at 06:50 AM.
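
A note on the leading 嚜 in the output above: that is what the three-byte UTF-8 byte order mark (EF BB BF) tends to look like when a console in code page 950 reads the bytes as Big5, so the test file was probably saved as UTF-8 with a BOM. A minimal sketch, assuming that is the case, of dropping the mark before the text is used (the function name `strip_utf8_bom` is hypothetical):

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 byte order mark
// (bytes EF BB BF) if present and returns the rest unchanged.
std::string strip_utf8_bom(const std::string& text)
{
    static const char bom[] = "\xEF\xBB\xBF";
    if (text.compare(0, 3, bom) == 0)
        return text.substr(3);
    return text;
}
```

This only removes the stray mark at the start. The rest of the garbled output has a different cause: UTF-8 bytes are being sent through cout to a console that interprets them in its own code page.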

  12. #12
    Registered User Codeplug
    Join Date
    Mar 2003
    Posts
    4,981
    The problem is with the encoding conversions that occur at the CRT and console levels.

    Text that goes into wcout is expected to be UTF16LE in Windows. This gets converted to a single-byte encoding specified by the current locale. Then when it goes to the console, it gets converted again to the console's output codepage.

    Text that goes into cout is expected to be encoded according to the locale. Then when it goes to the console, it gets converted to the console's output codepage.

    The Windows console does not support a UTF8 "codepage", so dumping a UTF8 encoded file to the console is never going to work. What you want is direct Unicode output, which in Windows is UTF16LE direct output. This can be achieved by calling WriteConsoleW(). If you are using a CRT from VS2008 or later (which codeblocks/mingw does not) then you can use the _O_U16TEXT mode mentioned in the linked post.

    gg
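
The WriteConsoleW() route described above could look roughly like the sketch below. It assumes the text read from the file is UTF-8 and converts it to UTF-16 by hand before passing it to the console. The names `utf8_to_utf16` and `write_console_utf16` are hypothetical; only GetStdHandle and WriteConsoleW are real Windows API calls, and that part is compiled on Windows only.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Malformed or truncated sequences become U+FFFD; overlong forms are
// not rejected, so treat this as a sketch rather than a validator.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0xFFFD;
        std::size_t extra = 0;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra; ++k)
        {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80)
            {
                cp = 0xFFFD; // sequence ended early
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        if (cp >= 0x10000)
        {
            cp -= 0x10000; // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        else
        {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

#ifdef _WIN32
// Windows only: hands UTF-16 text straight to the console, bypassing
// the CRT and code page conversions. Fails if stdout is redirected.
bool write_console_utf16(const std::u16string& text)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    return WriteConsoleW(out, text.data(),
                         static_cast<DWORD>(text.size()),
                         &written, nullptr) != 0;
}
#endif
```

On Windows, something like `write_console_utf16(utf8_to_utf16(line))` for each line read from the file would be the intended use. Whether the glyphs actually show still depends on the console font having them.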

  13. #13
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    My new code that prints to the screen, in VS 2010 Express C++:

    Code:
    #include <fcntl.h>
    #include <io.h>
    #include <cstdio>
    #include <cwchar>
    #include <iostream>
    #include <fstream>
    #include <locale>
    #include <cstdlib> // exit()
    
    using namespace std;
    int main()
    {
        _setmode(_fileno(stdout), _O_U16TEXT);
        
    	// empty files
        ofstream fout("file_out.txt");
        fout.close();
     
        // get user input.
        ofstream holdInputSentence("file_out.txt");
        if (!holdInputSentence) { cerr<<"file error\n"; exit(1); }
     
        char inputSentence[4096];
     
        wcout << L"Enter your sentence, end it with a period: ";
    
    	cin.getline (inputSentence,4096);
        holdInputSentence << inputSentence;
        holdInputSentence.close();
     
        wifstream read_the_counter_file("file_out.txt");
        wstring read_text;
     
        wcout <<  L"\n";
     
        while(read_the_counter_file >> read_text)
        {
            wcout << read_text << " ";
        }
     
        read_the_counter_file.close();
     
        wcout <<  L"\n\n";
     
        // I pause the console window so you can read the screen results
        system ("pause");
    }
    My output for English works when I use the raster font, but with the Asian font English comes out as different symbols, like hearts.

    Here is my output for my Chinese test text using the raster font:

    Code:
    Enter your sentence, end it with a period: 绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞
    粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    
    ?´ï?ÅÙ???³¹¹ü¼ýºs¼ÌÁ..ähº¼ák??ã^??ö¡´x?ÉSïA?»Ù¤V?¥M
    
    Press any key to continue . . .
    And using the Asian font:

    Code:
    Enter your sentence, end it with a period: 绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞
    粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    
    ?´ï?ÅÙ???³¹¹ü¼ýºs¼ÌÁ..ähº¼ák??ã^??ö¡´x?ÉSïA?»Ù¤V?¥M
    
    Press any key to continue . . .
    When I tried this code with wcin I got numbers instead of text:

    Code:
    wchar_t inputSentence[4096];
     
        cout << "Enter your sentence, end it with a period: ";
        wcin.getline (inputSentence,4096);

  14. #14
    Registered User
    Join Date
    Apr 2011
    Posts
    308
    my code:

    Code:
    #include <fcntl.h>
    #include <io.h>
    #include <cstdio>
    #include <cwchar>
    #include <iostream>
    #include <fstream>
    #include <locale>
    #include <cstdlib> // exit()
     
    using namespace std;
    int main()
    {
        // empty files
        ofstream fout("file_out.txt");
        fout.close();
      
        // get user input.
        ofstream holdInputSentence("file_out.txt");
        if (!holdInputSentence) { cerr<<"file error\n"; exit(1); }
      
        char inputSentence[4096];
      
        wcout << L"Enter your sentence, end it with a period: ";
     
        cin.getline (inputSentence,4096);
        holdInputSentence << inputSentence;
        holdInputSentence.close();
      
    	// display user input
        wifstream read_the_counter_file("file_out.txt");
        wstring read_text;
      
        wcout <<  L"\n";
      
        while(read_the_counter_file >> read_text)
        {
            wcout << read_text << " ";
        }
      
        read_the_counter_file.close();
      
        wcout <<  L"\n\ncopying file to new file then display new file contents below\n\n";
    
    	// empty files
        ofstream fout4("file_out.txt");
        fout4.close();
    
    	// copy file
    
    	wofstream fout3("file_out.txt", wfstream::app);
    	wfstream read_the_file("cn_font.txt" , ios::in);
        wstring read_text2;
     
        while(read_the_file >> read_text2)
        {
            fout3 << read_text2;
        }
     
        read_the_file.close();
        fout3.close();
    
    	// display the copy file
    
    	wifstream fin("file_out.txt");
        wstring read_text3;
     
        while(getline(fin, read_text3))
        {
            wcout << read_text3;
        }
     
        fin.close();
    
    	wcout <<  L"\n\n";
      
        // I pause the console window so you can read the screen results
        system ("pause");
    }
    my cn_font file contents:

    Code:
    绽湛栈蘸菚颤张章彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    What I copy and paste into the console when testing the program:

    Code:
    嗄 1 x #
    my program results:

    Code:
    Enter your sentence, end it with a period: 嗄 1 x #
    
    嗄 1 x #
    
    copying file to new file then display new file contents below
    
    嚜輻遢皝猹?═撘蝡敶啁?瞍單??煾??擗阡?蝎駁?阮?鍫隞厰?瘨券?銝?隞
    
    
    Press any key to continue . . .
    What's in file_out.txt after the program runs:

    Code:
    绽湛蘸菚颤彰璋漳樟蟑鄣獐嫜餦麞粻鏱蔁騿掌鐣仉鞝涨障丈帐仗
    I am using Visual Studio Express C++ 2010, and the console is using the Asian font.
    In the console options tab it says I'm using codepage 950, "Traditional Chinese Big5".

    Of note: the Chinese characters I'm using are simplified, not traditional, so some characters aren't going to display for that reason.

    The reason I was getting weird characters was that I was saving the files in the two "Unicode" encodings instead of UTF-8.
    When the files were Unicode or Unicode big endian, the text read from the file looked like weird symbols that aren't even letters, just gibberish.
    And Unicode big endian shows different gibberish than Unicode does; only UTF-8 for both files shows Asian characters.

    I tried the _setmode(_fileno(stdout), _O_U16TEXT); code and a bunch of other code from all over the web. Then I tried without it, was fortunate enough to be using UTF-8 when I did, and saw Asian characters.
    I think that once I installed the Asian characters in the Control Panel regional settings, I didn't need _setmode(_fileno(stdout), _O_U16TEXT); or anything like it to read the Asian characters, because with nothing there the VS console said the codepage was 950.

    Now I have to figure out how to change the codepage to 936, which is the Simplified Chinese codepage.
    I think MS made codepage 950 for XP and didn't make a simplified version, so I will have to get Windows 7 Pro and try to get codepage 936 there somehow.

    So I guess my problem is solved. First I had to get the Chinese fonts, then I had to change the language in the regional settings, then I had to use the correct text file encoding, UTF-8.
    Now I have to get the proper codepage, which means I have to buy some software, so I guess my problem is solved for now.

    Thanks Codeplug and Salem and jimblumberg.
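
On changing the code page: the console's default code page appears to follow the "language for non-Unicode programs" setting, so choosing Chinese (PRC) there instead of Chinese (Taiwan) would be expected to give 936 rather than 950. A program can also request a code page for its own console with SetConsoleCP and SetConsoleOutputCP, the same calls used with 1252 earlier in the thread. A minimal sketch, assuming the requested code page is installed on the machine (both function names below are hypothetical):

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: human-readable names for the code pages that
// come up in this thread.
std::string code_page_name(unsigned code_page)
{
    switch (code_page)
    {
    case 936:   return "Simplified Chinese (GBK)";
    case 950:   return "Traditional Chinese (Big5)";
    case 1252:  return "Western European (Windows-1252)";
    case 65001: return "UTF-8";
    default:    return "unknown";
    }
}

// Hypothetical helper: asks Windows to use one code page for both
// console input and output. Returns false if either call fails, and
// always returns false on other platforms, where it does nothing.
bool set_console_code_pages(unsigned code_page)
{
#ifdef _WIN32
    return SetConsoleCP(code_page) != 0 &&
           SetConsoleOutputCP(code_page) != 0;
#else
    (void)code_page;
    return false;
#endif
}
```

Calling `set_console_code_pages(936)` at the top of main and checking the return value would show whether the system accepts the code page. Note that files saved as UTF-8 would still not match a 936 console; they would have to be saved in that code page or converted first.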
