Thread: Help on getting wregex to work properly with Unicode strings

  1. #1
    بابلی ریکا Masterx's Avatar
    Join Date
    Nov 2007
    Location
    Somewhere nearby,Who Cares?
    Posts
    497

    Help on getting wregex to work properly with Unicode strings

    Hello all.
    I am trying to read the contents of a file and build a monogram (unigram) list from it.
    For ASCII and other non-Unicode strings it works fine, but for Unicode strings there are a couple of issues I couldn't solve yet.
    First of all, this is my code for tokenizing the file:

    Code:
    #include "stdafx.h"
    #include <iostream>
    #include <regex>
    #include <fstream>
    #include <string>
    #include <map>
    #include <locale>
    using namespace std;
    
    
    int main()
    {
        string path = "";    
        
        map<wstring, int> container;
        wifstream file("ftest.txt"); 
    
        wregex reg(L"\\w+");
        wstring s = L"";
        while (file.good())
        {
            file>>s;
            for ( wsregex_iterator it (s.begin(), s.end(), reg), it_end ; it != it_end ; ++it)
            {
                container[(wstring)(*it)[0]]++ ;
            }
        }
    
        cout <<"\nDone..."<< endl;
        wofstream output("list.txt",ios::app);
        for (auto item : container)
        {
            //wcout<<item.first <<" : "<<item.second <<endl;
            //write output to list.txt
            output<<item.first <<" : "<<item.second <<endl;
        }
    
        system("pause");
        return 0;
    }


    This code produces garbage output! I am trying to store the tokens in a map and write them to a file through a wofstream so that the values stay intact and correct.
    After running the application, the file containing the extracted tokens contains just garbage!

    This is the sample input (ftest.txt):
    بسم الله الرحمن الرحیم
    واشنگتن پست طی گزارشی اعلام کرد کنگره آمریکا برخلاف رویه سابق، ارسال مصوبه سالانه خود در زمینه تحریم های ایران به کاخ سفید را به تاخیر انداخت و به نظر می رسد انتخاب حسن روحانی به عنوان رئیس جمهوری جدید ایران علت این امر بوده است.
    0 0 0 نظر
    [-] اندازه متن [+]


    به دنبال انتخاب حسن روحانی به عنوان رئیس جمهوری جدید ایران، کنگره آمریکا بر اساس برخی ملاحظات ارسال مصوبه سالانه خود در زمینه تحریم های ایران به کاخ سفید را به تاخیر انداخت
    .


    And this is the resulting output (list.txt):
    Code:
    0 : 3
    1 : 1
    14 : 1
    16 : 1
    26 : 1
    27 : 1
    5 : 2
    50 : 1
    6 : 1
    7 : 1
    ط : 475
    طھ : 12
    طھط : 20
    طھطµظ : 1
    طھظ : 10
    طھغ : 2
    ط² : 6
    ط²ط : 6
    ط²ظ : 6
    ط³ : 5
    ط³ط : 12
    ط³طھ : 8
    ط³طھط : 4
    ط³طھظ : 2
    ط³ظ : 10
    ط³غ : 1
    طµ : 1
    طµط : 1
    طµظ : 6
    ط¹ط : 1
    ط¹ظ : 8
    ظ : 291
    ع : 54
    غ : 95
    ï : 1
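
    A possible explanation for the output above, offered as a sketch rather than a confirmed diagnosis: if the wifstream applies no UTF-8 conversion (assumed here to be the case for the default "C" locale), every byte of the file becomes its own wchar_t, so a three-letter Persian word arrives as six unrelated characters. \w+ then matches only those byte values that happen to be classified as letters. The lone "ï" is consistent with the first byte (0xEF) of a UTF-8 byte order mark, and "ط" and "ظ" are how the lead bytes 0xD8 and 0xD9 display in Windows-1256. The helper names below are hypothetical and only illustrate the byte-versus-code-point mismatch.

    Code:
```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: mimics a wide stream that performs no UTF-8
// conversion and turns every input byte into one wchar_t
// (assumed to be what the default "C" locale does).
std::wstring widen_bytewise(const std::string& bytes)
{
    std::wstring out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<wchar_t>(b));
    return out;
}

// Hypothetical helper: counts code points in UTF-8 text by skipping
// continuation bytes, i.e. bytes of the form 10xxxxxx.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if ((b & 0xC0) != 0x80)
            ++n;
    return n;
}
```

    Feeding the UTF-8 bytes of the first word of ftest.txt ("\xD8\xA8\xD8\xB3\xD9\x85") to both helpers shows the mismatch: the byte-wise version returns more characters than the text has code points.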
    Highlight Your Codes
    The Boost C++ Libraries (online Reference)

    "...a computer is a stupid machine with the ability to do incredibly smart things, while computer programmers are smart people with the ability to do incredibly stupid things. They are,in short, a perfect match.."
    Bill Bryson


  2. #2
    بابلی ریکا Masterx's Avatar
    Join Date
    Nov 2007
    Location
    Somewhere nearby,Who Cares?
    Posts
    497
    OK, thank God, it's solved. Here is the final code, which runs just fine on UTF-8 content such as Farsi:
    Code:
    #include "stdafx.h"
    #include <iostream>
    #include <regex>
    #include <fstream>
    #include <string>
    #include <map>
    #include <cstdio>  // for _wfopen_s, _fileno, fclose
    #include <fcntl.h> // for _O_U8TEXT
    #include <io.h>    // for _setmode
    
    
    using namespace std;
    
    int main()
    {
        string path = "";    
    
        map<wstring, int> container;
    
        FILE* fp;
        _wfopen_s (&fp, L"ftest.txt", L"r");
        _setmode (_fileno (fp), _O_U8TEXT);
    
        wifstream file(fp);
        wregex reg(L"\\w+");
    
        wstring s = L"";
    
        while (getline(file, s)) // test the read itself, so a failed read is never processed
        {
            for ( wsregex_iterator it (s.begin(), s.end(), reg), it_end ; it != it_end ; ++it)
            {
                container[(wstring)(*it)[0]]++ ;
            }
        }
    
        cout <<"\nDone..."<< endl;
    
        fclose(fp);
    
        _wfopen_s (&fp, L"list.txt", L"w");
        _setmode (_fileno (fp), _O_U8TEXT);
        wofstream output(fp);
    
        for (auto item : container)
        {
            wcout<<item.first <<" : "<<item.second <<endl;
            //write output to list.txt
            output<<item.first <<" : "<<item.second <<endl;
        }
        fclose(fp);
        system("pause");
        return 0;
    }
    For a portable solution, check this link out.
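
    The linked page is not preserved here, so the following is not necessarily what it described. It is only a minimal sketch of one portable approach: read the file as plain bytes (for example with std::ifstream), decode the UTF-8 by hand, and run wregex on the decoded text. The names decode_utf8 and count_tokens are hypothetical. The decoder assumes well-formed input limited to 1 to 3 byte sequences, which covers Persian, and it does not strip a byte order mark. \S+ is swapped in for \w+ because what \w matches for non-ASCII letters depends on the active locale.

    Code:
```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: decodes UTF-8 into one wchar_t per code point.
// Assumes well-formed input with 1 to 3 byte sequences only; malformed
// or 4-byte sequences are not handled in this sketch.
std::wstring decode_utf8(const std::string& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        unsigned long cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; } // 0xxxxxxx
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; } // 110xxxxx
        else               { cp = b & 0x0F; len = 3; } // 1110xxxx
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Hypothetical helper: counts whitespace-separated tokens.
std::map<std::wstring, int> count_tokens(const std::wstring& text)
{
    // Runs of non-whitespace; punctuation stays attached to words.
    static const std::wregex token(L"\\S+");
    std::map<std::wstring, int> counts;
    for (std::wsregex_iterator it(text.begin(), text.end(), token), end;
         it != end; ++it)
        ++counts[it->str()];
    return counts;
}
```

    Writing the result back portably would need the reverse step (encoding each wchar_t to UTF-8 bytes and writing through a plain std::ofstream), which is omitted here.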

