Thread: Fstream.read unicode file

  1. #1
    Registered User
    Join Date
    Dec 2007
    Posts
    932

    Fstream.read unicode file

    1. If I open a Unicode file with Notepad, I get: "Package Requires Reboot = no".

    But if I read it with fstream, I get: "P a c k a g e R e q u i r e s R e b o o t = n o".

    I found the reason and a solution for it in this article, but I'm wondering whether that is the right solution.

    Code:
    wstring ReadUTF16(const string & filename)
    {
        ifstream file(filename.c_str(), std::ios::binary);
        stringstream ss;
        ss << file.rdbuf() << '\0';
        return wstring((wchar_t *)ss.str().c_str());
    }
    http://cfc.kizzx2.com/index.php/read...-in-windows-c/

    Wouldn't this solution be faster because the '<<' operator copies byte by byte?
    Code:
    wstring ReadUTF16(char * filename)
    {
        ifstream ifs(filename, std::ios::binary);
        ifs.seekg(0, ifs.end);
        std::streampos length = ifs.tellg();
        ifs.seekg(0, ifs.beg);
      
        std::vector<char> cVec(length);
        ifs.read(&cVec[0],length);
        return wstring((wchar_t *)&cVec[0]);
    }

    2. And if I want to write a function that reads both ANSI and Unicode, should I check the first two bytes to determine the encoding first?
    Last edited by Ducky; 03-16-2015 at 03:45 AM.
    Using Windows 10 with Code Blocks and MingW.

  2. #2
    Registered User
    Join Date
    May 2014
    Posts
    121
    If the text file contains UTF-16 code units, then the correct solution is to use functions that read wchar_t data (wchar_t is 16 bits on Windows).

    You can use wifstream instead of ifstream.

  3. #3
    Registered User
    Join Date
    Dec 2007
    Posts
    932
    So I determine first the encoding of the file and then I open it up with fstream or wfstream accordingly?

    And then I use wcstombs(str, wstr, sizeof(str)); to translate it to ANSI?

  4. #4
    Registered User
    Join Date
    May 2014
    Posts
    121
    Quote Originally Posted by Ducky View Post
    So I determine first the encoding of the file and then I open it up with fstream or wfstream accordingly?
    Determining the encoding of a file is not easy. There are functions in Windows that can allow you to detect the encoding/code page through statistical analysis of the text file but in general you shouldn't try to detect that at all. Most programs simply go with whatever encoding/code page that is the default for the system if the encoding is unknown and then allow the user to manually reload the file with the correct encoding if needed.

    Quote Originally Posted by Ducky View Post
    And then I use wcstombs(str, wstr, sizeof(str)); to translate it to ANSI?
    If the text file contains Unicode characters outside of the ASCII range, then it doesn't make sense to convert the text to ASCII, since those characters can't be represented as ASCII. Use WideCharToMultiByte and specify UTF-8 as the CodePage argument if you want the text as a sequence of 8-bit code units. UTF-8 is compatible with ASCII, so if you know that your text file only contains characters from the ASCII set, you can treat the UTF-8 string you get from WideCharToMultiByte as ASCII.
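
    As a portable illustration of the UTF-16 to UTF-8 step described above, here is a minimal sketch that does the conversion by hand. It is not WideCharToMultiByte and makes no claim to match its behavior exactly; the function name utf16_to_utf8 and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>

// Convert UTF-16 code units to UTF-8.
// Assumption for this sketch: unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // stray low surrogate
        }

        if (cp < 0x80) {
            // ASCII range: one byte, identical to the ASCII value.
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

    For text that is pure ASCII, such as the log line in the first post, the result is byte-for-byte the ASCII string, so it can be searched with ordinary std::string functions.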

  5. #5
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by Ducky View Post
    1. If I open a Unicode file with Notepad, I get: "Package Requires Reboot = no".
    If the file is in UTF-16, each character takes up at least two bytes. Every character in the ASCII set has the same value in Unicode, so in little-endian UTF-16 its first byte is the ASCII value and its second byte is 0. Those zero bytes are what you see as gaps between the letters when you read the file one byte at a time.
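
    To see those zero bytes concretely, here is a small sketch (the helper names are made up for this example) that encodes an ASCII string as little-endian UTF-16 bytes and then renders the bytes the way a byte-oriented reader might, with each zero byte shown as a space.

```cpp
#include <string>
#include <vector>

// Encode an ASCII-only string as UTF-16LE bytes (no BOM).
// Hypothetical helper for illustration; input is assumed to be pure ASCII.
std::vector<unsigned char> ascii_to_utf16le(const std::string& ascii)
{
    std::vector<unsigned char> bytes;
    for (unsigned char c : ascii) {
        bytes.push_back(c);    // low byte: the ASCII value
        bytes.push_back(0x00); // high byte: always zero for ASCII
    }
    return bytes;
}

// Render bytes as text, showing each zero byte as a space.
std::string as_byte_text(const std::vector<unsigned char>& bytes)
{
    std::string text;
    for (unsigned char b : bytes)
        text += (b == 0) ? ' ' : static_cast<char>(b);
    return text;
}
```

    Calling as_byte_text(ascii_to_utf16le("no")) gives "n o ", which is the same spaced-out pattern as in the first post.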

    >>char * filename
    Don't use char* for string literals! Use std::string.

    Quote Originally Posted by Ducky View Post
    2. And if I want to write a function that reads both ANSI and Unicode, should I check the first two bytes to determine the encoding first?
    Quote Originally Posted by Ducky View Post
    So I determine first the encoding of the file and then I open it up with fstream or wfstream accordingly?
    Not all files carry encoding information. Files written on Linux are very often UTF-8 without any byte order mark indicating that they are Unicode. Indeed, this is actually the recommended way, so determining the encoding is not always easy.
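
    For the files that do start with a byte order mark, checking the first bytes can be sketched as below. The Bom enum and detect_bom function are hypothetical names for this example. A result of None does not mean the file is ANSI; it only means no mark was found, which is the common case for UTF-8 files.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file's contents for a byte order mark.
// Limitation of this sketch: a UTF-32LE mark (FF FE 00 00) would be
// reported as Utf16LE, because UTF-32 is not handled here.
Bom detect_bom(const std::string& bytes)
{
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```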

    Quote Originally Posted by Ducky View Post
    And then I use wcstombs(str, wstr, sizeof(str)); to translate it to ANSI?
    Never use ANSI. Use UTF-8 internally and store it in a std::string. I've got code to convert between ANSI <--> Unicode, as well as UTF8 <--> UTF16. I can share it if you're interested.

  6. #6
    Registered User
    Join Date
    Dec 2007
    Posts
    932
    Quote Originally Posted by Elysia View Post
    Never use ANSI. Use UTF-8 internally and store it in a std::string. I've got code to convert between ANSI <--> Unicode, as well as UTF8 <--> UTF16. I can share it if you're interested.
    Yes please, I would be interested, because with the CP_UTF8 parameter I get some weird characters, and with CP_ACP it won't get translated at all; it stays Unicode.

    Code:
            wchar_t * wbuf = new wchar_t [length];
            ifs.read(wbuf,length);
    
            char * buf = new char[length];
            int ret=WideCharToMultiByte(
                    CP_ACP,
                    0,
                    wbuf,// wide-character string to be converted
                    -1,  //this parameter can be set to -1 if the string is null-terminated.
                    buf, //buffer that receives the converted string.
                    length, // output buf size
                    NULL, NULL);
    
            string s(buf);
            cout << " \n" << buf << '\n';

  7. #7
    Registered User
    Join Date
    Dec 2007
    Posts
    932
    Quote Originally Posted by MOS-6581 View Post
    Determining the encoding of a file is not easy. There are functions in Windows that can allow you to detect the encoding/code page through statistical analysis of the text file but in general you shouldn't try to detect that at all. Most programs simply go with whatever encoding/code page that is the default for the system if the encoding is unknown and then allow the user to manually reload the file with the correct encoding if needed.
    I want to use std::find. So if the file is in Unicode, it won't work when I compare 'hello' with 'h e l l o'.

  8. #8
    Registered User
    Join Date
    May 2014
    Posts
    121
    Keep in mind that there's no one-to-one correspondence in size between UTF-8 and UTF-16. A character encoded as UTF-16 can take up more or fewer bytes when converted to UTF-8. To obtain the correct size of the buffer that receives the conversion, first call WideCharToMultiByte like this:

    Code:
    int buffer_size_needed = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, string_to_convert, -1, NULL, 0, NULL, NULL);
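
    To illustrate why the size has to be computed rather than guessed, here is a portable sketch that counts the UTF-8 bytes a UTF-16 string needs. It is only a stand-in for the sizing call above, not a reimplementation of it: utf8_length is a hypothetical helper, it assumes well-formed input, and unlike the call above with -1 it does not count a terminating null.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-8 bytes needed to hold a UTF-16 string (terminator excluded).
// Assumes well-formed input; a lone surrogate is counted as 3 bytes,
// the size of the U+FFFD replacement character.
std::size_t utf8_length(const std::u16string& in)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (u < 0x80) {
            n += 1;                       // ASCII: 2 bytes in UTF-16, 1 in UTF-8
        } else if (u < 0x800) {
            n += 2;                       // same size in both encodings
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()) {
            n += 4;                       // surrogate pair: 4 bytes in both
            ++i;                          // skip the low surrogate
        } else {
            n += 3;                       // 2 bytes in UTF-16, 3 in UTF-8
        }
    }
    return n;
}
```

    "Hello" shrinks from 10 bytes to 5, while a single euro sign (U+20AC) grows from 2 bytes to 3, so a buffer sized from the wide-character count can be too small.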

  9. #9
    Registered User
    Join Date
    Dec 2007
    Posts
    932
    Quote Originally Posted by MOS-6581 View Post
    Keep in mind that there's no one-to-one correspondence in size between UTF-8 and UTF-16. A character encoded as UTF-16 can take up more or fewer bytes when converted to UTF-8. To obtain the correct size of the buffer that receives the conversion, first call WideCharToMultiByte like this:

    Code:
    int buffer_size_needed = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, string_to_convert, -1, NULL, 0, NULL, NULL);
    Thanks a lot. I tried it that way but I still only get garbage chars.
    Code:
         int buffer_size_needed = WideCharToMultiByte(CP_UTF8,
                                                         WC_ERR_INVALID_CHARS,
                                                         wbuf, //string_to_convert,
                                                         -1,
                                                         NULL,
                                                         0,
                                                         NULL,
                                                         NULL);
    
         int ret = WideCharToMultiByte(CP_UTF8,
                                        0,
                                        wbuf,// wide-character string to be converted
                                        -1,  //this parameter can be set to -1 if the string is null-terminated.
                                        buf, //buffer that receives the converted string.
                                        buffer_size_needed, // output buf size
                                        NULL, NULL);
                        
    
         for(int i=0; i<length; i++) cout << (char)buf[i];
    Last edited by Ducky; 03-16-2015 at 11:31 AM.

  10. #10
    Registered User
    Join Date
    May 2014
    Posts
    121
    Show the entire code. Don't forget to check the return value from functions. It's possible that WideCharToMultiByte is failing for some reason. It returns 0 if it failed.

  11. #11
    Registered User
    Join Date
    Dec 2007
    Posts
    932
    Thanks.

    This is the .txt file I open:
    Package Requires Reboot = no
    New Uninstall Key = inf2
    Progress Thread Wait = Success
    [Finish]
    [ResponseResult]
    ResultCode = 0
    It is read into wbuf; I verified it. I also tried to attach it, but the site says it's an invalid file.

    Code:
    #include <Windows.h>
    #include <locale>
    #include <vector>
    #include <string>
    #include <sstream>
    #include <fstream>
    #include <iostream>     // std::cout
    using namespace std;
    
    
    int main()
    {
        string szFilename = "IntelChipset.log";
        std::wifstream ifs(szFilename.c_str(), std::ifstream::binary);
        if(ifs)
        {
            // get length of file:
            ifs.seekg(0, ifs.end);
            std::streampos length = ifs.tellg();
            ifs.seekg(0, ifs.beg);
    
            wchar_t * wbuf = new wchar_t [length];
    
            cout << " Reading " << length << " characters... \n";
            // read data as a block:
            ifs.read(wbuf,length);
    
            wstring wstr;
            wstr.copy(wbuf,length,0);
            wcout << " wstr " << wstr << '\n';
            for(int i=0; i<length; i++) cout << (char)wstr[i];
    
            std::string::size_type pos = 0;
            if((pos = wstr.find(L"R e b o o t", pos)) != std::string::npos)
            {
                cout << "Found at "<< pos << '\n';
            }
            else
            {
               cout << " Not found " << '\n';
            }
    
            char * buf = new char[length];
    
            UINT CodePage = CP_UTF8;
    
            int buffer_size_needed = WideCharToMultiByte(
                                       CodePage,
                                       WC_ERR_INVALID_CHARS,
                                       wbuf, //string_to_convert,
                                       -1,
                                       NULL,
                                       0,
                                       NULL,
                                       NULL);
    
            cout << " buffer_size_needed " << buffer_size_needed << '\n';
    
            //int ret = wcstombs(buf, wbuf, length);
            int ret = WideCharToMultiByte(
                        CodePage,
                        0,
                        wbuf,
                        -1, 
                        buf,
                        buffer_size_needed,
                        NULL,
                        NULL);
    
            cout << " ret " << ret << '\n';
    
            string s(buf);
            
            cout << " \n string " << s << '\n';
        }
        else
        {
            cout << " Error opening " << szFilename << "\n";
            return 1;
        }
        ifs.close();
    
        return 0;
    }
    Last edited by Ducky; 03-16-2015 at 12:33 PM.

  12. #12
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    Code:
    "000000000  52 20 65 20 62 20 6F 20-6F 20 74                  |R e b o o t     |"
    I don't know if it just didn't come through over the internet but this is not sufficient. You really need to append a true null to your wide character strings. You really do need to convert UTF-16↔ANSI internally (and UTF-32 on *nixes).

    I don't know what Elysia's going to post so you might want to wait for him, but there is an okay solution surrounding basic_string:
    Code:
    typedef std::basic_string<uint16_t> u16string;
    
    u16string toUTF16(const std::string& ansi);
    std::string toANSI(const u16string& wide);
    Just create new strings with suitable byte patterns.

    Note that for file names and command line arguments you are probably going to want some help if those are wide strings as well. Use Boost.Filesystem, Qt, something.

  13. #13

  14. #14
    Registered User
    Join Date
    May 2014
    Posts
    121
    I did some testing myself and found some strange results that I can't explain. I first created a simple file in Notepad with the text "Hello" and then saved it as "Unicode". Windows reports that the file size is 12 bytes which is what I would expect since it's five characters + BOM.

    Looking at the integer values of what I read with wifstream::read, I get this:
    255 (BOM)
    254 (BOM)
    87
    0
    111
    0

    How the heck did I end up reading that? If I open the text file in Notepad again then I can still see "Hello" and the file size is still reported as 12 bytes.

    EDIT: It seems that wifstream::read actually reads the file one byte at a time and then stores each byte in a wchar_t. In other words, it's useless for reading UTF-16 on Windows.
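
    Given that finding, one workaround is to read the raw bytes with a plain ifstream and assemble the 16-bit code units by hand. This is a minimal sketch, assuming the file is little-endian UTF-16 as Notepad's "Unicode" option produces; read_utf16le is a hypothetical name, and std::u16string is used instead of wstring so the code unit is 16 bits on every platform.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes and assemble UTF-16LE code units by hand.
// Assumptions of this sketch: little-endian byte order, a leading BOM is
// skipped, and a trailing odd byte is ignored. A file that cannot be
// opened yields an empty string.
std::u16string read_utf16le(const std::string& filename)
{
    std::ifstream file(filename.c_str(), std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(file)),
                      std::istreambuf_iterator<char>());

    std::u16string text;
    text.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        text.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    if (!text.empty() && text[0] == 0xFEFF)
        text.erase(0, 1); // drop the byte order mark
    return text;
}
```

    The result can then be searched with a u"..." literal, for example text.find(u"Reboot"), without the spaced-out pattern.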
    Last edited by MOS-6581; 03-16-2015 at 03:19 PM.

  15. #15
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Yes, wifstream will simply perform a char -> wchar_t conversion under the hood when reading (each byte becomes one wchar_t) unless you replace the codecvt: Unicode text file

    Another solution: https://msdn.microsoft.com/en-us/library/ee292208.aspx

    And a MS CRT solution is to use the "ccs=ENCODING" extension: https://msdn.microsoft.com/en-us/library/yeby3zcb.aspx

    gg
