Thread: Fstream.read unicode file

03-16-2015 #1
Ducky

1. If I open a Unicode file with Notepad, I get: "Package Requires Reboot = no".

But if I read it with fstream, I get: "P a c k a g e R e q u i r e s R e b o o t = n o".

I found the reason and a solution for it in this article, but I'm wondering whether it is the right solution.

Code:

wstring ReadUTF16(const string & filename)
{
    ifstream file(filename.c_str(), std::ios::binary);
    stringstream ss;
    ss << file.rdbuf() << '\0';
    return wstring((wchar_t *)ss.str().c_str());
}

http://cfc.kizzx2.com/index.php/read...-in-windows-c/

Wouldn't this solution be faster because the '<<' operator copies byte by byte?

Code:

wstring ReadUTF16(char * filename)
{
    ifstream ifs(filename, std::ios::binary);
    ifs.seekg(0, ifs.end);
    std::streampos length = ifs.tellg();
    ifs.seekg(0, ifs.beg);
    std::vector<char> cVec(length);
    ifs.read(&cVec[0],length);
    return wstring((wchar_t *)&cVec[0]);
}
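Editor's note: both versions above cast a byte buffer to wchar_t* and let wstring look for a terminating wide null, and both keep the byte order mark as the first character. A minimal portable sketch that avoids the cast is shown here. It assumes the file is UTF-16 little-endian (what Notepad writes as "Unicode"), and the names DecodeUtf16LE and ReadUtf16LEFile are made up for illustration. It returns std::u16string rather than wstring so that it does not depend on wchar_t being 16 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: turn UTF-16LE bytes into code units.
// The length comes from the byte count, not from a terminating null.
std::u16string DecodeUtf16LE(const std::vector<unsigned char>& bytes)
{
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Low byte first, then high byte (little-endian).
        out.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    // Drop a leading byte order mark (U+FEFF) if there is one.
    if (!out.empty() && out[0] == 0xFEFF) {
        out.erase(0, 1);
    }
    return out;
}

// Hypothetical helper: read the whole file in binary mode and decode it.
std::u16string ReadUtf16LEFile(const std::string& filename)
{
    std::ifstream file(filename, std::ios::binary);
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(file)),
                                     std::istreambuf_iterator<char>());
    return DecodeUtf16LE(bytes);
}
```

A trailing odd byte is ignored, and a file that fails to open yields an empty string; real code would probably want to report both cases.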

2. And if I want to write a function that reads both ANSI and Unicode files, should I check the first two bytes to determine the encoding first?
Last edited by Ducky; 03-16-2015 at 03:45 AM.

Using Windows 10 with Code Blocks and MingW.
03-16-2015 #2
MOS-6581

If the text file contains UTF-16 text, then the correct solution is to use functions that read wchar_t data (wchar_t is 16 bits on Windows).

You can use wifstream instead of ifstream.
03-16-2015 #3
Ducky

So I determine first the encoding of the file and then I open it up with fstream or wfstream accordingly?

And then I use wcstombs(str, wstr, sizeof(str)); to translate it to ANSI?

03-16-2015 #4
MOS-6581

Originally Posted by Ducky

So I determine first the encoding of the file and then I open it up with fstream or wfstream accordingly?

Determining the encoding of a file is not easy. There are functions in Windows that can allow you to detect the encoding/code page through statistical analysis of the text file but in general you shouldn't try to detect that at all. Most programs simply go with whatever encoding/code page that is the default for the system if the encoding is unknown and then allow the user to manually reload the file with the correct encoding if needed.

Originally Posted by Ducky

And then I use wcstombs(str, wstr, sizeof(str)); to translate it to ANSI?

If the text file contains Unicode characters outside the ASCII range, it doesn't make sense to convert the text to ASCII, since those characters can't be represented in ASCII. Use WideCharToMultiByte and pass CP_UTF8 as the CodePage argument if you want the text as a sequence of 8-bit code units. UTF-8 is compatible with ASCII, so if you know that your text file only contains characters from the ASCII set, you can treat the UTF-8 string you get from WideCharToMultiByte as ASCII.
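Editor's note: to make the UTF-16 to UTF-8 step concrete without depending on Windows, here is a hand-written sketch of the same conversion. It is not what WideCharToMultiByte does internally; Utf16ToUtf8 is a hypothetical name, and replacing unpaired surrogates with U+FFFD is a choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Windows API): encode UTF-16 code units as UTF-8.
std::string Utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow = cp >= 0xDC00 && cp <= 0xDFFF;
        if (isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (isHigh || isLow) {
            cp = 0xFFFD; // unpaired surrogate: use the replacement character
        }
        if (cp < 0x80) {            // ASCII: one byte, value unchanged
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // two bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // three bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // four bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For pure ASCII input such as u"Reboot = no", the result is byte-for-byte the ASCII string, which is the compatibility property described above.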
03-16-2015 #5
Elysia

Originally Posted by Ducky

1. If I open a unicode file with notepad, I get: "Package Requires Reboot = no".

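If the file is in UTF-16, each character takes up at least two bytes. Every ASCII character has the same value in Unicode, so in little-endian UTF-16 its first byte is the ASCII code and its second byte is 0. Read byte by byte, those zero bytes show up as the gaps between the letters. So that explains why.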
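Editor's note: the byte layout described here is easy to check. The sketch below serializes UTF-16 code units as little-endian bytes; ToUtf16LEBytes is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: write each UTF-16 code unit as two bytes, low byte first.
std::vector<unsigned char> ToUtf16LEBytes(const std::u16string& text)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF)); // low byte
        bytes.push_back(static_cast<unsigned char>(unit >> 8));   // high byte
    }
    return bytes;
}
```

For u"no" this gives the four bytes 6E 00 6F 00. Printed one char at a time, the zero bytes typically render as blanks, which is the "n o" effect from the first post.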

>>char * filename
Don't use char* for string literals! Use std::string.

Originally Posted by Ducky

2. And if I want to write a function to read both ANSI and Unicode should I check for the two first byte to know the encoding first?

Originally Posted by Ducky

So I determine first the encoding of the file and then I open it up with fstream or wfstream accordingly?

Not all files carry encoding information. Files written on Linux are very often UTF-8 without any byte order mark indicating that they are Unicode. Indeed, this is actually the recommended way, so determining the encoding is not always easy.
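Editor's note: checking the first bytes, as asked in question 2, only helps when a byte order mark is actually present. A minimal sketch of such a check follows; the Bom enum and DetectBom are names invented for this example, and Bom::None means "unknown", not "ANSI".

```cpp
#include <cassert>
#include <vector>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Look at the first bytes of a file for a byte order mark.
// Limitation: a UTF-32LE mark (FF FE 00 00) would be reported as Utf16LE here.
Bom DetectBom(const std::vector<unsigned char>& bytes)
{
    if (bytes.size() >= 3 &&
        bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        return Bom::Utf8;
    }
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        return Bom::Utf16LE;
    }
    if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
        return Bom::Utf16BE;
    }
    return Bom::None; // no mark: the encoding has to come from somewhere else
}
```

A file with no mark may be UTF-8, a legacy code page, or even UTF-16 without a BOM, so the caller still needs a default or a way for the user to override the guess.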

Originally Posted by Ducky

And then I use wcstombs(str, wstr, sizeof(str)); to translate it to ANSI?

Never use ANSI. Use UTF-8 internally and store it in a std::string. I've got code to convert between ANSI <--> Unicode, as well as UTF8 <--> UTF16. I can share it if you're interested.


03-16-2015 #6

Ducky


Originally Posted by Elysia

Never use ANSI. Use UTF-8 internally and store it in a std::string. I've got code to convert between ANSI <--> Unicode, as well as UTF8 <--> UTF16. I can share it if you're interested.

Yes please, I would be interested, because with the CP_UTF8 parameter I get some weird characters, and with CP_ACP it won't get translated at all; it stays Unicode.
Code:
        wchar_t * wbuf = new wchar_t [length];
        ifs.read(wbuf,length);

        char * buf = new char[length];
        int ret=WideCharToMultiByte(
                CP_ACP,
                0,
                wbuf,// wide-character string to be converted
                -1,  // this parameter can be set to -1 if the string is null-terminated.
                buf, //buffer that receives the converted string.
                length, // output buf size
                NULL, NULL);

        string s(buf);
        cout << " \n" << buf << '\n';


03-16-2015 #7
Ducky

Originally Posted by MOS-6581

Determining the encoding of a file is not easy. There are functions in Windows that can allow you to detect the encoding/code page through statistical analysis of the text file but in general you shouldn't try to detect that at all. Most programs simply go with whatever encoding/code page that is the default for the system if the encoding is unknown and then allow the user to manually reload the file with the correct encoding if needed.

I want to use std::find. So if the file is in Unicode, it won't work, because I'd be comparing 'hello' with 'h e l l o'.
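Editor's note: the search fails because the needle and the file content are in different encodings, so one of the two has to be converted before comparing. One possible sketch, assuming the raw file bytes are UTF-16LE and the search term is plain ASCII, widens the needle to the same byte pattern and searches the raw bytes. FindAsciiInUtf16LE is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: find an ASCII needle inside raw UTF-16LE bytes.
// Returns the byte offset of the match, or std::string::npos.
std::size_t FindAsciiInUtf16LE(const std::string& raw, const std::string& asciiNeedle)
{
    // Widen the needle: each ASCII char becomes the pair (char, 0).
    std::string wide;
    for (char c : asciiNeedle) {
        wide.push_back(c);
        wide.push_back('\0');
    }
    // Accept only matches that start on a code unit boundary (even offset).
    std::size_t pos = raw.find(wide);
    while (pos != std::string::npos && pos % 2 != 0) {
        pos = raw.find(wide, pos + 1);
    }
    return pos;
}
```

The other direction works too: decode the file into a wide or UTF-8 string once and then search that with a needle in the same encoding, which is usually simpler when more than one search is needed.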

03-16-2015 #8
MOS-6581

Keep in mind that there's not a one-to-one mapping between UTF-8 and UTF-16. A character encoded as UTF-16 can take up more or fewer bytes when converted to UTF-8. To obtain the correct size for the buffer that receives the conversion, first call WideCharToMultiByte like this:

Code:

int buffer_size_needed = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, string_to_convert, -1, NULL, 0, NULL, NULL);
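Editor's note: the sizing call above asks Windows how many bytes the UTF-8 result needs, including the terminating null when the input length is -1. The size difference itself can be illustrated portably with a counting pass; Utf8LengthOf is a hypothetical helper, it does not count a terminator, and it counts an unpaired surrogate as 3 bytes on the assumption that it would be replaced by U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 bytes needed for a UTF-16 string.
std::size_t Utf8LengthOf(const std::u16string& text)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t unit = text[i];
        if (unit < 0x80) {
            bytes += 1; // ASCII: 2 bytes in UTF-16, 1 byte in UTF-8
        } else if (unit < 0x800) {
            bytes += 2; // same size in both encodings
        } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < text.size() &&
                   text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            bytes += 4; // surrogate pair: 4 bytes in both encodings
            ++i;
        } else {
            bytes += 3; // rest of the BMP: 2 bytes in UTF-16, 3 in UTF-8
        }
    }
    return bytes;
}
```

So an ASCII-only log file shrinks to half its size in UTF-8, while text made of characters such as the euro sign grows, which is why a buffer sized from the UTF-16 length alone can be too small.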

03-16-2015 #9

Ducky


Originally Posted by MOS-6581


Keep in mind that there's not a one-to-one mapping between UTF-8 and UTF-16. A character encoded as UTF-16 can take up more or fewer bytes when converted to UTF-8. To obtain the correct size for the buffer that receives the conversion, first call WideCharToMultiByte like this:

Code:

int buffer_size_needed = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, string_to_convert, -1, NULL, 0, NULL, NULL);

Thanks a lot. I tried it that way but I still only get garbage chars.

Code:

     int buffer_size_needed = WideCharToMultiByte(CP_UTF8,
                                                     WC_ERR_INVALID_CHARS,
                                                     wbuf, //string_to_convert,
                                                     -1,
                                                     NULL,
                                                     0,
                                                     NULL,
                                                     NULL);

     int ret = WideCharToMultiByte(CP_UTF8,
                                    0,
                                    wbuf,// wide-character string to be converted
                                    -1,  // this parameter can be set to -1 if the string is null-terminated.
                                    buf, //buffer that receives the converted string.
                                    buffer_size_needed, // output buf size
                                    NULL, NULL);
                    

     for(int i=0; i<length; i++) cout << (char)buf[i];

Last edited by Ducky; 03-16-2015 at 11:31 AM.


03-16-2015 #10
MOS-6581

Show the entire code. Don't forget to check the return value from functions. It's possible that WideCharToMultiByte is failing for some reason. It returns 0 if it failed.

03-16-2015 #11

Ducky


Thanks.

This is the .txt file I open:

Package Requires Reboot = no
New Uninstall Key = inf2
Progress Thread Wait = Success
[Finish]
[ResponseResult]
ResultCode = 0

It is read into wbuf; I verified that. I also tried to attach the file, but the site says it's an invalid file.

Code:

#include <Windows.h>
#include <locale>
#include <vector>
#include <string>
#include <sstream>
#include <fstream>
#include <iostream>     // std::cout
using namespace std;


int main()
{
    string szFilename = "IntelChipset.log";
    std::wifstream ifs(szFilename.c_str(), std::ifstream::binary);
    if(ifs)
    {
        // get length of file:
        ifs.seekg(0, ifs.end);
        std::streampos length = ifs.tellg();
        ifs.seekg(0, ifs.beg);

        wchar_t * wbuf = new wchar_t [length];

        cout << " Reading " << length << " characters... \n";
        // read data as a block:
        ifs.read(wbuf,length);

        wstring wstr;
        wstr.copy(wbuf,length,0);
        wcout << " wstr " << wstr << '\n';
        for(int i=0; i<length; i++) cout << (char)wstr[i];

        std::string::size_type pos = 0;
        if((pos = wstr.find(L"R e b o o t", pos)) != std::string::npos)
        {
            cout << "Found at "<< pos << '\n';
        }
        else
        {
           cout << " Not found " << '\n';
        }

        char * buf = new char[length];

        UINT CodePage = CP_UTF8;

        int buffer_size_needed = WideCharToMultiByte(
                                   CodePage,
                                   WC_ERR_INVALID_CHARS,
                                   wbuf, //string_to_convert,
                                   -1,
                                   NULL,
                                   0,
                                   NULL,
                                   NULL);

        cout << " buffer_size_needed " << buffer_size_needed << '\n';

        //int ret = wcstombs(buf, wbuf, length);
        int ret = WideCharToMultiByte(
                    CodePage,
                    0,
                    wbuf,
                    -1, 
                    buf,
                    buffer_size_needed,
                    NULL,
                    NULL);

        cout << " ret " << ret << '\n';

        string s(buf);
        
        cout << " \n string " << s << '\n';
    }
    else
    {
        cout << " Error opening " << szFilename << "\n";
        return 1;
    }
    ifs.close();

    return 0;
}

Last edited by Ducky; 03-16-2015 at 12:33 PM.


03-16-2015 #12
whiteflags

Code:

"000000000 52 20 65 20 62 20 6F 20-6F 20 74 |R e b o o t |"

I don't know if it just didn't come through over the internet but this is not sufficient. You really need to append a true null to your wide character strings. You really do need to convert UTF-16↔ANSI internally (and UTF-32 on *nixes).

I don't know what Elysia's going to post so you might want to wait for him, but there is an okay solution surrounding basic_string:

Code:

typedef std::basic_string<uint16_t> u16string;
u16string toUTF16(const std::string& ansi);
std::string toANSI(const u16string& wide);

Just create new strings with suitable byte patterns.

Note that for file names and command line arguments you are probably going to want some help if those are wide strings as well. Use Boost.Filesystem, Qt, something.
03-16-2015 #13
Codeplug

Here's my version: How to set up libconv in visual c++

gg

How to Ask Questions The Smart Way
How to Report Bugs Effectively
The Only Correct Indent Style
03-16-2015 #14
MOS-6581

I did some testing myself and found some strange results that I can't explain. I first created a simple file in Notepad with the text "Hello" and then saved it as "Unicode". Windows reports that the file size is 12 bytes which is what I would expect since it's five characters + BOM.

Looking at the integer values of what I read with wifstream::read, I get this:
255 (BOM)
254 (BOM)
87
0
111
0

How the heck did I end up reading that? If I open the text file in Notepad again then I can still see "Hello" and the file size is still reported as 12 bytes.

EDIT: It seems that wifstream::read actually reads the file one byte at a time and then stores each byte in a wchar_t. In other words, it's useless for reading UTF-16 on Windows.

Last edited by MOS-6581; 03-16-2015 at 03:19 PM.
03-16-2015 #15
Codeplug

Yes, when reading, wifstream simply converts each char in the file to a wchar_t under the hood unless you replace the codecvt facet: Unicode text file

Another solution: https://msdn.microsoft.com/en-us/library/ee292208.aspx

And a MS CRT solution is to use the "ccs=ENCODING" extension: https://msdn.microsoft.com/en-us/library/yeby3zcb.aspx
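Editor's note: the first option above, replacing the codecvt facet, can be sketched with the standard std::codecvt_utf16 facet from <codecvt>. This assumes a UTF-16 file that is little-endian unless its byte order mark says otherwise, and ReadUtf16WithCodecvt is a hypothetical name. Note that <codecvt> is deprecated since C++17, and that where wchar_t is 16 bits this facet produces UCS-2, so characters outside the Basic Multilingual Plane are not handled there.

```cpp
#include <cassert>
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: read a UTF-16 file through wifstream with a
// UTF-16 conversion facet instead of the default char-to-wchar_t widening.
std::wstring ReadUtf16WithCodecvt(const std::string& filename)
{
    std::wifstream file(filename, std::ios::binary);
    // consume_header: skip the BOM and use it to pick the byte order.
    // little_endian: byte order to assume when no BOM is present.
    file.imbue(std::locale(
        file.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
                               static_cast<std::codecvt_mode>(
                                   std::consume_header | std::little_endian)>));
    std::wstringstream contents;
    contents << file.rdbuf(); // the facet converts bytes to wchar_t as they are read
    return contents.str();
}
```

The facet has to be imbued before the first read, and the stream must be opened in binary mode so that no newline translation interferes with the two-byte code units.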

