Thread: Fstream.read unicode file

  1. #16 · Registered User · Join Date: May 2014 · Posts: 121
    What's the alternative to fread if you just want to read sizeof(wchar_t) bytes at a time?

  2. #17 · C++まいる!Cをこわせ! · Join Date: Oct 2007 · Location: Inside my computer · Posts: 24,654
    Here's my library, in case anyone is interested.
    If you're only interested in the actual conversion functions, they're at the bottom.
    I tend to use containers to "remember" the encoding of some text in my applications, because many things produce ASCII and some (e.g. the Windows API, Boost, etc.) consume UTF-16, while I always use UTF-8 internally.
    So the usage is pretty easy. Just wrap your literals in an encoding, e.g.

    "Hello World"_u8
    "Hello World"_u16
    "Hello World"_ans

    or

    u8("Hello World")
    u16("Hello World")
    ans("Hello World")

    If you assign e.g. a u16 variable to a u8 variable, a UTF-16 -> UTF-8 conversion is performed automatically. You cannot append variables of different types, because the implicit conversion could be costly, so you have to explicitly convert all variables to the same type first. To access the actual string hosted in the container, dereference it (i.e. use *). To access its members, use -> (e.g. myvar->size()).
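
    A quick usage sketch (hypothetical values; assumes the header below is included):

    Code:
    #include <iostream>
    // ... the library header below ...
    
    int main()
    {
        u8 a("Hello");                   // UTF-8 container
        u16 w(L"World");                 // UTF-16 container
        a = w;                           // assignment converts UTF-16 -> UTF-8
        a += "!"_u8;                     // appending needs matching types
        std::cout << a->size() << '\n';  // -> reaches the hosted std::string
        std::string s = *a;              // * yields the hosted string itself
    }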

    Code:
    #pragma once
    
    #include <stdexcept>
    #include <string>
    #include <type_traits>
    #include <Windows.h>
    
    namespace stf
    {
    	namespace Errors
    	{
    		enum class XErrorType
    		{
    			FailedToConvertString,
    		};
    
    		class XBaseError: public std::runtime_error
    		{
    		public:
    			XBaseError(): std::runtime_error("") {}
    			virtual void ThrowIfError() const = 0;
    			virtual XErrorType GetErrorType() const = 0;
    			explicit operator bool() const { return IsErrorImpl(); }
    
    		private:
    			virtual bool IsErrorImpl() const = 0;
    		};
    
    		class XNone {};
    		
    		template<int N, XErrorType ErrorType, typename Base = XNone>
    		class XError: public XBaseError, public Base
    		{
    		public:
    			using XBaseError::XBaseError;
    
    		private:
    			using This_t = XError<N, ErrorType, Base>;
    			using NoError_t = XError<-1, ErrorType, Base>;
    
    			void ThrowIfErrorImpl(std::true_type) const {}
    			void ThrowIfErrorImpl(std::false_type) const { throw *this; }
    
    			virtual bool IsErrorImpl() const { return IsErrorImpl(std::is_same<This_t, NoError_t>()); }
    			bool IsErrorImpl(std::true_type) const { return false; }
    			bool IsErrorImpl(std::false_type) const { return true; }
    
    		public:
    			virtual void ThrowIfError() const { ThrowIfErrorImpl(std::is_same<This_t, NoError_t>()); }
    			explicit operator bool() const { return IsErrorImpl(); }
    			virtual XErrorType GetErrorType() const { return ErrorType; }
    			XError() {}
    		};
    
    		using FailedToConvertString = XError<2, XErrorType::FailedToConvertString>;
    	}
    }
    
    namespace stf
    {
    	template<typename T, int N> class XGenericContainer;
    }
    
    using ans = stf::XGenericContainer<std::string, 0>;
    using u8 = stf::XGenericContainer<std::string, 1>;
    using u16 = stf::XGenericContainer<std::wstring, 2>;
    
    namespace stf
    {
    	// Forward declarations: the Convert helpers below refer to these
    	// conversion functions before their definitions appear.
    	inline u8 ToUtf8(const ans& From);
    	inline u8 ToUtf8(const u8& From);
    	inline u8 ToUtf8(const u16& From);
    	inline u16 ToUtf16(const ans& From);
    	inline u16 ToUtf16(const u8& From);
    	inline u16 ToUtf16(const u16& From);
    
    	template<typename T, int N>
    	class XGenericContainer
    	{
    	public:
    		XGenericContainer() {}
    		explicit XGenericContainer(const char* str): m_v(str) {}
    		explicit XGenericContainer(const wchar_t* str): m_v(str) {}
    		explicit XGenericContainer(const std::string& str): m_v(str) {}
    		explicit XGenericContainer(const std::wstring& str): m_v(str) {}
    
    		template<typename OtherT, int OtherN>
    		explicit XGenericContainer(const XGenericContainer<OtherT, OtherN>& r):
    			m_v(Convert(this, r))
    		{}
    		
    		template<typename OtherT, int OtherN>
    		explicit XGenericContainer(XGenericContainer<OtherT, OtherN>&& r):
    			m_v( std::move(Convert(this, r)) )
    		{ }
    
    		friend XGenericContainer operator + (const XGenericContainer& l, const XGenericContainer& r) { return XGenericContainer(*l + *r); }
    		XGenericContainer& operator += (const XGenericContainer& that) { m_v += *that; return *this; }
    		
    		friend bool operator < (const XGenericContainer& l, const XGenericContainer& r) { return *l < *r; }
    		friend bool operator == (const XGenericContainer& l, const XGenericContainer& r) { return *l == *r; }
    		friend bool operator != (const XGenericContainer& l, const XGenericContainer& r) { return *l != *r; }
    
    		template<typename OtherT, int OtherN>
    		XGenericContainer& operator = (const XGenericContainer<OtherT, OtherN>& r)
    		{
    			m_v = Convert(this, r);
    			return *this;
    		}
    
    		template<typename OtherT, int OtherN>
    		XGenericContainer& operator = (XGenericContainer<OtherT, OtherN>&& r)
    		{
    			m_v = std::move(Convert(this, r));
    			return *this;
    		}
    
    		const T& operator * () const { return m_v; }
    		T& operator * () { return m_v; }
    		const T* operator -> () const { return &m_v; }
    		T* operator -> () { return &m_v; }
    		auto& operator [] (size_t idx) { return m_v[idx]; }
    		auto operator [] (size_t idx) const { return m_v[idx]; }
    
    		template <class Archive>
    		T save_minimal(const Archive&) const
    		{
    			return m_v;
    		}
    
    		template <class Archive>
    		void load_minimal(const Archive&, const T& value)
    		{
    			m_v = value;
    		}
    
    		std::string ToUtf8() const { return *u8(*this); }
    		std::wstring ToUtf16() const { return *u16(*this); }
    
    	private:
    		template<typename OtherT> std::string Convert(const u8*, const OtherT& that) { return *stf::ToUtf8(that); }
    		const std::string& Convert(const u8*, const u8& that) { return that.m_v; }
    		std::string& Convert(const u8*, u8& that) { return that.m_v; }
    
    		template<typename OtherT> std::wstring Convert(const u16*, const OtherT& that) { return *stf::ToUtf16(that); }
    		const std::wstring& Convert(const u16*, const u16& that) { return that.m_v; }
    		std::wstring& Convert(const u16*, u16& that) { return that.m_v; }
    
    		template<typename OtherT> std::string Convert(const ans*, const OtherT& that) { static_assert(!std::is_same<OtherT, OtherT>::value, "Cannot convert to an ansi string."); }
    		const std::string& Convert(const ans*, const ans& that) { return that.m_v; }
    		std::string& Convert(const ans*, ans& that) { return that.m_v; }
    
    		T m_v;
    	};
    
    	inline u16 AToU16(const std::string& From)
    	{
    		if (From.empty())
    			return u16(L"");
    		auto Len = MultiByteToWideChar(CP_ACP, 0, From.data(), (int)From.size(), nullptr, 0);
    		if (Len > 0)
    		{
    			std::wstring Str(Len, L'\0');
    			MultiByteToWideChar(CP_ACP, 0, From.data(), (int)From.size(), &Str[0], Len);
    			return u16(Str);
    		}
    		throw stf::Errors::FailedToConvertString();
    	}
    
    	inline u8 U16ToU8(const std::wstring& From)
    	{
    		if (From.empty())
    			return u8("");
    		auto Len = WideCharToMultiByte(CP_UTF8, 0, From.data(), (int)From.size(), nullptr, 0, 0, 0);
    		if (Len > 0)
    		{
    			std::string Str(Len, '\0');
    			WideCharToMultiByte(CP_UTF8, 0, From.data(), (int)From.size(), &Str[0], Len, 0, 0);
    			return u8(Str);
    		}
    		throw stf::Errors::FailedToConvertString();
    	}
    
    	inline u16 U8ToU16(const std::string& From)
    	{
    		if (From.empty())
    			return u16(L"");
    		auto Len = MultiByteToWideChar(CP_UTF8, 0, From.data(), (int)From.size(), nullptr, 0);
    		if (Len > 0)
    		{
    			std::wstring Str(Len, L'\0');
    			MultiByteToWideChar(CP_UTF8, 0, From.data(), (int)From.size(), &Str[0], Len);
    			return u16(Str);
    		}
    		throw stf::Errors::FailedToConvertString();
    	}
    
    	inline u8 AToU8(const std::string& From)
    	{
    		return U16ToU8(*AToU16(From));
    	}
    
    	inline u8 ToUtf8(const ans& From)
    	{
    		return AToU8(*From);
    	}
    
    	inline u8 ToUtf8(const u8& From)
    	{
    		return From;
    	}
    
    	inline u8 ToUtf8(const u16& From)
    	{
    		return U16ToU8(*From);
    	}
    
    	inline u16 ToUtf16(const ans& From)
    	{
    		return AToU16(*From);
    	}
    
    	inline u16 ToUtf16(const u8& From)
    	{
    		return U8ToU16(*From);
    	}
    
    	inline u16 ToUtf16(const u16& From)
    	{
    		return From;
    	}
    }
    
    inline u8 operator ""_u8(const char* str, size_t) { return u8(str); }
    inline u8 operator ""_u8(const wchar_t* str, size_t) { return stf::U16ToU8(str); }
    inline u16 operator ""_u16(const wchar_t* str, size_t) { return u16(str); }
    inline ans operator ""_ans(const char* str, size_t) { return ans(str); }

  3. #18 · Codeplug · Registered User · Join Date: Mar 2003 · Posts: 4,981
    >> What's the alternative to fread if you just want to read sizeof(wchar_t) bytes at a time?
    Open the file in binary mode and use fread (assuming the file uses the host byte order and the host's wchar_t size).
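
    Something like this (a minimal sketch; "in.txt" is a placeholder):

    Code:
    #include <cstdio>
    
    int main()
    {
        // Binary mode: bytes come through untranslated.
        std::FILE* f = std::fopen("in.txt", "rb");
        if (!f) return 1;
    
        wchar_t ch;
        while (std::fread(&ch, sizeof ch, 1, f) == 1)
        {
            // ... process one wchar_t-sized code unit at a time ...
        }
        std::fclose(f);
    }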

    gg

  4. #19 · Registered User · Join Date: Dec 2007 · Posts: 932
    Thanks for all the help and code posted by everybody, highly appreciated.

    So after all we shouldn't use wifstream at all. Thus the solution is in my very first post, and we don't even have to use converting functions because it's already converted to ASCII.

    Only the first character needs to be removed, if it gets in the way; it will be 0xffffffff = -1. It could also help to determine whether the file is ASCII or Unicode.

  5. #20 · C++まいる!Cをこわせ! · Join Date: Oct 2007 · Location: Inside my computer · Posts: 24,654
    Quote Originally Posted by Ducky View Post
    So after all we shouldn't use wifstream at all.
    Indeed. We should not. Wide streams are broken and the committee doesn't seem to want to fix them.

    Quote Originally Posted by Ducky View Post
    Thus the solution is in my very first post, and we don't even have to use converting functions because it's already converted to ASCII.
    That's wrong. char in C++ is encoding-agnostic (to an extent, so is wchar_t). chars are typically used to hold both ASCII and UTF-8. And I will repeat this: never ever use ASCII. Always use UTF-8.
    What narrow streams do is read whatever raw content is in the file; they do not read ASCII. You saw this when you tried to read UTF-16: basically there was ASCII interleaved with 0 bytes. That's just how the UTF-16 encoding works for all characters present in the ASCII character set: they're identical in Unicode, except that one byte is 0. You also saw that reading UTF-16 is a PITA™. The best way to deal with Unicode is to simply use UTF-8. Write UTF-8, read UTF-8, and store UTF-8 internally. All algorithms that work for ASCII work for UTF-8. You can also generate a UTF-8 locale to pass to e.g. Boost for dealing with Unicode.
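
    To illustrate that last point, a plain byte-oriented ASCII algorithm runs unchanged on UTF-8 text (a minimal sketch; the path is made up):

    Code:
    #include <iostream>
    #include <string>
    
    int main()
    {
        // Every byte of a UTF-8 multi-byte sequence is >= 0x80, so it can
        // never collide with an ASCII byte like '/' or '\n'.
        std::string path = u8"/home/日本語/file.txt"; // C++11 u8 literal
        std::size_t pos = path.rfind('/');            // plain byte search
        std::cout << path.substr(pos + 1) << '\n';    // prints "file.txt"
    }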

  6. #21 · Registered User · Join Date: Dec 2007 · Posts: 932
    Quote Originally Posted by Elysia View Post
    That's wrong. char in C++ is encoding-agnostic (to an extent, so is wchar_t). chars are typically used to hold both ASCII and UTF-8. And I will repeat this: never ever use ASCII. Always use UTF-8.
    Sorry Elysia, but I don't understand what you mean by that. I just want to read a file and want it to be "hello" instead of "h e l l o". I don't see what it has to do with using UTF-8, since I didn't create the file myself.

  7. #22 · Registered User · Join Date: Dec 2007 · Posts: 932
    What's wrong with this solution?

    First I will determine whether the file is a binary or a text file, and then:

    Code:
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;
    
    int main()
    {
        /// Is File Unicode or ASCII?
        string filename = "IntelChipset.log";
        string str;
        ifstream ifs(filename.c_str(), std::ios::binary);
        if(!ifs)
        {
            cout << " Error opening " << filename << "\n";
            return 1;
        }
        ifs.seekg(0, ifs.end);
        std::streampos length = ifs.tellg();
        ifs.seekg(0, ifs.beg);
        cout << " Size: " << length << "\n";
        std::vector<char> cVec(length);
        ifs.read(&cVec[0], length);
        ifs.close();
    
        if(cVec[0] == -1 && cVec[1] == -2)
        {
            cout << " File is Unicode " << "\n";
            wstring wstr = wstring((wchar_t *)&cVec[0]);
            string s(wstr.begin(),wstr.end());
            str=s;
    
            cout << str << " \n";
        }
        else
        {
            cout << " File is ASCII " << "\n";
         
            str=&cVec[0];
            cout << str << " \n";
        }
        return 0;
    }

  8. #23 · C++まいる!Cをこわせ! · Join Date: Oct 2007 · Location: Inside my computer · Posts: 24,654
    >>ifstream ifs(filename.c_str(), std::ios::binary);
    c_str() is not necessary.

    >>if(cVec[0] == -1 && cVec[1] == -2)
    Unicode files are not guaranteed to have a byte order mark. Plus there's no guarantee that you haven't encountered a file that simply happens to have 0xFF 0xFE as its first two bytes.

    >>string s(wstr.begin(),wstr.end());
    This assumes that all characters inside the file are ASCII characters. But that's not guaranteed. Unicode was made to support international characters, not just English. If, for example, the file contains Japanese, then this fails spectacularly. Either use a wstring, or do a Unicode conversion UTF-16 -> UTF-8 and store the converted string in a std::string.
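
    For instance, the conversion can be done with just the standard library (a sketch using C++11's std::wstring_convert; note it was later deprecated in C++17):

    Code:
    #include <codecvt>
    #include <locale>
    #include <string>
    
    std::string Utf16ToUtf8(const std::wstring& w)
    {
        // Treat the wchar_t sequence as UTF-16 and re-encode it as UTF-8.
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
        return conv.to_bytes(w);
    }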

    >>str=&cVec[0];
    The buffer is not null-terminated: assigning a char* to a std::string reads until the first zero byte, which here may be past the end of the vector. Use str.assign(&cVec[0], cVec.size()) instead.

  9. #24 · Registered User · Join Date: Dec 2007 · Posts: 932
    Thanks for the input Elysia!

    Quote Originally Posted by Elysia View Post

    >>if(cVec[0] == -1 && cVec[1] == -2)
    Unicode files are not guaranteed to have a byte order mark. Plus there's no guarantee that you haven't encountered a file that simply happens to have 0xFF 0xFE as its first two bytes.
    This would be a bit better:
    Code:
    if( (cVec[0] == -1 && cVec[1] == -2 && __isascii(cVec[2])) ||
        (cVec[1] == '\0' && cVec[3] == '\0') )
    >>string s(wstr.begin(),wstr.end());
    This assumes that all characters inside the file are ASCII characters. But that's not guaranteed. Unicode was made to support international characters, not just English. If, for example, the file contains Japanese, then this fails spectacularly. Either use a wstring, or do a Unicode conversion UTF-16 -> UTF-8 and store the converted string in a std::string.
    You are probably right, but apparently it doesn't fail with the two BOM characters that aren't ASCII.

    But I'll convert it instead, if you say so.

  10. #25 · C++まいる!Cをこわせ! · Join Date: Oct 2007 · Location: Inside my computer · Posts: 24,654
    Quote Originally Posted by Ducky View Post
    This would be a bit better:
    No. Stop trying to detect file encoding. You're doing it wrong anyway.

    Quote Originally Posted by Ducky View Post
    You are probably right, but apparently it doesn't fail with the two BOM characters that aren't ASCII.
    It really depends on what you mean by fail. If the BOM is not there, of course you won't have that problem. If it's there, then you will get a "garbage character" as your first character.

    Quote Originally Posted by Ducky View Post
    But I'll convert it instead, if you say so.
    Assume a file encoding or let the user choose. If the format is already determined, just use that encoding. Otherwise choose UTF-8 as your own file encoding.
    I also recommend you convert any UTF-16 data to UTF-8. UTF-8 is more compatible with APIs (don't forget to use the Windows wide APIs, though, since the narrow APIs only understand the legacy ANSI code page; the wide versions take UTF-16, so convert to that before passing strings to them).
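
    In code, that boundary conversion looks something like this (a sketch with error handling omitted; Widen is a made-up helper name):

    Code:
    #include <string>
    #include <Windows.h>
    
    // Keep UTF-8 internally; widen to UTF-16 only when calling a wide API.
    std::wstring Widen(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        int Len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
        std::wstring Out(Len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &Out[0], Len);
        return Out;
    }
    
    // e.g.  MessageBoxW(nullptr, Widen(text).c_str(), L"Title", MB_OK);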

  11. #26 · Registered User · Join Date: Dec 2007 · Posts: 932
    Quote Originally Posted by Elysia View Post
    Assume a file encoding or let the user choose. If the format is already determined, just use that encoding. Otherwise choose UTF-8 as your own file encoding.
    I also recommend you convert any UTF-16 data to UTF-8. UTF-8 is more compatible with APIs (don't forget to use the Windows wide APIs, though, since the narrow APIs only understand the legacy ANSI code page; the wide versions take UTF-16, so convert to that before passing strings to them).
    So I shouldn't detect the file encoding, just convert every file I open to UTF-8?

    Yeah, but if you don't know what it is to begin with, how do you convert it?

    How does Windows know, when it searches file content, what the file encoding is?

  12. #27 · Registered User · Join Date: May 2014 · Posts: 121
    You can statistically analyze files to determine the encoding, but it's not 100% reliable, nor something you should do yourself. The best way is to just read the file as it is and then have the user specify the encoding when needed. You don't need to convert it to UTF-8 because you're already assuming that it's encoded as UTF-8 or some compatible encoding. This works well because most files are just plain old ASCII, which is compatible with UTF-8.
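
    For example, slurping the raw bytes and deferring any interpretation (a minimal sketch):

    Code:
    #include <fstream>
    #include <iterator>
    #include <string>
    
    // Read the file's bytes exactly as they are on disk; decide on (or ask
    // the user for) an encoding later, only if some operation needs one.
    std::string ReadAll(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        return std::string(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
    }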

    What Elysia is saying is that you should keep strings in the UTF-8 format internally and then only change the encoding when the outside world demands another encoding. This is the most portable way to write programs. It only makes a difference when you actually know the encoding of the strings you're working with. Unicode strings in the Windows API are encoded as UTF-16, so all your code that interfaces with that API should convert strings between UTF-8 and UTF-16 as needed. Internally you keep the strings as UTF-8, so you can use the same functions for all your string processing, and then you change the encoding back to UTF-16 if you need to give the string back to Windows (if you're opening a file or something).
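
    The direction back from Windows might look like this (again a sketch; Narrow is a made-up helper name):

    Code:
    #include <string>
    #include <Windows.h>
    
    // UTF-16 from a wide API -> UTF-8 for internal storage.
    std::string Narrow(const std::wstring& utf16)
    {
        if (utf16.empty()) return std::string();
        int Len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(), nullptr, 0, nullptr, nullptr);
        std::string Out(Len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(), &Out[0], Len, nullptr, nullptr);
        return Out;
    }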

  13. #28 · whiteflags · Lurking · Join Date: Apr 2006 · Location: United States · Posts: 9,612
    Quote Originally Posted by Ducky View Post
    So I shouldn't detect the file encoding, just convert every file I open to UTF-8?

    Yeah, but if you don't know what it is to begin with, how do you convert it?

    How does Windows know, when it searches file content, what the file encoding is?
    Well, before Elysia's suggestion dominates the thread, I do have my own opinion on it. Do you actually know what the encoding is? People in the thread have guessed that the encoding is UTF-16, but you never confirmed this. It may be easiest to always use UTF-16, especially since the file appears to be in that format.

    You can certainly do UTF-16→UTF-8, but I think that would be a waste of time unless you want to display output on the screen. The reason that works is that an ASCII string like "reboot" is automatically also UTF-8 encoded: ASCII is a subset of UTF-8. So if you knew a priori that the file is ostensibly ASCII-English, then using a UTF-8 string with std::cout, etc. is a no-brainer.

  14. #29 · Codeplug · Registered User · Join Date: Mar 2003 · Posts: 4,981
    The only analysis I bother with is looking for a BOM.

    No BOM on Windows: Assume ACP encoded
    No BOM on *nix: Assume UTF8 encoded
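
    Roughly like this (a sketch; the byte values are the standard BOM patterns):

    Code:
    #include <cstddef>
    #include <string>
    
    // Returns the encoding indicated by a BOM, or "" if there is none
    // (in which case: assume ACP on Windows, UTF-8 on *nix).
    std::string SniffBom(const unsigned char* b, std::size_t n)
    {
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
        return "";
    }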

    gg

  15. #30 · C++まいる!Cをこわせ! · Join Date: Oct 2007 · Location: Inside my computer · Posts: 24,654
    Quote Originally Posted by Ducky View Post
    So I shouldn't detect the file encoding, just convert every file I open to UTF-8?

    Yeah, but if you don't know what it is to begin with, how do you convert it?

    How does Windows know, when it searches file content, what the file encoding is?
    You should know the encoding of the file you're trying to open. Think about it for a moment. Is it a file you get from somewhere? If so, then that "somewhere" must know the encoding. Is it a static one-time file? Then you can probably open it and figure the encoding out. Is it your own file? Then you should decide on the encoding. Then just program the logic to use the encoding you know the file has. Avoid trying to determine the encoding in code: it's complicated, and you may just get it wrong.

    I'm just saying that for your own files, just use UTF8.

    Quote Originally Posted by whiteflags View Post
    You can certainly do UTF-16→UTF-8, but I think that would be a waste of time unless you want to display output on the screen. The reason that works is that an ASCII string like "reboot" is automatically also UTF-8 encoded: ASCII is a subset of UTF-8. So if you knew a priori that the file is ostensibly ASCII-English, then using a UTF-8 string with std::cout, etc. is a no-brainer.
    Experience has taught me that anything that involves wide characters is broken. So yes, you really DO want to use UTF-8 internally. But UTF-8 is not bulletproof either: cout'ing Unicode characters to a console on Windows will still cause problems, because they won't display properly. Still, you can dump UTF-8 to a file directly via narrow streams and read it back without problems. You can't do that with UTF-16.
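
    For example (a minimal sketch; "demo.txt" is a placeholder):

    Code:
    #include <fstream>
    #include <iostream>
    #include <string>
    
    int main()
    {
        // UTF-8 passes through narrow streams byte-for-byte: no locale,
        // no codecvt, no BOM needed.
        { std::ofstream out("demo.txt", std::ios::binary); out << u8"héllo, 世界\n"; }
    
        std::ifstream in("demo.txt", std::ios::binary);
        std::string line;
        std::getline(in, line);     // '\n' is still a single byte in UTF-8
        std::cout << line << '\n';  // same bytes we wrote
    }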
