I've got a UTF-8-encoded text file (no BOM) that I want to read from.
I open it with

Code:
_tfopen_s(&f->fh, buf, "rb, ccs=UTF-8");

and everything looks OK to that point. Later I try to get a wide(?) character from it with

Code:
_TINT i = _fgettc(f->fh);

but it seems to have forgotten that the file is in UTF-8 and is instead treating it as if it were UTF-16(?).
I guess I could just take one byte at a time with fgetc and post-process the result to cope with multi-byte characters, but is that the way you're supposed to do it, or have I missed something?
I was expecting the 'get' function to take as many bytes as needed, based on the Unicode encoding in use (UTF-8), and assemble them into a single character for me. Otherwise, what's the point of telling it to open the file as UTF-8 in the first place?