Thread: Reading from a UTF-8 text file

  1. #1
    Registered User
    Join Date
    Mar 2009
    Posts
    46

    Reading from a UTF-8 text file

    I've got a UTF-8 format (no BOM) text file that I want to read from.

    I open it with

    Code:
    	_tfopen_s(&f->fh, buf, "rb, ccs=UTF-8");
    and everything looks OK to that point. Later I try to get a wide(?)character from it with

    Code:
    	_TINT i = _fgettc(f->fh);
    but it seems to have forgotten that the file is in UTF-8 and is instead treating it like it is in UTF-16(?).

    I guess I could just take one byte at a time with fgetc and post-process it to cope with multi-byte characters but is that the way you're supposed to do it or have I missed something?

    I was expecting that the 'get' function would take as many bytes as needed based on the Unicode encoding used (UTF-8) and put them together for me. Otherwise what's the point of telling it to open as UTF-8 in the first place?

  2. #2
    Registered User
    Join Date
    Dec 2008
    Location
    Black River
    Posts
    128
    As far as I know, _fgettc will expand to either fgetwc or fgetc (based on whether _UNICODE is defined or not), so yes, you'll most likely have to read single bytes and assemble them yourself.

    Quote Originally Posted by PaulBlay
    I was expecting that the 'get' function would take as many bytes as needed based on the Unicode encoding used (UTF-8) and put them together for me.
    That's what it does if _fgettc is fgetwc. The problem, I think, is that it can only "put them together" to form a UTF-16 character (Technically UCS-2).
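
    For readers unfamiliar with the TCHAR mechanism, here is a minimal sketch of the kind of conditional mapping described above. The names MY_UNICODE, my_tint, my_fgettc, MY_TEOF and count_tchars are hypothetical stand-ins made up for illustration; the real definitions live in <tchar.h> and are keyed on _UNICODE.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical stand-ins for the <tchar.h> mappings; these names are
   invented for this sketch and are not part of any real header. */
#ifdef MY_UNICODE
typedef wint_t my_tint;      /* wide build: read wide characters */
#define my_fgettc fgetwc
#define MY_TEOF   WEOF
#else
typedef int my_tint;         /* narrow build: read single bytes */
#define my_fgettc fgetc
#define MY_TEOF   EOF
#endif

/* Counts how many values the selected read function returns before
   end of file. In the narrow build this is simply the byte count. */
long count_tchars(FILE *f)
{
    long n = 0;
    my_tint c;

    while ((c = my_fgettc(f)) != MY_TEOF)
        ++n;
    return n;
}
```

    The same source line compiles to a byte read or a wide read depending on one macro, which is why the behaviour of _fgettc depends on how the project is built.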

  3. #3
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    In addition, don't bother yourself with _fgettc or anything to do with TCHARs. If you are using Unicode on Windows, use wchar_t and functions using wchar_t directly.

    >> (Technically UCS-2)
    That is true for 98/ME, but Win2K and later platforms support true UTF16LE.
    cboard.cprogramming.com/c-programming/104360-size-%5B-bytes%5D-unicode-string.htm

    Under Windows, wchar_t's hold UTF16LE code units. So fgetwc() will return a UTF16LE code unit by converting the UTF8 file contents. Remember to store the result of fgetwc() in a wint_t so you can compare it with WEOF correctly.

    Also keep in mind that since you are using binary mode, if you encounter a file with a UTF8 BOM it will not be skipped over for you. So you may want to check if the first 3 bytes are EF BB BF.
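
    A minimal sketch of that BOM check, assuming the file has been opened as a plain byte-oriented binary stream ("rb" with no ccs= flag) and is still positioned at the start. The helper name skip_utf8_bom is hypothetical, not a CRT function.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the stream starts with the UTF-8 BOM (EF BB BF) and leaves
   the position just past it; otherwise rewinds to the start and returns 0.
   Assumes a byte-oriented binary stream positioned at offset 0. */
int skip_utf8_bom(FILE *f)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    unsigned char head[3];
    size_t n = fread(head, 1, sizeof head, f);

    if (n == sizeof head && memcmp(head, bom, sizeof bom) == 0)
        return 1;

    rewind(f);   /* no BOM: go back so no data is lost (also clears EOF) */
    return 0;
}
```

    Calling skip_utf8_bom(f) right after opening means the rest of the reading code never has to care whether the file had a BOM.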

    gg

  4. #4
    Registered User
    Join Date
    Mar 2009
    Posts
    46
    Thanks for the feedback, both of you.

    Quote Originally Posted by Codeplug
    In addition, don't bother yourself with _fgettc or anything to do with TCHARs. If you are using Unicode on Windows, use wchar_t and functions using wchar_t directly.
    I am, and I know that I could, but I do have my reasons. Basically I thought they would be handy macros to redefine when/if I get to the point of looking at portability to other platforms.

    That is true for 98/ME
    Which I have no intention of touching. As I understand it support of Unicode was patchy at best for the Windows API at that point and I'm going to have enough trouble with everything else I need to do.

    Under Windows, wchar_t's hold UTF16LE code units. So fgetwc() will return a UTF16LE code unit by converting the UTF8 file contents.
    That's the bit I don't quite get. Just because the output is wint_t (for later wchar_t use) doesn't mean that the function couldn't have been made to read the one to three UTF8 bytes needed to produce the right Unicode codepoint. But I understand that the world is as it is so there's no point arguing about how I think things should be.

    Also keep in mind that since you are using binary mode, if you encounter a file with a UTF8 BOM it will not be skipped over for you. So you may want to check if the first 3 bytes are EF BB BF.
    The files used are managed on my side so that shouldn't be a problem. I went for UTF-8 (no BOM) in the first place as some things I'd read suggested that would be more convenient for eventual portability issues. I think I'll probably just go for UTF-16 now, though.

  5. #5
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> Basically I thought they would be handy macros to redefine when/if I get to the point of looking at portability ...
    Well, as you may already know, the "TCHAR mechanism" allows you to write code that could be compiled for either Unicode (UTF16LE) or SBCS or MBCS. SBCS is just your good ol' codepage character set: 8 bits representing a glyph, where the codepage determines which glyph to use for each 8-bit value. MBCS can use multiple 8-bit values to represent a single glyph, as in Shift-JIS or GBK.

    If you have no need to support codepages, then use straight wchar_t's. This means your app will not be able to target 9x platforms (unless you link against MSLU), and your code won't support multi-byte codepages. 9x platforms are dead and Unicode is preferred over MBCS for new development. Thus my "forget TCHAR's" advocacy.

    >> That's the bit I don't quite get. Just because ...
    Didn't follow that second sentence. The point of "ccs=UTF-8" is so the CRT knows how to convert <file content encoding> => UTF16LE (wchar_t). Using the ccs mode switch will also automatically mark the stream as wide-oriented, so using byte-oriented functions on the stream (fgetc, etc.) is an error. You could use fread(), but that will just read wchar_t values as bytes (the underlying CRT code always does the conversion to UTF16LE).

    >> ... the one to three UTF8 bytes needed to produce the right Unicode codepoint.
    With the current Unicode standard, you can have up to 4 UTF8 bytes to represent a single Unicode character (the original UTF-8 RFC, RFC 2279, allowed up to 6 bytes; RFC 3629 later restricted it to 4). The CRT is fairly smart about "reading ahead" as needed. If the next sequence in the file is for a 4 byte UTF8 character, then the CRT will read the 4 bytes and convert it to 2 wchar_t's (two UTF16LE code units representing a single character). Two calls to fgetwc() will return the surrogate pair of wchar_t's representing that character.
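
    To make the conversion concrete, here is a rough sketch of decoding one UTF-8 sequence and re-encoding it as UTF-16. This illustrates the encoding rules only; it is not the CRT's actual code. The helper names utf8_decode_one and utf16_encode_one are hypothetical, and the decoder is deliberately lenient (it does not reject overlong forms or encoded surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (len bytes available) into *cp.
   Returns the number of bytes consumed, or 0 if the sequence is
   malformed or truncated. Sketch only: overlong forms and encoded
   surrogates are not rejected. */
size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t v;

    if (len == 0) return 0;
    if (s[0] < 0x80)                { v = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { v = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { v = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { v = s[0] & 0x07; need = 4; }
    else return 0;                  /* stray continuation or invalid lead */

    if (len < need) return 0;
    for (i = 1; i < need; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return need;
}

/* Encode cp as UTF-16 code units in out[]. Returns 1 for a BMP character,
   2 for a surrogate pair, or 0 if cp is above U+10FFFF. */
size_t utf16_encode_one(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    if (cp > 0x10FFFF) return 0;
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

    For example, the 3-byte sequence E6 9C A8 decodes to U+6728 and needs one UTF-16 unit, while the 4-byte sequence F0 9F 98 80 decodes to U+1F600 and needs the pair D83D DE00.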

    gg

  6. #6
    Registered User
    Join Date
    Mar 2009
    Posts
    46
    Quote Originally Posted by Codeplug
    >> That's the bit I don't quite get. Just because ...
    Didn't follow that second sentence. The point of "ccs=UTF-8" is so the CRT knows how to convert <file content encoding> => UTF16LE (wchar_t). Using the ccs mode switch will also automatically mark the stream as wide-oriented, so using byte-oriented functions on the stream (fgetc, etc.) is an error. You could use fread(), but that will just read wchar_t values as bytes (the underlying CRT code always does the conversion to UTF16LE).
    Well I was using _fgettc, which (as you point out) is just defined as fgetwc. Is that a "byte oriented" function? Because the problem I had is that the file content was not converted correctly into UTF16LE. Instead it seemed to be handled as if the content was already UTF-16 producing a string of mojibake (Japanese term for messed up characters).

    >> ... the one to three UTF8 bytes needed to produce the right Unicode codepoint.
    With the current Unicode standard, you can have up to 4 UTF8 bytes to represent a single Unicode character.
    I'm aware of that, but I think that 3 is all that is needed to cover standard characters, including the complete set of kanji (and hanzi) in general use.
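
    The three-byte figure follows from the UTF-8 ranges: every code point in the Basic Multilingual Plane (up to U+FFFF) fits in at most three bytes, and only supplementary-plane characters (for example the rarer ideographs starting at U+20000) need four. A small sketch, with utf8_length as a hypothetical helper name:

```c
#include <stdint.h>

/* Number of bytes UTF-8 needs for code point cp, or 0 if cp cannot be
   encoded (a surrogate value or anything above U+10FFFF). */
int utf8_length(uint32_t cp)
{
    if (cp < 0x80)                    return 1;  /* ASCII */
    if (cp < 0x800)                   return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp < 0x10000)                 return 3;  /* rest of the BMP */
    if (cp <= 0x10FFFF)               return 4;  /* supplementary planes */
    return 0;
}
```

    So utf8_length(0x6728) is 3, while anything at or above U+10000 reports 4.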

    The CRT is fairly smart about "reading ahead" as needed. If the next sequence in the file is for a 4 byte UTF8 character, then the CRT will read the 4 bytes and convert it to 2 wchar_t's (two UTF16LE code units representing a single character). Two calls to fgetwc() will return the surrogate pair of wchar_t's representing that character.
    That doesn't appear to be what happened in my case.

    Incidentally, what does CRT stand for?

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    CRT = C RunTime -> the libraries supplied with the compiler.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #8
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> Is that a "byte oriented" function?
    fgetwc() is a wide-oriented function. Here's a little documentation on the subject: msdn.microsoft.com/en-us/library/kb1at4by.aspx

    I had no problems with the following code:
    Code:
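    #define _CRT_SECURE_NO_WARNINGS
    #include <stdio.h>
    #include <wchar.h>   /* fgetwc, wint_t, WEOF */
    
    int main()
    {
        /* Text mode (no 'b'), so the CRT applies the ccs=UTF-8 conversion */
        FILE *f = fopen("data.txt", "r, ccs=UTF-8");
        if (!f)
            return 1;
    
        /* wint_t rather than wchar_t, so the WEOF comparison is reliable */
        for (wint_t c; (c = fgetwc(f)) != WEOF;)
            printf("%04X\n", (unsigned)c);
        
        fclose(f);
        return 0;
    }//main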
    The data.txt I used contained a single character: 木, U+6728, or "E6 9C A8" in UTF8 (file attached, no BOM). It worked OK with and without a BOM (not using 'b' binary mode).

    If you have a UTF8 file that causes this code to output unexpected codepoints, upload the file and I'll see if I also get unexpected results.

    I'm using a fully patched VS 2008, and accompanying CRT by the way.

    gg

  9. #9
    Registered User
    Join Date
    Dec 2008
    Location
    Black River
    Posts
    128
    Quote Originally Posted by Codeplug
    >> (Technically UCS-2)
    That is true for 98/ME, but Win2K and later platforms support true UTF16LE.
    My mistake. For some reason, I thought that the fixed width of wchar_t's made it impossible to work with surrogate pairs. Now I see that it's handled in the simplest way.

  10. #10
    Registered User
    Join Date
    Mar 2009
    Posts
    46
    Quote Originally Posted by Codeplug
    I had no problems with the following code:
    I also had no problems with that code.

    Obviously the problem lies in my code (which is a little complicated to post because relevant bits are scattered around the place). I'll see if I can write a suitable test case that produces the same problem (if I accidentally rewrite it so that the problem goes away that's even better ;-).

    (not using 'b' binary mode).
    Oh wait. That would appear to be exactly where the problem was. At least I can console myself that the error in my code was actually clearly shown in my first post.
    Last edited by PaulBlay; 05-27-2009 at 12:04 PM.

  11. #11
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Well that was an easy test that I didn't do.
    Using "rb, ccs=UTF-8" as the mode, the output was "9CE6".
    The contents of data.txt are "E6 9C A8".

    So 'b' effectively overrides any 'ccs=' setting, which I guess makes sense, since 'b' mode is referred to as "(untranslated) mode" on MSDN.
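
    The "9CE6" output is consistent with the first two bytes of the file (E6 9C) being taken as a single little-endian 16-bit unit with no UTF-8 decoding, leaving the third byte (A8) unpaired. A minimal sketch of that arithmetic, where read_le16 is a hypothetical helper and not a CRT function:

```c
#include <stdint.h>

/* Combine two bytes into one little-endian 16-bit value:
   the first byte is the low half, the second byte the high half. */
uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

    Applied to the bytes E6 9C this gives 0x9CE6, whereas a proper UTF-8 decode of E6 9C A8 gives 0x6728.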

    gg
