fread + non-english characters


  1. #1
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,167

    fread + non-english characters

    A question.
    Consider the code:
    Code:
    	FILE* f = fopen("settings.xml", "r");
    	char buf[10240];
    	fread(buf, 10240, 1, f);
    	fclose(f);
    This would be used to read in an entire file. The problem is that when there are international characters, they become all messed up in the reading process.
    Also, this is not my code. The tinyxml C++ library reads files this way and obviously it's causing me problems!

    So I was wondering what the best way to fix this would be?
    Preferably, I don't want to rewrite the tinyxml code, so workarounds are welcome.
    As a last resort, I may rewrite parts of it, but then the question would be what the appropriate way to read files with international characters would be?
    The file should be in UTF-8 (no byte order mark).
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.
    For information on how to enable C++11 on your compiler, look here.
    Listen well! I am a genius, you know! ^_^

  2. #2
    Registered User
    Join Date
    Dec 2006
    Location
    Scranton, Pa
    Posts
    252
    I know nothing about different character sets, but in C shouldn't you at least be reading the file as binary, i.e. "rb" rather than "r"?

  3. #3
    Registered User
    Join Date
    Dec 2008
    Location
    Black River
    Posts
    128
    How do they get messed up? UTF-8 has no endianness issues, so fread should work just fine.

  4. #4
    Registered User
    Join Date
    Mar 2010
    Posts
    109
    I think it's more likely that they are messed up in the output process since you do have all the data, just not how the display mechanism is expecting it.

    One way to do it: as the bytes are read in, examine each one to see which range of values it falls in. That tells you whether the following byte continues the same character or starts a new one. You can then store each decoded character in a wchar_t array, so you have a direct representation of the characters and not just the bytes in the file.
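
    A minimal sketch of the byte-range approach described above. The function name decode_utf8 is made up for illustration, and it stores code points in char32_t rather than wchar_t, because wchar_t is only 16 bits wide on Windows and cannot hold every code point in one element. Malformed sequences become U+FFFD; overlong forms and surrogates are not rejected, so this is not a full validator.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into one code point per element (hypothetical helper).
// The lead byte's high bits say how many continuation bytes follow.
std::u32string decode_utf8(const std::string& bytes) {
    const char32_t kReplacement = 0xFFFD;  // used for malformed input
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {                    // 0xxxxxxx: plain ASCII
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {   // 110xxxxx: 2-byte sequence
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {   // 1110xxxx: 3-byte sequence
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {   // 11110xxx: 4-byte sequence
            cp = lead & 0x07; extra = 3;
        } else {                              // stray continuation or invalid lead
            out.push_back(kReplacement);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }       // truncated
            unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }      // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : kReplacement);
    }
    return out;
}
```

    Decoding the nine UTF-8 bytes of "Våglära" this way yields seven elements, one per character.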

  5. #5
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    I suppose I could try this to find out if it works, but it's more fun to make you do it: you've got all the bytes in your buffer. Could you put that in a stringstream and then read out of that stream in the usual way?

  6. #6
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,167
    The data is somehow changed. I know this is the case, as when it reads "Våglära" and I compare it to "Våglära", I don't get the same match. I.e., they're not equal.
    Further research suggests:
    - The data is "corrupt" inside the internal FILE structure.
    - fgets does not read data correctly.
    - Using unsigned char doesn't help.

  7. #7
    Registered User
    Join Date
    Mar 2010
    Posts
    109
    If you are getting corrupt data by reading in, I think Oldman47 is correct when suggesting you use read binary, otherwise it's doing some wonky business assuming ASCII text.

    Edit: Otherwise, pass the proper encoding to fopen, but that will require use of wchar stuff to store things.
    Last edited by syzygy; 04-19-2010 at 04:38 PM.

  8. #8
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    3,797
    Just out of curiosity, have you tried setting the `TiXmlEncoding' parameter to `TIXML_ENCODING_UTF8'?

    How are you comparing those strings?

    Soma

  9. #9
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,607
    On Windows you'll get CRLF->LF conversions (when not opened in binary mode), and a Ctrl-Z (0x1A) byte in the file is treated as EOF. (Ctrl-D on *nix is handled by the terminal, not by the file stream, so it only matters for interactive input.)

    Other than that, the data shouldn't be "changed" or interpreted.

    >> and I compare it to "Våglära", I don't get the same match
    It's not hard to get this wrong. Especially when you put non-basic character literals directly in your source code - at which point you're in implementation defined territory.

    gg
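
    A hedged sketch of reading the file as raw bytes, for anyone who wants to check what actually arrives in the buffer. The names read_file_bytes and to_hex are invented for this example and are not part of TinyXML. Note that in the original snippet fread(buf, 10240, 1, f) asks for one item of 10240 bytes, so for a shorter file it returns 0 and the byte count is unknown; swapping the size and count arguments makes the return value the number of bytes read.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Read a whole file as raw bytes (hypothetical helper). "rb" turns off the
// CRLF->LF translation and Ctrl-Z handling that Windows applies in text mode.
// Returns false if the file cannot be opened or a read error occurs.
bool read_file_bytes(const char* path, std::string& out) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    out.clear();
    char buf[4096];
    std::size_t n;
    // size = 1, count = sizeof buf: fread returns the number of bytes read.
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        out.append(buf, n);
    bool ok = !std::ferror(f);
    std::fclose(f);
    return ok;
}

// Render bytes as space-separated hex, e.g. "56 C3 A5", so the buffer can be
// compared against what a hex editor shows for the file.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string hex;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (i) hex += ' ';
        hex += digits[c >> 4];
        hex += digits[c & 0x0F];
    }
    return hex;
}
```

    If the hex dump of the buffer matches the hex editor's view of the file, the read itself is not what changes the data.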

  10. #10
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,167
    Quote Originally Posted by syzygy View Post
    If you are getting corrupt data by reading in, I think Oldman47 is correct when suggesting you use read binary, otherwise it's doing some wonky business assuming ASCII text.

    Edit: Otherwise, pass the proper encoding to fopen, but that will require use of wchar stuff to store things.
    Just reading binary into char doesn't help. Just tried that. Will try more later.

    Quote Originally Posted by phantomotap View Post
    Just out of curiosity, have you tried setting the `TiXmlEncoding' parameter to `TIXML_ENCODING_UTF8'?

    Soma
    I don't know where that parameter would go.
    I have tried debugging the TinyXML code, however, and it does detect and set UTF-8 encoding properly.

    >> and I compare it to "Våglära", I don't get the same match
    It's not hard to get this wrong. Especially when you put non-basic character literals directly in your source code - at which point you're in implementation defined territory.
    Yes, I'm aware. I would use Unicode (the rest of the project is), but unfortunately, by some misfortune, TinyXML doesn't support wchar_t.
    I'm not an expert on the whole codepage field, though, so that's why I'm asking.
    Last edited by Elysia; 04-19-2010 at 04:46 PM.

  11. #11
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    3,797
    I would use Unicode (the rest of the project is), but unfortunately, by some misfortune, TinyXML doesn't support wchar_t.
    ^_^

    Wow. You are squarely in Microsoft's pocket.

    What you are missing (or what you failed to answer): the compiler may store the string in a different UTF8 binary representation than what "TinyXML" produces. The string may be seven bytes from one source and nine bytes (I think) from another. Again, how are you comparing these strings? What, exactly, do you mean by "not equal"?

    Soma
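
    To make the seven-versus-nine-byte point concrete, here is a small sketch with the bytes written out as hex escapes, so nothing depends on the source file's encoding. The constant names are made up. The third, decomposed form is an extra possibility not mentioned above: the same text with combining marks instead of precomposed letters.

```cpp
#include <cstring>

// "Våglära" spelled out byte by byte (illustrative constants).
// Latin-1 / Windows-1252: one byte per character, 7 bytes.
const char kAnsi[] = "V\xE5gl\xE4ra";
// UTF-8, precomposed: å = C3 A5, ä = C3 A4, 9 bytes.
const char kUtf8[] = "V\xC3\xA5gl\xC3\xA4ra";
// UTF-8, decomposed: 'a' followed by a combining mark (CC 8A, CC 88), 11 bytes.
const char kUtf8Decomposed[] = "Va\xCC\x8Agla\xCC\x88ra";

// strcmp compares bytes, not characters, so these spellings never match
// each other even though they display as the same word.
bool same_bytes(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}
```

    A hex escape consumes every hex digit that follows it, so when the next character is itself a hex digit, split the literal ("\xC3" "A") to end the escape.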

  12. #12
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,167
    Apparently, the comparison is done via strcmp (again, what TinyXML uses). So basically I read some data from the XML-file and compare it to a directory name (that should be encoded in Unicode).
    I believe I have found the issue. It's probably an ANSI/UTF-8 mismatch. If I save the file in ANSI encoding, it looks "right". If I save it in UTF-8 encoding, it looks "wrong".
    This might be because VS is interpreting char as ANSI. But I did notice the same thing when I examined it in a hex editor.
    My main theory so far is that TinyXML reads raw UTF-8 data, yet strings (char*) are encoded in ANSI. I have to verify this later.

    I have been able to make it work when saving the file as ANSI (yet stating UTF-8 encoding in the xml header).
    Apparently, 'å' and L'å' yield the same "magic number", so this works (since I am internally using wchar_t for my strings and hence have to convert to char when comparing to or from data from the xml file).
    However, I feel like I am in deep water here.
    Last edited by Elysia; 04-19-2010 at 05:14 PM.
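
    If the mismatch is ANSI on one side and UTF-8 on the other, an alternative to re-saving the file is to convert the ANSI string before comparing. A rough sketch, assuming the ANSI data is plain ISO-8859-1 (Latin-1): Windows-1252 matches Latin-1 for 'å' and 'ä' but differs in the 0x80-0x9F range, which this does not handle. The function name is hypothetical.

```cpp
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8 (hypothetical helper).
// Each Latin-1 byte value equals its code point, so bytes below 0x80 are
// copied and bytes from 0x80 up become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char ch : in) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // 110000xx
            out += static_cast<char>(0x80 | (c & 0x3F));  // 10xxxxxx
        }
    }
    return out;
}
```

    With this, the 7-byte ANSI spelling of "Våglära" becomes the same 9 bytes that a UTF-8 file contains, and a byte-wise comparison such as strcmp can succeed.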

  13. #13
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    3,797
    This might be because VS is interpreting char as ANSI.
    Is it a nice pocket? ^_^v

    Apparently, the comparison is done via strcmp.
    O_o

    If "TinyXML" uses `strcmp' it is probably safe to assume that the library does not touch the binary representation.

    I'd advise you to always encode a character using the hexadecimal escape sequences for any value that isn't in the seven bit `ASCII' range.

    You need to check that the string stored in the executable has the same binary representation that you have in the "XML" file.

    Soma

  14. #14
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,167
    I've just confirmed that converting L"Våglära" to UTF-8 creates an equal string that was stored inside the file.
    I suppose I can use std::string to hold the UTF-8, and it should work. I just cannot expect to get any sensible output. That's my next test.
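
    For illustration, a portable sketch of the code-point-to-UTF-8 direction, with the result held in a std::string. This is not the conversion used in the post above, which is not shown. It takes char32_t input as an assumption: a Windows wchar_t string is UTF-16, so surrogate pairs would have to be combined into code points first. The function names are invented.

```cpp
#include <string>

// Append the UTF-8 encoding of one code point (hypothetical helper).
// Surrogates and values above U+10FFFF are encoded as U+FFFD instead.
void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole string of code points as UTF-8 bytes.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

    The resulting std::string can be handed to byte-oriented code such as strcmp; printing it only looks right on a console or viewer that interprets the bytes as UTF-8.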

  15. #15
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    3,797
    I've just confirmed that converting L"Våglära" to UTF-8 creates an equal string that was stored inside the file.
    O_o

    May we assume you are using the `MultiByteToWideChar' API?

    You should be careful with that function; it is a nightmare.

    I just cannot expect to get any sensible output. That's my next test.
    I can't help but think that you are going about this the wrong way.

    True `UTF8' strings have no embedded nulls. You should get what you set.

    Soma
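
    One thing to keep in mind when holding UTF-8 in a std::string or char array: the length functions count bytes, not characters. A short sketch (the function name is made up) that counts code points by skipping continuation bytes, assuming the input is well-formed UTF-8:

```cpp
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 (hypothetical helper).
// Continuation bytes have the form 10xxxxxx; every other byte starts a
// new code point, so counting the non-continuation bytes is enough.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

    For the UTF-8 bytes of "Våglära", size() reports 9 while this reports 7. Because no UTF-8 sequence other than U+0000 itself contains a zero byte, the same data also survives null-terminated char* interfaces unchanged.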
