fread + non-english characters


  1. #16
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Posts
    22,918
    Quote Originally Posted by phantomotap View Post
    O_o

    May we assume you are using the `MultiByteToWideChar' API?

    You should be careful with that function; it is a nightmare.
    Yes. If you have a better suggestion, I'm all ears.

    I can't help but think that you are going about this the wrong way.

    True `UTF8' strings have no embedded nulls. You should get what you set.
    My strings are all UTF-16, and in UTF-16, L'' is encoded as a one byte constant, but in UTF-8, it's encoded as two constants. Just copying wchar_t to char isn't going to work. I need to convert my UTF-16 strings to UTF-8 and back.
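    For reference, the conversion itself can be sketched in portable C++ without the Windows API. This is only an illustration, not the code used in this thread: `utf16_to_utf8` is a hypothetical helper, it takes a `std::u16string` rather than `wchar_t` so that it behaves the same on every platform, and it replaces unpaired surrogates with U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the thread): encode UTF-16 code units as UTF-8.
// Unpaired surrogates are replaced with U+FFFD (the replacement character).
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate, if any.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // stray low surrogate
        }

        if (cp < 0x80) {                       // 1 octet: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 octets
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 octets
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 octets
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

    With this, `utf16_to_utf8(u"\u00e5")` yields the two octets C3 A5, while plain ASCII passes through as one octet per character.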

    What I meant by sensible output is that if I try to print a UTF-8 string, it's not always going to look pretty. But the data is encoded correctly, so as long as I don't print it, I should be fine.

    Well, good news is: it works. I made sure to convert all non-UTF16 string literals into UTF-8 ones.
    Last edited by Elysia; 04-19-2010 at 06:43 PM.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  2. #17
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    4,370
    If you have a better suggestion, I'm all ears.
    There is nothing wrong with the mechanism it performs. The problems come from how easy it is to use it incorrectly. I strongly suggest you read the anecdotes to be found online.

    My strings are all [...] UTF-16 strings to UTF-8 and back.
    O_o

    No. `UTF16' is a 16 bit (two octet) encoding scheme; every `UTF16' character is two or four octets.

    What I meant by sensible [...] print it, I should be fine.
    O_o

    Okay. I'm tired of this crap. You win. I'll tell you your problem.

    "ANSI" is not a character set.
    `ASCII' is a character set.
    "UNICODE" is an umbrella term for a massive character set and data structures and algorithms related to that character set.
    "UNICODE" inherits, for lack of a better word, the `ASCII' character set.
    `UTF8' is an encoding scheme for "UNICODE".
    `UTF8' will represent any `ASCII' character in one octet.
    `Windows-1252' is a character set.

    And why you are having your problem: `Windows-1252' is not `ASCII'.

    Or to put it another way: `Windows-1252' is not byte for byte compatible with `UTF8'.
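    To make the incompatibility concrete, here is a rough sketch of a structural UTF-8 check. `looks_like_utf8` is a hypothetical helper, not something from this thread, and it is deliberately simplified: it validates lead and continuation bytes but does not catch every overlong or surrogate encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` has the byte structure of UTF-8
// (valid lead bytes, each followed by the right number of continuation
// bytes). It rejects the lead bytes 0xC0/0xC1 and anything above 0xF4,
// but does not check every overlong or surrogate case.
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (b < 0x80)                    extra = 0; // ASCII, one octet
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return false; // stray continuation byte or invalid lead byte

        // The sequence must not run past the end of the buffer.
        if (extra > s.size() - i - 1) return false;

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

    In `Windows-1252' the single byte E5 is the letter U+00E5. To a UTF-8 decoder that same byte is the start of a three-octet sequence, so `looks_like_utf8("\xE5")` is false, whereas the real UTF-8 form `"\xC3\xA5"` passes. Pure `ASCII' text passes either way, which is why the problem only shows up with non-English characters.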

    Soma

  3. #18
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Posts
    22,918
    *shrug* If you feel that way...
    I will only tell you that Notepad has an option to save in "ANSI" code page. This is 99% likely wrong, but that's where I got part of the "ANSI" from anyway. Blame M$.

  4. #19
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    4,370
    This isn't a matter of "feeling". "ANSI" isn't a character set and `Windows-1252' is not compatible with `UTF8'. It is fact.

    I do blame Microsoft. They have stunted the growth of countless content creators who want to understand language and locale by disguising important issues.

    Soma

  5. #20
    Registered User
    Join Date
    Mar 2010
    Posts
    109
    Oh, hey, since you mentioned headerless UTF-8 files, watch out for Notepad if you are using it to convert encodings. I've had it stick headers in files randomly (I'm sure there was a rhyme to it, but I never figured it out). VS is a lot better in my experience.
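    If a header does sneak in, it is easy to detect after reading the file: the UTF-8 "header" (byte order mark) is the three octets EF BB BF at the very start. A minimal sketch, assuming the file is small enough to hold in memory; `read_file_binary` and `strip_utf8_bom` are made-up names for illustration:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Read a whole file in binary mode ("rb") so the runtime performs no
// newline or EOF-character translation. Returns an empty string if the
// file cannot be opened.
std::string read_file_binary(const char* path)
{
    std::string data;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return data;

    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        data.append(buf, n);

    std::fclose(f);
    return data;
}

// Remove a leading UTF-8 byte order mark (EF BB BF) if present.
// Returns true if a BOM was found and removed.
bool strip_utf8_bom(std::string& data)
{
    if (data.size() >= 3 && data.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        data.erase(0, 3);
        return true;
    }
    return false;
}
```

    Calling `strip_utf8_bom` on the buffer before handing it to the parser makes the code indifferent to whether the editor wrote the header or not.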

  6. #21
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Posts
    22,918
    Quote Originally Posted by phantomotap View Post
    I do blame Microsoft. They have stunted the growth of countless content creators who want to understand language and locale by disguising important issues.
    Then we are in agreement!
    Quote Originally Posted by syzygy View Post
    Oh, hey, since you mentioned headerless UTF-8 files, watch out for Notepad if you are using it to convert encodings. I've had it stick headers in files randomly (I'm sure there was a rhyme to it, but I never figured it out). VS is a lot better in my experience.
    Yeah, I'm using VS to edit the XML file. After all, it's one heck of a better editor for XML files than notepad!
    Last edited by Elysia; 04-19-2010 at 07:06 PM.

  7. #22
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,691
    >> And terminal-EOF characters are interpreted as EOF (that's Ctrl-D on *nix and Ctrl-Z on Windows).
    To clear up my own statement, I do not believe *nix implementations perform any special "terminal-EOF" interpretation when reading from a file. I don't think there's any real difference between "text mode" and "binary mode" in that case.

    >> Yes, I'm aware. I would use Unicode ... TinyXML doesn't support wchar_t.
    It's not a matter of using wchar_t for the literals. When your source code contains a character literal that isn't a member of the "basic source character set", you're in implementation defined territory.
    FWIW, I once attempted to put this issue in terms of the C++ standard, and how MSVC 6.0 and 2008 handle it: see the CodeGuru Forums post "ConvertStringToBSTR lost char".

    >> Especially when you put non-basic character literals directly in your source code - at which point you're in implementation defined territory.
    It's worse than that. This can easily lead to trouble with multiple developers each using their own choice of editor.
    Why did you check-in that change with L"V?gl?ra"
    [or]
    Why did you check-in that change with L"Vĺglra" (UTF8 bytes interpreted as ISO-8859-2, then saved in some other format - or copy/pasted, etc...)
    >> What I meant by sensible output is that if I try to print a UTF-8 string, it's not always going to look pretty
    Another FWIW:
    how do i printf("") to the console?
    The "_setmode(_fileno(stdout), _O_U16TEXT)" trick also works for std::wcout in the MSCRT that comes with 2008.

    gg

  8. #23
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Posts
    22,918
    This is even worse than I thought. So as soon as I use string literals inside the editor, we're in implementation defined land.
    So barring that, I see two things. How many/which compilers treat wchar_t as UTF16(LE)? Is it bad?
    Should I try keeping the data encoded in a separate section, ie a resource file and load them from there? Can this avoid implementation defined land?

  9. #24
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    How many/which compilers treat wchar_t as UTF16(LE)?
    All Windows compilers that I know of, including MinGW GCC. On the other hand, all Linux compilers treat it as UTF-32 with platform endianness.
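    A quick way to see the difference is to count how many wchar_t units a character outside the Basic Multilingual Plane takes. This is just a sketch; `units_for_non_bmp` is a made-up name, and U+1F600 is an arbitrary code point above U+FFFF:

```cpp
#include <cstddef>
#include <cwchar>

// Number of wchar_t units the compiler emits for U+1F600, a code point
// outside the Basic Multilingual Plane. With a 16-bit wchar_t (UTF-16)
// this is a surrogate pair, so 2; with a 32-bit wchar_t (UTF-32) it is 1.
std::size_t units_for_non_bmp()
{
    return std::wcslen(L"\U0001F600");
}

// True where wchar_t is 16 bits wide, as on Windows compilers.
bool wchar_is_16_bit()
{
    return sizeof(wchar_t) == 2;
}
```

    Code that assumes one wchar_t per character, or that writes wchar_t buffers straight to disk, will therefore behave differently on the two platforms.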

    Keeping string data in resource files is a good idea.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  10. #25
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,691
    All Windows compilers should treat wchar_t literals as UTF16LE during execution. However, MSVC didn't understand source code that's encoded in something other than the user's ACP until 2005 (I believe). (Well, you could tell the compiler which code page to use via #pragma setlocale, but the source file is still code-page encoded.)

    In the 2008 IDE, you have File->Advanced Save Options, which lets you save files as, for example, UTF8-with-BOM. The problem is when someone else uses an editor that doesn't understand that - or tries to use that code with a different compiler etc.

    Resource files are fine; however, you'll be representing the strings in the same way that you should be representing them in your source code, using only the basic source character set:
    Code:
    STRINGTABLE
    BEGIN
    IDS_CHINESESTRING L"\x5e2e\x52a9"
    IDS_RUSSIANSTRING L"\x0421\x043f\x0440\x0430\x0432\x043a\x0430"
    IDS_ARABICSTRING L"\x062a\x0639\x0644\x064a\x0645\x0627\x062a"
    END
    Reference: STRINGTABLE Resource (Windows)

    VC++ 2003 added support for universal character names in identifiers and literals (C99 feature). So you could use \u instead of \x in source files.

    GCC has -finput-charset, -fexec-charset, -fwide-exec-charset - but it's a good idea to stick with the defaults for the platform, and just use the basic source character set here as well.
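    In C++ source the same idea looks roughly like this. It is only a sketch: `russian_help` is a made-up function name, and the string is the IDS_RUSSIANSTRING value from the resource example above, respelled with universal character names so the file contains nothing outside the basic source character set.

```cpp
#include <string>

// Same code points as IDS_RUSSIANSTRING in the resource example, written
// with \u universal character names. The source file stays pure ASCII, so
// it does not matter which encoding an editor saves it in.
std::wstring russian_help()
{
    return L"\u0421\u043f\u0440\u0430\u0432\u043a\u0430";
}
```

    One practical advantage of \u over \x: a \u escape always takes exactly four hex digits, whereas a \x escape keeps consuming hex digits, so a \x escape followed directly by a letter a-f or a digit silently changes meaning.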

    gg

  11. #26
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Posts
    22,918
    Oh, they have indeed made portable international programming a pain.
    I'll stick to Windows compliance with encoding strings via wide chars and assuming UTF16(LE).
