Quote:
Well I tried this on the code below, and changed printf("%s", buf) to printf("%.80s", buf), and guess what? It still looks exactly the same, so that means that printf IS AWARE of multibyte strings somehow, because 80 unicode characters are printed, instead of 80 bytes.
Your fgets call reads at most 80 bytes (strictly, one less than the size you pass, to leave room for the terminating null). Therefore printf("%s", buf) and printf("%.80s", buf) are equivalent here: both print at most 80 bytes. That may be fewer than 80 characters, because a multibyte character can occupy more than one byte.
I believe that the counts passed to fgets, printf, etc. are byte counts, while the counts passed to fgetws, wprintf, etc. are character counts (each wide character occupies a fixed number of bytes). The Windows API uses the same scheme. When you think about it, it is the most practical one, given that a multibyte character can take a variable number of bytes.
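To make the byte versus character distinction concrete, here is a minimal sketch. It assumes the buffer holds valid UTF-8; count_utf8_chars and truncate_bytes are hypothetical helper names, not library functions.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Count UTF-8 characters by skipping continuation bytes (10xxxxxx).
   Assumes s is valid UTF-8; hypothetical helper for illustration. */
size_t count_utf8_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Format at most `precision` bytes of s into out, exactly as
   printf("%.*s", precision, s) would print them. Returns the byte count. */
int truncate_bytes(char *out, size_t outsize, const char *s, int precision)
{
    return snprintf(out, outsize, "%.*s", precision, s);
}
```

For the string "h\xC3\xA9llo" (the word "héllo" encoded as UTF-8), strlen reports 6 bytes while count_utf8_chars reports 5 characters. A precision of 2 keeps only 2 bytes, which cuts the two-byte "é" in half.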
Quote:
And how the heck does printf know that the buf is composed of multibyte UTF-8 characters anyway?
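I think that printf, fgets, etc. work with both single-byte and multibyte character sets, because they simply pass bytes through unchanged. The character set is selected by the LC_CTYPE locale category, which the multibyte conversion functions consult. Remember that ASCII characters are encoded identically in UTF-8 (and in many other character sets).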
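One way to see what LC_CTYPE currently implies is to check MB_CUR_MAX, the maximum number of bytes per character in the active locale. This is a small sketch; max_bytes_per_char is a hypothetical helper name.

```c
#include <locale.h>
#include <stdlib.h>

/* Switch LC_CTYPE to the named locale and return MB_CUR_MAX for it.
   Returns 0 if the locale is not available on this system. */
size_t max_bytes_per_char(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;
    return MB_CUR_MAX;
}
```

Passing "C" yields 1. Passing "" selects the locale from the environment, and the result is greater than 1 when that locale uses UTF-8.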
Quote:
Also, I would like to store everything internally as UTF-32/UCS-4 (wchar_t is 32 bits in GLIBC) in wchar_t variables, and then convert the text to the current locale on output.
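I think you would do this by reading the file with fgets and then calling mbstowcs to convert each line to a wide-character (wchar_t) string. Note that mbstowcs decodes according to the current LC_CTYPE setting, so the locale must match the file's encoding.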
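A minimal sketch of that conversion follows. It assumes the POSIX behaviour where mbstowcs with a null destination returns the required length, and mb_to_wide is a hypothetical helper name.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string, encoded per the current LC_CTYPE locale,
   into a newly allocated wide string. Returns NULL if the input is not
   valid in that locale or if allocation fails. The caller frees it. */
wchar_t *mb_to_wide(const char *mb)
{
    /* First pass: measure (POSIX allows a NULL destination here). */
    size_t len = mbstowcs(NULL, mb, 0);
    if (len == (size_t)-1)
        return NULL;

    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Second pass: convert, including the terminating null. */
    mbstowcs(wide, mb, len + 1);
    return wide;
}
```

For UTF-8 input you would call setlocale(LC_CTYPE, "") first, with the environment set to a UTF-8 locale. In the default "C" locale only ASCII input is guaranteed to convert.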
Quote:
Well this kinda works, but the problem is that the output is fine even though the current locale (an ASCII locale like POSIX or C) is one that does not have most of the characters in the file. If setlocale is implemented correctly, it shouldn't go ahead and print those unicode characters when the locale is not UTF-8. If I open the same file in VIM with a POSIX locale, the file looks completely different than when using an UTF-8 locale. The output stays the same even though I remove setlocale() by the way, which leads me to my next question...
This I don't understand. The default locale should be "C". You could try setting the locale to "C" explicitly.
Quote:
And I still need to find a way to detect what kind of encoding is used on the file I'm trying to open.
Some Unicode files start with a byte order mark (BOM). If there is none, I think you need to ask the user or use heuristics. There is probably detection code out there; Windows, for example, provides the IsTextUnicode function.
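Checking for a BOM could look something like the sketch below. detect_bom is a hypothetical helper, and the returned encoding names are just labels chosen for illustration. The UTF-32 patterns are tested first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE one.

```c
#include <stddef.h>
#include <string.h>

/* Inspect the first bytes of a file and return the encoding implied by
   its byte order mark, or NULL if no BOM is present. */
const char *detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return "UTF-32BE";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return "UTF-32LE"; /* could also be UTF-16LE followed by U+0000 */
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return "UTF-16BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return "UTF-16LE";
    return NULL;
}
```

A NULL result only means there is no BOM. The file may still be UTF-8 or anything else, which is where asking the user or applying heuristics comes in.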
Disclaimer: This is my understanding. The documentation for this area of C is poor and often implementation-specific. Take this post with a grain of salt.