Unicode trouble in Linux

06-25-2005
^xor

Unicode trouble in Linux

I'm trying to read this file (in this case UTF-8 encoded) and then print it to stdout.

The problem is that I don't know how to detect and tell fgetws what encoding is used in the file. Also, I would like to store everything internally as UTF-32/UCS-4 (wchar_t is 32 bits in GLIBC) in wchar_t variables, and then convert the text to the current locale on output.

This is what I've got so far, but it only prints garbage (my locale: LC_CTYPE=en_US.UTF-8).

Code:

#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    const char filename[] = "UTF-8-demo.txt";
    wchar_t buf[80];
    FILE *fp;

    if (!setlocale(LC_ALL, "")) {
        fprintf(stderr, "Failed to set the specified locale\n");
        return 1;
    }

    fp = fopen(filename, "r");
    if (!fp) {
        perror(filename);
        return 1;
    }

    while (fgetws(buf, 80, fp))
        wprintf(L"%s", buf);

    return 0;
}
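
One likely cause of the garbage is the conversion specifier rather than fgetws: in the wide printf family, %s expects a char string, while a wchar_t buffer needs %ls. Here is a minimal sketch of the read loop with that one change, assuming glibc. The helper name copy_wide_lines is hypothetical, and it takes an explicit output stream so that no stream is used for both byte and wide output.

```c
#include <stdio.h>
#include <wchar.h>

/* Copy a text stream chunk by chunk through a wchar_t buffer.
 * copy_wide_lines is a hypothetical helper name, not a library function.
 * Returns the number of chunks copied, or -1 on a read or write error. */
long copy_wide_lines(FILE *in, FILE *out)
{
    wchar_t buf[80];
    long chunks = 0;

    /* fgetws takes the buffer size in wide characters, not in bytes. */
    while (fgetws(buf, sizeof buf / sizeof buf[0], in)) {
        /* %ls expects a wchar_t string; %s would expect a char string. */
        if (fwprintf(out, L"%ls", buf) < 0)
            return -1;
        chunks++;
    }
    return ferror(in) ? -1 : chunks;
}
```

In the program above this would be called as copy_wide_lines(fp, stdout) after the setlocale(LC_ALL, "") call, which is what tells fgetws how the bytes in the file are encoded.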

Also, I have to compile with -std=c99 instead of -ansi for some reason, even though the man pages for wprintf etc. say that they conform to "ISO/ANSI C, UNIX98".
06-25-2005
^xor

I think I've figured out why the code doesn't work. fgetws is reading wide characters (UTF-32 in GLIBC) from the file instead of multibyte characters (UTF-8), and thus messing up the output. However, straight from the fgetws man page:

Quote:

In the absence of additional information passed to the fopen call, it
is reasonable to expect that fgetws will actually read a multibyte
string from the stream and then convert it to a wide character string.

Now what the heck does that mean? What additional information passed to fopen would that be?
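
For what it's worth, the "additional information" is probably the glibc-specific ",ccs=" suffix in the fopen mode string, which names the file's character set and marks the stream as wide-oriented. A minimal sketch, assuming glibc (other C libraries may ignore or reject the suffix); read_line_ccs is a hypothetical helper name:

```c
#include <stdio.h>
#include <wchar.h>

/* Read the first line of `path` as wide characters, decoding the file as
 * `charset` instead of the current locale's encoding. Relies on the
 * glibc-specific ",ccs=" extension to the fopen mode string.
 * read_line_ccs is a hypothetical helper name.
 * Returns 0 on success, -1 on failure. */
int read_line_ccs(const char *path, const char *charset,
                  wchar_t *dst, int dst_len)
{
    char mode[64];
    FILE *fp;
    int ok;

    /* Build a mode such as "r,ccs=UTF-8". */
    if (snprintf(mode, sizeof mode, "r,ccs=%s", charset) >= (int)sizeof mode)
        return -1;

    fp = fopen(path, mode);
    if (!fp)
        return -1;

    /* The stream is wide-oriented, so only wide I/O functions apply. */
    ok = fgetws(dst, dst_len, fp) != NULL;
    fclose(fp);
    return ok ? 0 : -1;
}
```

Without the suffix, fgetws decodes the bytes according to the LC_CTYPE locale in effect, which is why the setlocale call matters in the first program.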

And I still need to find a way to detect what kind of encoding is used on the file I'm trying to open.
06-25-2005
^xor

Well this kinda works, but the problem is that the output is fine even though the current locale (an ASCII locale like POSIX or C) is one that does not have most of the characters in the file. If setlocale is implemented correctly, it shouldn't go ahead and print those unicode characters when the locale is not UTF-8. If I open the same file in VIM with a POSIX locale, the file looks completely different from when using a UTF-8 locale. The output stays the same even if I remove setlocale() by the way, which leads me to my next question...

But that is expected since, if I understand it correctly, fgets/printf only read/print a stream of n bytes and have no concept of encodings or multibyte characters. So %.10s with printf on a multibyte string should print 10 bytes, not 10 unicode characters. Well I tried this on the code below, and changed printf("%s", buf) to printf("%.80s", buf), and guess what? It still looks exactly the same, so that means that printf IS AWARE of multibyte strings somehow, because 80 unicode characters are printed, instead of 80 bytes. The real size of every line is in the range [80, 80*MB_LEN_MAX] bytes, so there's no way that should work if printf only counts bytes. And how the heck does printf know that the buf is composed of multibyte UTF-8 characters anyway?

Code:

#include <errno.h>
#include <locale.h>
#include <stdio.h>

int main()
{
    const char *filename = "UTF-8-demo.txt";
    char buf[80];
    FILE *fp;

    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Failed to set the specified locale\n");
        return 1;
    }

    fp = fopen(filename, "r");
    if (!fp) {
        perror(filename);
        return 1;
    }

    while (fgets(buf, 80, fp))
        printf("%s", buf);

    return 0;
}
06-25-2005
anonytmouse

Quote:

Well I tried this on the code below, and changed printf("%s", buf) to printf("%.80s", buf), and guess what? It still looks exactly the same, so that means that printf IS AWARE of multibyte strings somehow, because 80 unicode characters are printed, instead of 80 bytes.

You read in a maximum of 80 bytes with your fgets call. Therefore, printf("%s", buf) and printf("%.80s", buf) are equivalent, printing a maximum of 80 bytes. This may be less than 80 characters as a multi-byte character may be more than one byte.

I believe that the numbers provided to fgets, printf, etc refer to a byte count while the numbers passed to fgetws, wprintf, etc refer to a character count (each wide character is a fixed number of bytes). This is the same scheme used by the Windows API. When you think about it, it is the most practical scheme, given the variable number of bytes that can make up a multibyte character.
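
To make the byte-counting concrete: fgets(buf, 80, fp) stores at most 79 bytes plus the terminator, so a %.80s precision can never truncate such a buffer. Long lines are simply delivered over several fgets calls. The sketch below shows a precision cutting a UTF-8 string by bytes, even through the middle of a character. Both helper names are hypothetical, and utf8_length assumes its input is valid UTF-8.

```c
#include <stdio.h>
#include <string.h>

/* Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
 * Assumes `s` holds valid UTF-8; utf8_length is a hypothetical helper. */
size_t utf8_length(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Format `src` into `dst`, keeping at most `max_bytes` bytes via "%.*s".
 * The precision of %s counts bytes, so a multibyte character can be cut
 * in half. Returns the number of bytes produced, excluding the terminator.
 * format_truncated is a hypothetical helper. */
int format_truncated(char *dst, size_t dst_len, const char *src, int max_bytes)
{
    return snprintf(dst, dst_len, "%.*s", max_bytes, src);
}
```

For the six-byte string "h\xc3\xa9llo" (five characters), a precision of 3 keeps "h" plus both bytes of the accented letter, while a precision of 2 keeps "h" and only the first byte of it, which is no longer valid UTF-8.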

Quote:

And how the heck does printf know that the buf is composed of multibyte UTF-8 characters anyway?

I think that printf, fgets, etc support both single byte and multibyte character sets. The character set is specified by LC_CTYPE. Remember that ASCII characters are the same in the UTF-8 character set (and many other character sets).
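
A quick way to see which character set LC_CTYPE actually selects is to ask the library. This is a small sketch assuming a POSIX system that provides nl_langinfo; the two helper names are hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>

/* Select `locale_name` for LC_CTYPE and report its character encoding.
 * Pass "" to use the environment (LANG, LC_ALL, LC_CTYPE).
 * Returns NULL if the locale is not available on this system.
 * codeset_for_locale is a hypothetical helper name. */
const char *codeset_for_locale(const char *locale_name)
{
    if (!setlocale(LC_CTYPE, locale_name))
        return NULL;
    /* nl_langinfo(CODESET) names the encoding of the current locale. */
    return nl_langinfo(CODESET);
}

/* Upper bound on the bytes per character in the current LC_CTYPE locale.
 * max_bytes_per_char is a hypothetical helper name. */
size_t max_bytes_per_char(void)
{
    return MB_CUR_MAX;
}
```

Printing codeset_for_locale("") at program start is a cheap sanity check that the process really is running with the encoding you expect.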

Quote:

Also, I would like to store everything internally as UTF-32/UCS-4 (wchar_t is 32 bits in GLIBC) in wchar_t variables, and then convert the text to the current locale on output.

I think you would do this by reading in your file using fgets and then using the mbstowcs function to convert to a native unicode string.
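
A minimal sketch of that conversion step, assuming setlocale has already selected the encoding the bytes are in; to_wide is a hypothetical helper name. One caveat: a fixed-size fgets buffer can end in the middle of a multibyte character, so this is only safe on complete lines (or use mbrtowc, which keeps state between calls).

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string (encoded per the current LC_CTYPE locale)
 * into wide characters, always terminating the result.
 * Returns the number of wide characters stored, or (size_t)-1 if `mb` is
 * not valid in the locale's encoding or `dst_len` is 0.
 * to_wide is a hypothetical helper name. */
size_t to_wide(const char *mb, wchar_t *dst, size_t dst_len)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;

    /* Reserve one slot: mbstowcs does not write a terminator when the
     * output fills up exactly. */
    n = mbstowcs(dst, mb, dst_len - 1);
    if (n == (size_t)-1)
        return n;

    dst[n] = L'\0';
    return n;
}
```

The reverse direction on output would be wcstombs, or simply printing the wide string with %ls and letting the library convert to the locale's encoding.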

Quote:

Well this kinda works, but the problem is that the output is fine even though the current locale (an ASCII locale like POSIX or C) is one that does not have most of the characters in the file. If setlocale is implemented correctly, it shouldn't go ahead and print those unicode characters when the locale is not UTF-8. If I open the same file in VIM with a POSIX locale, the file looks completely different from when using a UTF-8 locale. The output stays the same even if I remove setlocale() by the way, which leads me to my next question...

This I don't understand. The default locale should be "C". You could try setting the locale to "C" explicitly.

Quote:

And I still need to find a way to detect what kind of encoding is used on the file I'm trying to open.

Some unicode files contain a byte order mark (BOM). If not, I think you need to ask the user or use heuristics. There is probably detection code out there; Windows provides the IsTextUnicode function.
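
BOM sniffing itself is only a few comparisons on the first bytes of the file. A sketch, with detect_bom and enum bom_kind as hypothetical names; note that a missing BOM proves nothing, since most UTF-8 files on Linux have none.

```c
#include <stddef.h>
#include <string.h>

enum bom_kind {
    BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF32_LE, BOM_UTF32_BE
};

/* Identify a byte order mark at the start of `buf`.
 * detect_bom and enum bom_kind are hypothetical names for illustration.
 * UTF-32 is tested before UTF-16 because the UTF-32LE mark (FF FE 00 00)
 * begins with the UTF-16LE mark (FF FE). */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return BOM_UTF32_LE;
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return BOM_UTF32_BE;
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return BOM_UTF8;
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return BOM_UTF16_LE;
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

The UTF-32LE check is itself a guess: a UTF-16LE file whose first character after the BOM is U+0000 has the same four leading bytes.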

Disclaimer: These are my understandings. The documentation for this area of C is poor and often implementation specific. Take this post with a grain of salt.
06-25-2005
^xor

> You read in a maximum of 80 bytes with your fgets call. Therefore, printf("%s", buf) and printf("%.80s", buf) are equivalent, printing a maximum of 80 bytes. This may be less than 80 characters as a multi-byte character may be more than one byte.

But the output was exactly the same as without the precision modifier. I know that the 80 characters on each line use up more than 80 bytes (since a lot of them are multibyte), so how do you explain that the output is identical (no truncation)?

> I believe that the numbers provided to fgets, printf, etc refer to a byte count while the numbers passed to fgetws, wprintf, etc refer to a character count (each wide character is a fixed number of bytes).

I thought so too, but apparently printf does not work like this. fgets counts bytes though (I checked).

> I think that printf, fgets, etc support both single byte and multibyte character sets. The character set is specified by LC_CTYPE. Remember that ASCII characters are the same in the UTF-8 character set (and many other character sets).

But the output is identical even with LC_CTYPE="POSIX"... How does printf know how to print out those non-ASCII characters when it's using a locale with ASCII encoding (LC_CTYPE="POSIX")? Like I said, the program works even without setlocale, so the default locale (C or POSIX) is used, which does not support non-ASCII characters. The fact that it manages to count the special characters correctly regardless of the current locale shows that there's more to it. I assume printing only involves sending a byte stream to stdout, so printf does not really need to know the encoding there, but for precision modifiers it does.

> I think you would do this by reading in your file using fgets and then using the mbstowcs function to convert to a native unicode string.

Yep I understand that part now. Not quite sure how fgetws is useful now that I know how it works, and the part about fopen still confuses me.

> This I don't understand. The default locale should be "C". You could try setting the locale to "C" explicitly.

Is there any difference between C and POSIX? I don't think there is on my system anyway, and the default for GLIBC is POSIX.
06-27-2005
^xor

If anyone is interested, I've figured out the behaviour of the library functions and locales.

printf with a precision modifier does indeed count the bytes. I have no idea why it didn't work like it was supposed to before, but it does now.

One problem I'm still facing is opening files in one encoding and then printing them out in the user-specified locale. I created a test file with a few normal and some latin1 characters, and then encoded it first as UTF-8 and then as latin1 (set fenc=UTF-8/latin1 in VIM). I kept my own locale as en_US.UTF-8 all the time.

When I open and print the UTF-8 encoded file, everything looks correct. However, when I try to open the latin1 (ISO-8859-1) encoded file, the special latin1 characters are messed up. Again, this is expected since the locale the program uses is UTF-8, so it will try to interpret those characters as UTF-8.
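
This mismatch can be used as a detection heuristic: latin1 text with accented characters is almost never well-formed UTF-8. Below is a sketch of a validity check, with is_valid_utf8 as a hypothetical helper name. It is a heuristic only, since pure ASCII passes too and a pass does not prove the file was meant as UTF-8.

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, else 0.
 * is_valid_utf8 is a hypothetical helper name. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;      /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */
        unsigned long min; /* smallest code point allowed at this length */

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0; /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= extra)
            return 0; /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += extra + 1;
    }
    return 1;
}
```

The latin1 byte 0xE9 (e with acute accent) looks like the start of a three-byte UTF-8 sequence, so it fails the check unless two continuation bytes happen to follow.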

Since all the library functions depend on the specified locale, you can't do anything with byte streams encoded in a different charset than the locale the program uses. How do you overcome this?

VIM will automatically convert the charset in a file to the one I'm using when displaying the text, so there has to be some way to do it.
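
For the latin1 case specifically the conversion can be done by hand, because every ISO-8859-1 byte value is also its Unicode code point. Below is a sketch, with latin1_to_utf8 as a hypothetical helper name. For arbitrary charset pairs the general-purpose tool on glibc is iconv(3), which is presumably similar to what an editor does internally, though that is an assumption.

```c
#include <stddef.h>

/* Convert ISO-8859-1 bytes to UTF-8. Bytes below 0x80 are copied as-is;
 * bytes from 0x80 up become a two-byte UTF-8 sequence.
 * Returns the number of bytes written to `dst` (no terminator is added),
 * or (size_t)-1 if `dst_len` is too small.
 * latin1_to_utf8 is a hypothetical helper name. */
size_t latin1_to_utf8(const unsigned char *src, size_t src_len,
                      unsigned char *dst, size_t dst_len)
{
    size_t out = 0;

    for (size_t i = 0; i < src_len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            if (out + 1 > dst_len)
                return (size_t)-1;
            dst[out++] = c;
        } else {
            if (out + 2 > dst_len)
                return (size_t)-1;
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));   /* 110000xx */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F)); /* 10xxxxxx */
        }
    }
    return out;
}
```

A destination buffer of twice the input length is always large enough. The result can then be handed to mbstowcs or printed directly when the locale is UTF-8.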