Making my library compatible with unicode, in Ubuntu

**Lucide** · 04-09-2018

Hi :), first post

Some years ago I wrote a text-based graphics library that works similarly to Scratch, but with 2D arrays of integers as sprites, one character per cell.
I used integers instead of chars because I knew I would sooner or later add Unicode compatibility, so here I am now, three years later.

A few days ago I studied the topic online (mostly here, here, there and of course Stack Overflow), and I came up with this:

1. UTF-32 encoding is the best option for me, as I treat character matrices like bitmap images (multibyte encodings would just be painful and unnecessary).
2. Using UTF-32 means using wide char variables, which are handled with the wchar versions of the basic string functions (input/output included).
3. The stdout stream has to be oriented in wide char mode; the orientation is set when the first print is made and practically can't be changed afterwards.
4. Once oriented, stdout can't be correctly used to print with the classic, non-wchar functions.

If everything above is correct, here are my considerations:

1. At the time, I thought of a quick n' dirty, temporary solution to show some non-ASCII characters: using nine negative values (from -1 to -9) to eventually launch one of nine printf("..") calls under a switch(). That's because when hardcoded (e.g. printf("☃")), printf seems to handle non-ASCII chars correctly, somehow. I really can't explain this.
2. If I can, I would really prefer not to change stdout's orientation (see point 1.)

So here are the questions:

1. Why does printf() print non-ASCII characters in a narrow-oriented stdout?
2. Can I print wide chars while keeping the stream narrow? (It's a library; users would get mad if they had to print with wchar functions every time, even just to debug.)
3. I provided some context: what would you experts use in this situation?
4. Does it take more time to print wide characters? Can it be problematic?

*phew*
Thank you!!
Lucide

**Salem** · 04-09-2018

Well on my Ubuntu machine, wide characters are encoded in UTF-8 when assigned to char arrays.

Code:

$ cat foo.c
#include <stdio.h>
#include <string.h>
int main ( ) {
    char buff[] = "☃";
    printf("%s\n",buff);
    printf("%zd\n",strlen(buff));
}
$ gcc foo.c
$ ./a.out 
☃
3
$ ./a.out | hd
00000000  e2 98 83 0a 33 0a                                 |....3.|
00000006

Whereas this does something different.

Code:

#include <stdio.h>
#include <string.h>
#include <wchar.h>
int main ( ) {
    wchar_t buff[] = L"☃";
    printf("%S\n",buff);
    printf("%zd\n",wcslen(buff));
}

> Why does printf() print non-ASCII characters in a narrow-oriented stdout?
Presumably because you used a narrow char string, and the compiler automatically converted it all to UTF-8.

> Does it take more time to print wide characters? Can it be problematic?
I/O is always relatively slow, so why worry about it?
Also, you're at the mercy of
- the quality of the std library implementation
- the extent of wide character support (either at a basic level of wide characters at all, or whether specific glyphs have been defined) in whatever terminal/console you end up using.

**john.c** · 04-09-2018

Technically you should be saying "%zu" instead of "%zd". Otherwise the very first time you encounter a string over 9.2 quintillion characters long, BANG!

**Lucide** · 04-10-2018

Damn, it's even more complicated than I thought, but a lot more things make sense now, thank you!

Ok, so I can print non-ASCII in a narrow stream, but understandably only if the characters are represented with a narrow (groups of one byte) encoding like, as you said, UTF-8.
Alright, this is the way I want to follow. I'll need to write:

- a struct, or preferably a 4-byte variable, to store my UTF-8-or-else chars
- a UTF-8-or-else text stream interpreter, for both stdin and text files, to recognize bytes that are part of a sequence

This article shows three ways to work with non-ASCII characters:

You are free to choose a string encoding for internal use in your program. The choice pretty much boils down to either UTF-8, wide (4-byte) characters, or multibyte. Each has its advantages and disadvantages:

UTF-8
- Pro: compatible with all existing strings and most existing code
- Pro: takes less space
- Pro: widely used as an interchange format (e.g. in XML)
- Con: more complex processing, O(n) string indexing
Wide characters
- Pro: easy to process
- Con: wastes space
- Pro/Con: although you can use the syntax L"Hello, world." to easily include wide-character strings in C programs, the size of wide characters is not consistent across platforms (some incorrectly use 2-byte wide characters)
- Con: should not be used for output, since spurious zero bytes and other low-ASCII characters with common meanings (such as '/' and '\n') will likely be sprinkled throughout the data.
Multibyte
- Pro: no conversions ever needed on input and output
- Pro: built-in C library support
- Pro: provides the widest possible internationalization, since in rare cases conversion between local encodings and Unicode does not work well
- Con: strings are opaque
- Con: perpetuates incompatibilities. For example, there are three major encodings for Russian. If one Russian sends data to another through your program, the recipient will not be able to read the message if his or her computer is configured for a different Russian encoding. But if your program always converts to UTF-8, the text is effectively normalized so that it will be widely legible (especially in the future) no matter what encoding it started in.

But I didn't really get what exactly "multibyte" is, or what it means.
There are also some cool code snippets for UTF-8.

**Lucide** · 04-11-2018

I did some additional research and found out new things but, as before, I'd like some confirmation.

I understood that the wchar type and wchar functions are not there simply to work, read, and write using a 32-bit encoding;
they're a way to manage characters more easily inside the program, while the I/O functions decode/encode
characters from/to a defined locale.

That is, I could set the locale to UTF-8 and, by using the wide char functions, I should be able to read UTF-8 input (into wchar_t variables/strings)
and write UTF-8 encoded text (still from wchar_t variables/strings).
This approach would be another way to keep stdin/stdout narrow.

Did I get it right?

I also read about the ICU library; would you recommend it?

What would you experts use in this situation?

I still couldn't find anything about the "multibyte" solution cited above.

Thank you
source: string - Handling multibyte (non-ASCII) characters in C - Stack Overflow

**christop** · 04-12-2018

UTF-8 is a "multibyte" encoding. In UTF-8 a character can be from one up to four bytes long.

**Lucide** · 04-12-2018

I know, that's why I'm confused about the article's subdivision
I'll post it again:
[attached image: proof.jpg]

**christop** · 04-12-2018

> Originally Posted by Lucide
>
> I know, that's why I'm confused about the article's subdivision
> I'll post it again:
> [attached image: proof.jpg]

OK, I see. That article is a little confusing, since it also says this in the same section (II. The C library):

"Multibyte character" or "multibyte string" refers to text in one of the many (possibly language-specific) encodings that exist throughout the world. ... UTF-8 is in fact only one such encoding

That matches my understanding of "multibyte" characters/strings, so I'm inclined to think that the list of pros and cons refers to other, non-Unicode multibyte encodings like Shift-JIS.

**Codeplug** · 04-13-2018

More research (with some windows stuff you may not care about): Non-English characters with cout

gg

**Lucide** · 04-13-2018

I reread the article after christop's clarification and came to the conclusion that I'll use wide chars:
I'll lose some performance, since on every output (and input) the chars will have to be encoded, but UTF-8 would be even more complex to compare, store and generally work with.

Maybe in the future I'll take a look at ICU.

Thank you all very much!

**jimblumberg** · 04-13-2018

Okay, so what do you think going with wide chars is going to buy you?

> I'll lose some performance, since on every output (and input) the chars will have to be encoded, but UTF-8 would be even more complex to compare, store and generally work with.

What makes you say this?

But if you're basing your decision on just the links you posted, you haven't done nearly enough research to make a viable decision.

Perhaps you may be interested in these links: UTF8 everywhere, CppCon 2016, Unicode in C++, or CppCon 2014

**Lucide** · 04-13-2018

Well, I posted at least 6 URLs here, between articles and pages, that I could find and read when I had some spare time.

I didn't write an extensive list of questions asking for experts' opinions for nothing.
The answers I got and the hypotheses I've made are all written above.

Glad to hear you don't agree with the choice I made (given that you've read my usage context); that's an opinion, which is what I'm looking for.

Thank you, I'll read the articles you provided as soon as I can. I assume I'll find there why you are suggesting UTF-8.

**Lucide** · 04-15-2018

So far I've read the first article you provided.
It's really good, even if a bit Windows-oriented.

But, for now, I still think I should use wide chars, and I'll explain why:
- especially because I'm working on Linux, output will almost always be encoded in UTF-8 anyway. If not, that will probably be a user choice, or something else will be preventing my program from writing UTF-8.
- wide chars offer native support for almost whatever encoding a program is fed with; the I/O functions take care of that
- as internally I'm not working with strings in the sense of strings of characters, but with 2D arrays of single characters, I won't benefit from UTF-8's compatibility with (one-byte) char arrays
- even if I somehow used a char string to represent UTF-8 chars, I would have to put them in a struct anyway, because:
  1. working with a 3D array with different "z" depths would be, if not hard, way more complex, even at code level
  2. given point 1, I would accordingly create a struct containing a 1-to-4 sized char array (and probably some data to save computing time too). That would defeat UTF-8's advantage of being more space efficient.

Here are my hypotheses, demolish them.
