Thread: Making my library compatible with unicode, in Ubuntu

  1. #1
    Registered User
    Join Date
    Apr 2018
    Posts
    7

    Question Making my library compatible with unicode, in Ubuntu

    Hi :), first post

    Some years ago I wrote a text-based graphics library, which worked similarly to Scratch, but with 2D arrays of integers as sprites, one character per cell.
    I used integers instead of chars because I knew I would sooner or later add Unicode compatibility, so here I am now, three years later.

    Some days ago I studied the topic online (mostly here, here, there and of course Stack Overflow), and I came away with this:
    1. UTF-32 encoding is the best option for me, as I treat character matrices like bitmap images (multi-byte encodings would be just painful and unnecessary)
    2. Using UTF-32 means using wide char variables, which are handled by the wchar version of every basic string function (in/out included)
    3. The stdout stream has to be oriented in wide char mode; the orientation is set at the time the first print is made, and practically can't be modified afterwards
    4. Once oriented, stdout can't correctly be used to print with classic, non-wchar functions
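    For reference, points 3 and 4 can be observed with fwide(), which queries (or sets) a stream's orientation. A minimal sketch, nothing library-specific:

    ```c
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        /* fwide(stream, 0) only queries the orientation:
           0 = not yet oriented, <0 = byte-oriented, >0 = wide-oriented. */
        int before = fwide(stdout, 0);
        printf("oriented before first print: %s\n", before == 0 ? "no" : "yes");
        /* That printf was byte I/O, so stdout is now byte-oriented for good. */
        int after = fwide(stdout, 0);
        printf("byte-oriented after: %s\n", after < 0 ? "yes" : "no");
        /* A wprintf(L"...") call here would fail: the orientation is fixed
           until the stream is reopened (e.g. with freopen). */
        return 0;
    }
    ```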


    If everything above is correct, here are my considerations:


    1. At the time, I thought about a quick 'n' dirty, temporary solution to show some non-ASCII characters: using nine negative values (from -1 to -9) to dispatch to one of nine printf("..") calls under a switch(). That's because when hardcoded (e.g. printf("…")), printf seems to handle non-ASCII chars correctly, somehow. I really can't explain this.
    2. If I can, I would really prefer not to change stdout's orientation (see point 1.)


    So here are the questions:


    1. Why does printf() print non-ASCII characters on a narrow-oriented stdout?
    2. Can I print wide chars while keeping the stream narrow? (it's a library; the user would go mad having to print every time with wchar functions, even just to debug)
    3. Given the context I provided, what would you experts use in this situation?
    4. Does it take more time to print wide characters? Can it be problematic?


    *phew*
    Thank you!!
    Lucide

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    37,421
    Well, on my Ubuntu machine, non-ASCII characters end up encoded as UTF-8 when assigned to char arrays.
    Code:
    $ cat foo.c
    #include <stdio.h>
    #include <string.h>
    int main ( ) {
        char buff[] = "☃";
        printf("%s\n",buff);
        printf("%zd\n",strlen(buff));
    }
    $ gcc foo.c
    $ ./a.out 
    ☃
    3
    $ ./a.out | hd
    00000000  e2 98 83 0a 33 0a                                 |....3.|
    00000006
    Whereas this does something different.
    Code:
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    int main ( ) {
        wchar_t buff[] = L"☃";
        printf("%S\n",buff);
        printf("%zd\n",wcslen(buff));
    }
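    For what it's worth, here is a sketch of printing a wide string through narrow printf with %ls (the standard spelling of the POSIX %S), assuming a UTF-8 locale is installed: the stream stays byte-oriented, and the wide-to-multibyte conversion happens inside printf.

    ```c
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        /* %ls converts the wide string to the locale's multibyte encoding.
           In the default "C" locale that conversion fails for non-ASCII,
           so switch to a UTF-8 locale first (assumed to be installed). */
        if (!setlocale(LC_ALL, "C.UTF-8"))
            setlocale(LC_ALL, "");
        wchar_t buff[] = L"☃";
        printf("%ls\n", buff);          /* prints the snowman as UTF-8 bytes */
        printf("%zu\n", wcslen(buff));  /* 1: one wide character, not 3 bytes */
        return 0;
    }
    ```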
    > Why does printf() print non-ASCII characters on a narrow-oriented stdout?
    Presumably because you used narrow chars, and the compiler stored the string literal as UTF-8 bytes.

    > Does it take more time to print wide characters? can it be problematic?
    I/O is always relatively slow anyway, so why worry about it?
    Also, you're at the mercy of
    - the quality of the std library implementation
    - the extent of wide character support (either at a basic level of wide characters at all, or whether specific glyphs have been defined) in whatever terminal/console you end up using.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Dec 2017
    Posts
    627
    Technically you should be saying "%zu" instead of "%zd". Otherwise the very first time you encounter a string over 9.2 quadrillion characters long, BANG!
    The world hangs on a thin thread, and that is the psyche of man. - Carl Jung

  4. #4
    Registered User
    Join Date
    Apr 2018
    Posts
    7
    Damn, it's even more complicated than I thought, but a lot more things make sense now. Thank you!

    OK, so I can print non-ASCII in a narrow stream, but understandably only if it's represented with a narrow (sequences of single bytes) encoding, like, as you said, UTF-8.
    Alright, this is the route I want to follow. I'll need to write:
    • a struct, or preferably a 4-byte variable, to store my UTF-8-or-else chars
    • a UTF-8-or-else text stream interpreter, for both stdin and text files, to recognize bytes that are part of a multi-byte sequence
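    A minimal sketch of what that interpreter has to do: decode one UTF-8 sequence into a 32-bit code point by inspecting the leading byte. This is illustration only (the names are mine, and it skips validation of continuation bytes, overlong forms, and surrogates):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Decode one UTF-8 sequence starting at s into *cp.
       Returns the number of bytes consumed (0 on an unrecognized lead byte). */
    static int utf8_decode(const unsigned char *s, uint32_t *cp) {
        if (s[0] < 0x80) {                 /* 0xxxxxxx: plain ASCII */
            *cp = s[0];
            return 1;
        }
        if ((s[0] & 0xE0) == 0xC0) {       /* 110xxxxx 10xxxxxx */
            *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0) {       /* 1110xxxx 10xxxxxx 10xxxxxx */
            *cp = ((uint32_t)(s[0] & 0x0F) << 12)
                | ((uint32_t)(s[1] & 0x3F) << 6)
                |  (s[2] & 0x3F);
            return 3;
        }
        if ((s[0] & 0xF8) == 0xF0) {       /* 11110xxx + three continuations */
            *cp = ((uint32_t)(s[0] & 0x07) << 18)
                | ((uint32_t)(s[1] & 0x3F) << 12)
                | ((uint32_t)(s[2] & 0x3F) << 6)
                |  (s[3] & 0x3F);
            return 4;
        }
        return 0;
    }

    int main(void) {
        uint32_t cp;
        int n = utf8_decode((const unsigned char *)"☃", &cp);
        printf("U+%04X, %d bytes\n", (unsigned)cp, n);   /* U+2603, 3 bytes */
        return 0;
    }
    ```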


    This article shows three ways to work with non-ASCII characters:
    You are free to choose a string encoding for internal use in your program. The choice pretty much boils down to either UTF-8, wide (4-byte) characters, or multibyte. Each has its advantages and disadvantages:

    • UTF-8
      • Pro: compatible with all existing strings and most existing code
      • Pro: takes less space
      • Pro: widely used as an interchange format (e.g. in XML)
      • Con: more complex processing, O(n) string indexing

    • Wide characters
      • Pro: easy to process
      • Con: wastes space
      • Pro/Con: although you can use the syntax L"Hello, world." to easily include wide-character strings in C programs, the size of wide characters is not consistent across platforms (some incorrectly use 2-byte wide characters)
      • Con: should not be used for output, since spurious zero bytes and other low-ASCII characters with common meanings (such as '/' and '\n') will likely be sprinkled throughout the data.

    • Multibyte
      • Pro: no conversions ever needed on input and output
      • Pro: built-in C library support
      • Pro: provides the widest possible internationalization, since in rare cases conversion between local encodings and Unicode does not work well
      • Con: strings are opaque
      • Con: perpetuates incompatibilities. For example, there are three major encodings for Russian. If one Russian sends data to another through your program, the recipient will not be able to read the message if his or her computer is configured for a different Russian encoding. But if your program always converts to UTF-8, the text is effectively normalized so that it will be widely legible (especially in the future) no matter what encoding it started in.
    But I didn't really get what exactly "multibyte" is and means.
    There are also some cool code snippets for UTF-8 there.

  5. #5
    Registered User
    Join Date
    Apr 2018
    Posts
    7
    I did some additional research and found out new things, but, as before, I'd like some confirmation.

    I understood that the wchar type and the wchar functions are not there simply to work, read, and write using a 32-bit encoding;
    rather, they're a way to manage characters more easily inside the program, and the I/O functions are able to decode/encode
    characters from/to the configured locale.

    That is, I could set the locale to UTF-8, and by using wide char functions I should be able to read UTF-8 input (into wchar_t variables/strings)
    and write UTF-8 encoded text (still from wchar_t variables/strings).
    This approach would be another way to keep the bytes on stdin/stdout narrow.

    Did I get it right?
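    That matches how the wide I/O functions are specified: after setlocale(), each wchar_t is converted to the locale's encoding at the stream boundary. A sketch (assuming a UTF-8 locale is installed; note that calling wprintf does wide-orient stdout, even though the bytes it writes are still the locale's multibyte encoding):

    ```c
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        /* Assumption: a UTF-8 locale is available on the system. */
        if (!setlocale(LC_ALL, "C.UTF-8"))
            setlocale(LC_ALL, "");
        wchar_t snowman = L'☃';
        /* wprintf wide-orients stdout, but the bytes it emits are the
           locale's multibyte (here UTF-8) encoding of each wchar_t. */
        wprintf(L"%lc is U+%04X\n", snowman, (unsigned)snowman);
        return 0;
    }
    ```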

    I also read about the ICU library; would you suggest it?
    What would you experts use in this situation?
    I still couldn't find anything about the "multibyte" solution cited above.

    Thank you
    source: string - Handling multibyte (non-ASCII) characters in C - Stack Overflow
    Last edited by Lucide; 04-11-2018 at 12:53 PM. Reason: source added

  6. #6
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    653
    UTF-8 is a "multibyte" encoding. In UTF-8 a character can be from one up to four bytes long.
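    To make the distinction concrete, here is a sketch comparing byte length and character count of the same UTF-8 string (assuming a UTF-8 locale is installed): é is two bytes and ☃ is three, but they are two characters.

    ```c
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* Assumption: a UTF-8 locale is available on the system. */
        if (!setlocale(LC_ALL, "C.UTF-8"))
            setlocale(LC_ALL, "");
        const char *s = "é☃";                               /* 2 + 3 bytes in UTF-8 */
        printf("bytes: %zu\n", strlen(s));                  /* counts bytes */
        printf("characters: %zu\n", mbstowcs(NULL, s, 0));  /* counts multibyte chars */
        return 0;
    }
    ```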

  7. #7
    Registered User
    Join Date
    Apr 2018
    Posts
    7
    I know, that's why I'm confused by the article's subdivision.
    I'll post it again:
    [attachment: screenshot of the article's encoding comparison]

  8. #8
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    653
    Quote Originally Posted by Lucide View Post
    I know, that's why I'm confused about the article's subdivision
    I'll post it again:
    [attachment: screenshot of the article's encoding comparison]
    OK, I see. That article is a little confusing, since it also says this in the same section (II. The C library):

    "Multibyte character" or "multibyte string" refers to text in one of the many (possibly language-specific) encodings that exist throughout the world. ... UTF-8 is in fact only one such encoding
    That matches my understanding of "multibyte" characters/strings, so I'm inclined to think that the list of pros and cons refers to other, non-Unicode multibyte encodings like Shift-JIS.

  9. #9
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,980
    More research (with some windows stuff you may not care about): Non-English characters with cout

    gg

  10. #10
    Registered User
    Join Date
    Apr 2018
    Posts
    7
    I read the article again after christop's clarification, and I've come to the conclusion of using wide chars:
    I'll lose some performance, since at every output (and input) the characters will have to be encoded, but UTF-8 would be even more complex to compare, store, and generally work with.

    Maybe in the future I'll have a look at ICU.

    Thank you all very much!

  11. #11
    Registered User
    Join Date
    May 2010
    Posts
    4,431
    Okay, so what do you think going with wide chars is going to buy you?

    I'll lose some performance since at every output(and input) the char will have to be encoded, but utf-8 would be even more complex to compare, store and generally to work with.
    What makes you say this?

    But if you're basing your decision on just the links you posted, you haven't done nearly enough research to make a viable decision.

    Perhaps you may be interested in these links: UTF8 everywhere, CppCon 2016, Unicode in C++, or CppCon 2014

  12. #12
    Registered User
    Join Date
    Apr 2018
    Posts
    7
    Well, I posted at least 6 URLs between articles and pages here, the ones I could find and read when I had some spare time.

    I didn't write an extensive list of questions, asking experts for opinions, for nothing.
    The answers I got and the hypotheses I've made are all written above.

    Glad to hear you don't agree with the choice I made (given that you've read my usage context); that's an opinion, which is what I'm looking for.
    Thank you, I'll read the articles you provided as soon as I can. I assume I'll find there why you are suggesting UTF-8.

  13. #13
    Registered User
    Join Date
    Apr 2018
    Posts
    7
    I've read (so far) the first article you provided.
    It's really good, even if a bit Windows-oriented.


    But, for now, I still think I should use wide chars, and I'll explain why:
    - especially because I'm working on Linux, output will almost always be encoded in UTF-8 anyway. If not, that will probably be a user choice, or some other reason preventing my program from writing UTF-8.
    - wide chars offer native support for almost whatever encoding a program is fed with; the I/O functions take care of that
    - as I'm internally not working with strings intended as strings of characters, but with 2D arrays of single characters, I won't benefit from UTF-8's compatibility with (one-byte) char arrays
    - even if I somehow used a char string to represent UTF-8 chars, I would have to put them in a struct anyway, because:
    1. working with a 3D array with different "z" depths would be, if not hard, way more complex, even at the code level
    2. given point one, I would accordingly create a struct containing a 1-to-4-sized char array (and probably some time-saving bookkeeping data too). That would defeat UTF-8's space efficiency.


    Here are my hypotheses; demolish them!
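    For what it's worth, here is the "one wide character per cell" idea sketched out: a fixed-size 2D grid of wchar_t printed row by row. The glyphs and names are just illustrative, and a UTF-8 locale is assumed to be installed.

    ```c
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    #define W 3
    #define H 2

    int main(void) {
        /* Assumption: a UTF-8 locale is available on the system. */
        if (!setlocale(LC_ALL, "C.UTF-8"))
            setlocale(LC_ALL, "");
        /* One wide character per cell, like a tiny sprite. */
        wchar_t sprite[H][W] = {
            { L'░', L'▒', L'░' },
            { L'▒', L'☃', L'▒' },
        };
        for (int y = 0; y < H; y++) {
            for (int x = 0; x < W; x++)
                wprintf(L"%lc", sprite[y][x]);
            wprintf(L"\n");
        }
        return 0;
    }
    ```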


