Thread: usage of char or wchar_t

  1. #1
    Registered User
    Join Date
    Jun 2008
    Posts
    1

    usage of char or wchar_t

    If we use a variable-width encoding (like UTF-8 or UTF-16) and declare a char or a wchar_t to read and store strings, what happens when a particular character needs more or less memory than the size of wchar_t?

    For example, if wchar_t is 2 bytes on some compiler and a Unicode character requires 6 bytes, how would it be stored? And wouldn't wchar_t waste memory on ASCII characters, which need only one byte each in UTF-8?

  2. #2
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    It depends on the encoding: UTF-16 requires at least two bytes to store one character, while UTF-8 requires at least one byte.
    Variable-length encoding means that a character needing 4 bytes (the most any Unicode character can take) is stored as two consecutive UTF-16 code units or four consecutive UTF-8 code units. The length is variable, not fixed.
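    To make the "two UTF-16 units or four UTF-8 units" point concrete, here is a minimal sketch of encoding a single code point both ways. The function names utf8_encode and utf16_encode are made up for this illustration (they are not standard library calls), and the code assumes the input is a valid code point, which it checks only with assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
   Writes 1 to 4 bytes into out and returns how many were written. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    /* Only valid scalar values: up to U+10FFFF, surrogates excluded. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x80) {                       /* ASCII: one byte */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18)); /* four bytes */
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: encode one code point as UTF-16.
   Writes 1 or 2 16-bit units into out and returns how many were written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {                    /* fits in a single unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                         /* split into a surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

    With these, 'A' (U+0041) comes out as one UTF-8 byte and one UTF-16 unit, while U+1F600 comes out as four UTF-8 bytes and two UTF-16 units.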

  3. #3
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    UTF-8 and UTF-16 use multi-unit sequences for the characters that don't fit within a single code unit. No Unicode character actually needs 6 bytes: the largest code points take two 16-bit units (a surrogate pair) in UTF-16 and four bytes in UTF-8.

    You are of course trading space for efficiency when choosing between UTF-8 and UTF-16. How much of a trade-off that is depends on how often your text goes beyond the 128 ASCII characters, which are the only ones UTF-8 stores in a single byte. If 99% of your characters are ASCII, then you should go with char; if, on the other hand, a lot of your characters are above U+07FF (where UTF-8 needs three bytes each and UTF-16 needs two), then you should probably go with wchar_t.
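    The space side of that trade-off can be estimated by adding up the bytes each encoding needs per code point. This is only a rough sketch: utf8_total_bytes and utf16_total_bytes are names invented for this example, and the input is assumed to be an array of valid code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one code point in UTF-8 (1 to 4). */
static size_t utf8_bytes_for(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Bytes needed for one code point in UTF-16 (2, or 4 for a surrogate pair). */
static size_t utf16_bytes_for(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* Hypothetical helper: total UTF-8 size in bytes of n code points. */
size_t utf8_total_bytes(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += utf8_bytes_for(cps[i]);
    return total;
}

/* Hypothetical helper: total UTF-16 size in bytes of n code points. */
size_t utf16_total_bytes(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += utf16_bytes_for(cps[i]);
    return total;
}
```

    For the five ASCII letters of "hello" this gives 5 bytes in UTF-8 against 10 in UTF-16; for three kanji such as U+65E5 U+672C U+8A9E it gives 9 bytes in UTF-8 against 6 in UTF-16.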

    Note also that not all library functions "understand" multi-unit sequences. Put another way, if each character doesn't fit in a single element of the character array, you may get unexpected results: strlen(), for instance, counts bytes, not characters.
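    As an illustration of that pitfall, the sketch below counts code points in a UTF-8 string by skipping continuation bytes (those of the form 10xxxxxx) and compares the result with strlen(). The helper names are invented for this example, and the code assumes the string is already valid UTF-8; it does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of code points in a valid,
   zero-terminated UTF-8 string. Continuation bytes are not counted. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Every byte except 10xxxxxx starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* Hypothetical helper: how far strlen() overshoots the code point count. */
size_t utf8_continuation_bytes(const char *s)
{
    return strlen(s) - utf8_code_points(s);
}
```

    For the string "h\xC3\xA9llo" (the word "héllo" in UTF-8), strlen() reports 6 while utf8_code_points() reports 5.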

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  4. #4
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    More on UTF8, UTF16, and UTF32 encodings of Unicode "code points", or distinct characters: http://www.unicode.org/standard/prin...Encoding_Forms

    You have to be careful with what you choose because the size of wchar_t is implementation-defined. In GNU's libc, wchar_t is 32 bits. In the Microsoft CRT and Platform SDK, wchar_t is 16 bits and explicitly expects UTF-16LE for its wide-character APIs.
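    One way to cope with that difference is to ask the implementation how wide wchar_t is instead of assuming. The sketch below uses WCHAR_MAX from <wchar.h>; the function name wchar_units_needed is made up for this example, and it assumes that a narrow wchar_t holds UTF-16 units, as described for the Microsoft CRT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: how many wchar_t elements one code point
   occupies on this implementation. */
size_t wchar_units_needed(uint32_t cp)
{
    assert(cp <= 0x10FFFF);

#if WCHAR_MAX >= 0x10FFFF
    /* Wide enough for any code point (e.g. 32-bit wchar_t). */
    (void)cp;
    return 1;
#else
    /* Assumption: a narrower wchar_t stores UTF-16 code units,
       so code points beyond U+FFFF need a surrogate pair. */
    return cp < 0x10000 ? 1 : 2;
#endif
}
```

    On a platform with a 32-bit wchar_t every code point takes one element; with a 16-bit wchar_t, a code point such as U+1F600 takes two.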

    gg

  5. #5
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    If you're using UTF-x characters, it's probably better to create a typedef like this:
    Code:
    typedef unsigned char  UTF8;
    typedef wchar_t        UTF16;
    That way you can change the underlying type depending on the platform and only change it in one place.

  6. #6
    Registered User
    Join Date
    Aug 2005
    Posts
    96
    Quote Originally Posted by cpjust View Post
    If you're using UTF-x characters, it's probably better to create a typedef like this:
    Code:
    typedef unsigned char  UTF8;
    typedef wchar_t        UTF16;
    That way you can change the underlying type depending on the platform and only change it in one place.
    We already have those (well unless your compiler doesn't support C99).

    int8_t
    int16_t


  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by sethjackson View Post
    We already have those (well unless your compiler doesn't support C99).

    int8_t
    int16_t

    Yes, you may want to use those. But I would still recommend using a typedef to declare UTF8/UTF16. That way you can change what UTF8 is represented as at a later stage, without having to traipse through all the code that happens to use int8_t for some other purpose.
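    Combining both suggestions, a possible sketch is to build the typedefs on the fixed-width types and let the compiler check the sizes. It uses the unsigned types uint8_t and uint16_t rather than int8_t and int16_t, since code units are usually treated as unsigned values; that choice, the C11 _Static_assert checks, and the helper utf16_unit_count are assumptions of this example, not something from the posts above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* The only place to edit if a platform needs a different underlying type. */
typedef uint8_t  UTF8;
typedef uint16_t UTF16;

/* Fail the build, rather than misbehave at run time, if the sizes are off. */
_Static_assert(sizeof(UTF8) * CHAR_BIT == 8, "UTF8 must be 8 bits");
_Static_assert(sizeof(UTF16) * CHAR_BIT == 16, "UTF16 must be 16 bits");

/* Hypothetical helper: length, in code units, of a zero-terminated
   UTF-16 string. This counts units, not characters. */
size_t utf16_unit_count(const UTF16 *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

    Code written against UTF8 and UTF16 then never mentions wchar_t or int8_t directly, so the representation can change without touching unrelated uses of those types.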

    --
    Mats
