usage of char or wchar_t
If we are using a variable-size encoding (like UTF-8 or UTF-16), and we declare a char or a wchar_t to read and store the strings, how does it work when the memory required to store a particular character is more or less than the size of wchar_t?
For example, if wchar_t is 2 bytes on some compiler and a Unicode character requires 4 bytes, how would it be stored? Or wouldn't it be a waste of memory for ASCII characters, since they need only one byte each in the UTF-8 encoding?
It depends. UTF-16 requires at least two bytes to store one character, while UTF-8 requires at least one byte.
But because these are variable-length encodings, a character that needs 4 bytes (the largest a Unicode character can use) is stored as two UTF-16 code units in a row, or as four UTF-8 bytes. The length varies; it isn't fixed.
UTF-8 and UTF-16 use multi-unit sequences for the characters that don't fit in a single code unit. A character outside the Basic Multilingual Plane, for example, takes two 16-bit code units (a surrogate pair) in UTF-16 and four bytes in UTF-8.
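A small, purely illustrative sketch of that (U+1F600 is an arbitrary example of a code point above U+FFFF; real code would use a library rather than encoding by hand):

#include <stdio.h>

int main(void)
{
    unsigned long cp = 0x1F600UL;   /* example code point outside the BMP */

    /* UTF-8: code points from U+10000 up take four bytes */
    unsigned char u8[4];
    u8[0] = (unsigned char)(0xF0 |  (cp >> 18));
    u8[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    u8[2] = (unsigned char)(0x80 | ((cp >> 6)  & 0x3F));
    u8[3] = (unsigned char)(0x80 |  (cp        & 0x3F));

    /* UTF-16: the same code point becomes a surrogate pair (two 16-bit units) */
    unsigned long  v    = cp - 0x10000UL;
    unsigned short high = (unsigned short)(0xD800 | (v >> 10));
    unsigned short low  = (unsigned short)(0xDC00 | (v & 0x3FF));

    printf("UTF-8 : %02X %02X %02X %02X (4 bytes)\n", u8[0], u8[1], u8[2], u8[3]);
    printf("UTF-16: %04X %04X (2 code units)\n", high, low);
    return 0;
}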
You are of course trading space for efficiency when choosing between UTF-8 and UTF-16. How much of a trade-off that is depends on how often your text contains characters outside the ASCII range. If 99% of your characters are plain ASCII, you should go with char; if, on the other hand, a lot of your characters are outside that range, then you should probably go with wchar_t.
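As a rough illustration of that trade-off for text that is pure ASCII (the exact wide size depends on the platform's wchar_t):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const char    narrow[] = "hello";    /* one byte per ASCII character */
    const wchar_t wide[]   = L"hello";   /* sizeof(wchar_t) bytes per character */

    printf("narrow: %u bytes\n", (unsigned)sizeof(narrow));  /* 6, incl. '\0' */
    printf("wide  : %u bytes\n", (unsigned)sizeof(wide));    /* 12 or 24, depending on wchar_t */
    return 0;
}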
Note also that not all library functions "understand" multi-unit sequences. Put another way, if every character doesn't fit in a single element of the character array, you may get unexpected results.
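For instance, strlen() counts bytes, not characters, so even a two-byte UTF-8 sequence throws the count off (the escape sequences below just spell out the UTF-8 bytes explicitly):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "héllo": the é is two bytes in UTF-8 (0xC3 0xA9) */
    const char utf8[] = "h\xC3\xA9llo";

    printf("strlen: %u\n", (unsigned)strlen(utf8));   /* prints 6, not 5 */
    return 0;
}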
More on UTF8, UTF16, and UTF32 encodings of Unicode "code points", or distinct characters: http://www.unicode.org/standard/prin...Encoding_Forms
You have to be careful with what you choose, because the size of wchar_t is implementation defined. In GNU's libc, wchar_t is 32 bits. In the Microsoft CRT and Platform SDK, wchar_t is 16 bits, and the wide-character APIs explicitly expect UTF-16LE.
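You can see this directly; the same line typically reports 2 with the Microsoft compiler and 4 with GCC/glibc:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Implementation defined: don't assume a particular value */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}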
If you're using UTF-x characters, it's probably better to create a typedef like this:

typedef unsigned char UTF8;
typedef wchar_t UTF16;

That way you can change the underlying type depending on the platform and only change it in one place.
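For instance, one way such a typedef might be made platform-dependent (a sketch only; the _WIN32 check stands in for whatever platform detection you actually use):

/* Sketch: choose a 16-bit type per platform so the rest of the code
   only ever sees the name UTF16. */
#ifdef _WIN32
typedef wchar_t        UTF16;   /* wchar_t is 16 bits with the Microsoft CRT */
#else
typedef unsigned short UTF16;   /* wchar_t is usually 32 bits elsewhere */
#endif

typedef unsigned char UTF8;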
We already have those (well, unless your compiler doesn't support C99).
Yes, you may want to use those. But I would still recommend using a typedef to declare UTF8/UTF16; that way you can change what UTF8 is represented as on a given system at a later stage, without having to traipse through all the code that happens to use int8_t for any other purpose.
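A sketch of that suggestion using the C99 fixed-width types (assuming <stdint.h> is available):

#include <stdint.h>

/* Only these two lines change if UTF8/UTF16 ever need a different
   representation; code using uint8_t/uint16_t for other purposes is untouched. */
typedef uint8_t  UTF8;
typedef uint16_t UTF16;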