usage of char or wchar_t
If we are using a variable-size encoding (like UTF-8 or UTF-16), and we declare a char or a wchar_t to read and store the strings, how does it work when the memory required to store a particular character is more or less than the size of wchar_t?
For example, if wchar_t is 2 bytes on some compiler and a Unicode character requires 4 bytes, how would it be stored? Or wouldn't it be a waste of memory for ASCII characters, since they need only one byte each in the UTF-8 encoding?
It depends. UTF-16 requires at least two bytes to store one character, while UTF-8 requires at least one byte.
But because these are variable-length encodings, a character that needs 4 bytes (the largest a Unicode character can use) is stored as two UTF-16 code units in a row, or as four UTF-8 bytes. The length varies; it isn't fixed.
UTF-8 and UTF-16 use multi-unit sequences for the characters that don't fit in a single code unit. A character outside the Basic Multilingual Plane, for example, takes two 16-bit code units (a surrogate pair) in UTF-16 and four bytes in UTF-8.
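A small, purely illustrative sketch of that (U+1F600 is an arbitrary example of a code point above U+FFFF; real code would use a library rather than encoding by hand):

#include <stdio.h>

int main(void)
{
    unsigned long cp = 0x1F600UL;   /* example code point outside the BMP */

    /* UTF-8: code points from U+10000 up take four bytes */
    unsigned char u8[4];
    u8[0] = (unsigned char)(0xF0 |  (cp >> 18));
    u8[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    u8[2] = (unsigned char)(0x80 | ((cp >> 6)  & 0x3F));
    u8[3] = (unsigned char)(0x80 |  (cp        & 0x3F));

    /* UTF-16: the same code point becomes a surrogate pair (two 16-bit units) */
    unsigned long  v    = cp - 0x10000UL;
    unsigned short high = (unsigned short)(0xD800 | (v >> 10));
    unsigned short low  = (unsigned short)(0xDC00 | (v & 0x3FF));

    printf("UTF-8 : %02X %02X %02X %02X (4 bytes)\n", u8[0], u8[1], u8[2], u8[3]);
    printf("UTF-16: %04X %04X (2 code units)\n", high, low);
    return 0;
}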
You are of course trading space for efficiency when choosing between UTF-8 and UTF-16. How much of a trade-off that is depends on how often your text contains characters outside the ASCII range. If 99% of your characters are plain ASCII, you should go with char; if, on the other hand, a lot of your characters are outside that range, then you should probably go with wchar_t.
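As a rough illustration of that trade-off for text that is pure ASCII (the exact wide size depends on the platform's wchar_t):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const char    narrow[] = "hello";    /* one byte per ASCII character */
    const wchar_t wide[]   = L"hello";   /* sizeof(wchar_t) bytes per character */

    printf("narrow: %u bytes\n", (unsigned)sizeof(narrow));  /* 6, incl. '\0' */
    printf("wide  : %u bytes\n", (unsigned)sizeof(wide));    /* 12 or 24, depending on wchar_t */
    return 0;
}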
Note also that not all library functions "understand" multi-unit sequences. Put another way, if every character doesn't fit in a single element of the character array, you may get unexpected results.
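For instance, strlen() counts bytes, not characters, so even a two-byte UTF-8 sequence throws the count off (the escape sequences below just spell out the UTF-8 bytes explicitly):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "héllo": the é is two bytes in UTF-8 (0xC3 0xA9) */
    const char utf8[] = "h\xC3\xA9llo";

    printf("strlen: %u\n", (unsigned)strlen(utf8));   /* prints 6, not 5 */
    return 0;
}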
More on UTF8, UTF16, and UTF32 encodings of Unicode "code points", or distinct characters: http://www.unicode.org/standard/prin...Encoding_Forms
You have to be careful with what you choose, because the size of wchar_t is implementation defined. In GNU's libc, wchar_t is 32 bits. In the Microsoft CRT and Platform SDK, wchar_t is 16 bits, and the wide-character APIs explicitly expect UTF-16LE.
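You can see this directly; the same line typically reports 2 with the Microsoft compiler and 4 with GCC/glibc:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Implementation defined: don't assume a particular value */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}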
If you're using UTF-x characters, it's probably better to create a typedef like this:

typedef unsigned char UTF8;
typedef wchar_t UTF16;

That way you can change the underlying type depending on the platform and only change it in one place.
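For instance, one way such a typedef might be made platform-dependent (a sketch only; the _WIN32 check stands in for whatever platform detection you actually use):

/* Sketch: choose a 16-bit type per platform so the rest of the code
   only ever sees the name UTF16. */
#ifdef _WIN32
typedef wchar_t        UTF16;   /* wchar_t is 16 bits with the Microsoft CRT */
#else
typedef unsigned short UTF16;   /* wchar_t is usually 32 bits elsewhere */
#endif

typedef unsigned char UTF8;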
We already have those (well, unless your compiler doesn't support C99).
Yes, you may want to use those. But I would still recommend using a typedef to declare UTF8/UTF16; that way you can change what UTF8 is represented as on a given system at a later stage, without having to traipse through all the code that happens to use int8_t for any other purpose.
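A sketch of that suggestion using the C99 fixed-width types (assuming <stdint.h> is available):

#include <stdint.h>

/* Only these two lines change if UTF8/UTF16 ever need a different
   representation; code using uint8_t/uint16_t for other purposes is untouched. */
typedef uint8_t  UTF8;
typedef uint16_t UTF16;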