Thread: Unicode character size - performance

  1. #1
    Unregistered User Yarin
    Join Date
    Jul 2007
    Posts
    2,158

    Unicode character size - performance

    When programming for Windows it doesn't take too long to find out that the preferred character size is 2 bytes (UTF-16LE). Generally, when programming for *nix I've only ever used 1-byte characters (that is, often UTF-8, not necessarily just ASCII). But when programming a wxWidgets application for Linux, I discovered pretty quickly that it was using 4-byte characters!

    Of course, when it comes to storage, I can see that UTF-8 is almost always the way to go. But it made me wonder, are there speed considerations to be had? Does a 32+ bit processor perform 8-, 16-, and 32-bit calculations or RAM I/O at different rates? Or does the clock ensure that instructions varying only in operand size, at or below the processor's word size, take the same length of time?
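    I suppose a crude way to measure it would be something like this (not a rigorous benchmark; compiler flags and cache effects will dominate, so take it as a sketch):
    Code:
    #include <stdio.h>
    #include <time.h>

    #define N (1 << 22)   /* 4M elements per array */

    static unsigned char  a8[N];
    static unsigned short a16[N];
    static unsigned int   a32[N];

    /* Time a summing loop over one array. Writing the result to a
       volatile keeps the compiler from deleting the loop. */
    #define TIME_SUM(arr, label)                                  \
        do {                                                      \
            clock_t t0 = clock();                                 \
            unsigned long sum = 0;                                \
            for (long i = 0; i < N; ++i) sum += (arr)[i];         \
            volatile unsigned long sink = sum; (void)sink;        \
            printf("%s: %.3f s\n", (label),                       \
                   (double)(clock() - t0) / CLOCKS_PER_SEC);      \
        } while (0)

    int main(void)
    {
        TIME_SUM(a8,  "8-bit ");
        TIME_SUM(a16, "16-bit");
        TIME_SUM(a32, "32-bit");
        return 0;
    }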

  2. #2
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    I can't speak for other platforms but Windows is (mostly) based on the TCHAR which actually switches size with #define UNICODE, as do most of the API calls.

    In ASCII mode (Windows does not natively do UTF-8) a TCHAR is defined as a char, but with the UNICODE define it is an unsigned short int (or wchar_t on some compilers).

    These are 8-bit bytes and 16-bit words, which are loaded into CPU registers at the same speed...

    Windows API doesn't internally transform character sizes (except for Registry access, which is always 16-bit). Instead it uses separate function calls depending on character size... If you look in the Windows headers you will find (for example) ReadFileA() and ReadFileW(), with a #define ReadFile ReadFileA for ASCII and #define ReadFile ReadFileW for Unicode. The same is done with structs... there's an A and a W version of each.

    So unless there's some deficiency in the coding itself, you should not see a performance hit.
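    For example, here's a minimal sketch of how that plays out in practice (CreateFile is one of the real A/W macro pairs):
    Code:
    #include <windows.h>
    #include <tchar.h>      /* TCHAR, _T() */

    int main(void)
    {
        /* With UNICODE/_UNICODE defined, TCHAR is wchar_t and _T("...")
           makes a wide literal; without them, TCHAR is plain char. */
        const TCHAR *path = _T("C:\\test.txt");

        /* CreateFile is a macro that expands to CreateFileA or
           CreateFileW depending on whether UNICODE is defined. */
        HANDLE h = CreateFile(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h != INVALID_HANDLE_VALUE)
            CloseHandle(h);
        return 0;
    }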
    Last edited by CommonTater; 04-05-2011 at 11:10 AM.

  3. #3
    Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by Yarin
    Of course, when it comes to storage, I can see that UTF-8 is almost always the way to go. But it made me wonder, are there speed considerations to be had? Does a 32+ bit processor perform 8-, 16-, and 32-bit calculations or RAM I/O at different rates? Or does the clock ensure that instructions varying only in operand size, at or below the processor's word size, take the same length of time?
    All bus transactions are the same -- when you load a 16-bit word, the CPU actually loads a 32-bit word then throws away the upper 16 bits of the result.

    UTF-8 is space efficient but not programmer friendly. Even a simple task like calculating the length of a string becomes complicated, because not all characters are encoded to the same width. With a fixed-width encoding you can count the length of a string even if you don't know precisely what the encoding is. With UTF-8 that's out the window.
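    Even just counting the code points means skipping over continuation bytes, something like this (a quick sketch, assuming the input is valid UTF-8):
    Code:
    #include <stddef.h>

    /* Count code points in a NUL-terminated UTF-8 string.  Continuation
       bytes have the form 10xxxxxx, so count only the bytes that don't. */
    size_t utf8_codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; ++s)
            if (((unsigned char)*s & 0xC0) != 0x80)
                ++n;
        return n;
    }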
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  4. #4
    Registered User Codeplug
    Join Date
    Mar 2003
    Posts
    4,981
    >> With a fixed-width encoding you can count the length of a string even if you don't know precisely what the encoding is.
    Not really.

    UTF-16 requires two 16-bit code units to represent code points greater than U+FFFF.
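    The surrogate-pair arithmetic looks like this (U+1F600 is just an arbitrary example of a code point above U+FFFF):
    Code:
    #include <stdio.h>

    int main(void)
    {
        /* Encode U+1F600 as a UTF-16 surrogate pair. */
        unsigned cp = 0x1F600;
        unsigned v  = cp - 0x10000;            /* 20-bit offset   */
        unsigned hi = 0xD800 | (v >> 10);      /* high surrogate  */
        unsigned lo = 0xDC00 | (v & 0x3FF);    /* low surrogate   */
        printf("U+%X -> 0x%X 0x%X\n", cp, hi, lo);  /* 0xD83D 0xDE00 */
        return 0;
    }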

    UTF-32 is always 1 code unit per code point. But that doesn't help with the number of "glyphs" in the string - thanks to combining characters.

    gg

  5. #5
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Codeplug
    >> With a fixed-width encoding you can count the length of a string even if you don't know precisely what the encoding is.
    Not really.

    UTF-16 requires two 16-bit code units to represent code points greater than U+FFFF.
    Yes, of course... because UTF-16 is not a fixed-width encoding.

    As far as I know there are only two fixed-width encodings in current use... ASCII and UTF-32.

  6. #6
    'Allo, 'Allo, Allo
    Join Date
    Apr 2008
    Posts
    639
    Quote Originally Posted by CommonTater
    Windows API doesn't internally transform character sizes
    Every Windows API function that takes strings (even indirectly, such as ShellExecuteEx) thunks from the A version to the W version using heap allocations and MultiByteToWideChar or equivalent. The exception is WinInet, which thunks from W to A. It's not going to cause performance hotspots, but converting between encodings most definitely isn't free.
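    Roughly the kind of work an A-to-W thunk has to do (a simplified sketch; the real thunks are internal to the API):
    Code:
    #include <windows.h>
    #include <stdlib.h>

    /* Convert an ANSI string to a heap-allocated wide string, much like
       an A-to-W thunk does before calling the W function.  The caller
       frees the result with free(). */
    wchar_t *ansi_to_wide(const char *s)
    {
        int n = MultiByteToWideChar(CP_ACP, 0, s, -1, NULL, 0);
        if (n == 0)
            return NULL;
        wchar_t *w = malloc(n * sizeof *w);
        if (w)
            MultiByteToWideChar(CP_ACP, 0, s, -1, w, n);
        return w;
    }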

    If you look in the Windows headers you will find (for example) ReadFileA() and ReadFileW(), with a #define ReadFile ReadFileA for ASCII and #define ReadFile ReadFileW for Unicode.
    Sod's Law, isn't it. Out of all such functions you could have chosen from, ReadFile isn't one that does that.
    Last edited by adeyblue; 04-05-2011 at 01:58 PM.

  7. #7
    Unregistered User Yarin
    Join Date
    Jul 2007
    Posts
    2,158
    Thanks for the responses. Yes, I'm familiar with the A and W versions the Windows API has. In fact, I tend to use them directly in my code whenever I'm writing something that isn't intended to be usable in the other form (it can help clarify things a lot, actually). Also, UTF-32 or its equivalent is, as far as I know, the only fixed-length Unicode encoding. But in spite of this, the performance drop from dealing with it is usually quite minimal. I'm more concerned with the handling of characters in general.

    Quote Originally Posted by brewbuck
    All bus transactions are the same -- when you load a 16-bit word, the CPU actually loads a 32-bit word then throws away the upper 16 bits of the result
    This is what I was wondering. But how about memory? I know RAM access goes through the processor; does it take longer to manipulate RAM data the size of the bus width than it does to manipulate, say, a single byte?

    Windows API doesn't internally transform character sizes (except for Registry access, which is always 16-bit).
    Actually, Windows will directly store value and key names in either encoding (though perhaps not in the latest versions, I haven't checked them yet).

  8. #8
    Registered User Codeplug
    Join Date
    Mar 2003
    Posts
    4,981
    "Fixed width" is a loaded term.

    The way brewbuck first used it, I believe he was implying a 1-to-1 correspondence between the "glyph" and the code unit. In this context, no encoding of Unicode is "fixed width" (due to combining code points, format code points, etc).

    Not all code points represent a glyph.
    Glyphs can be represented by multiple code points.

    "Fixed width" could mean 1-to-1 correspondence between the code unit of the encoding and the encoded code point. In this context only UTF32 is "fixed width". (Unless Unicode ever breaks through the 32 bit threshold).

    gg

  9. #9
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Codeplug
    "Fixed width" is a loaded term.

    The way brewbuck first used it, I believe he was implying a 1-to-1 correspondence between the "glyph" and the code unit. In this context, no encoding of Unicode is "fixed width" (due to combining code points, format code points, etc).

    Not all code points represent a glyph.
    Glyphs can be represented by multiple code points.

    "Fixed width" could mean a 1-to-1 correspondence between the code unit of the encoding and the encoded code point. In this context only UTF-32 is "fixed width". (Unless Unicode ever breaks through the 32-bit threshold.)

    gg
    Some days I just ache for the good old days of 8-bit characters, no UAC bullcrap, and simple strings...

  10. #10
    Master Apprentice phantomotap
    Join Date
    Jan 2008
    Posts
    5,108
    Not all code points represent a glyph.
    Glyphs can be represented by multiple code points.
    Also, some code points represent multiple glyphs, some glyphs are multiple baselines wide, and some glyphs are positionally dependent based on the surrounding glyphs instead of just the code point.

    Internationalization for the win.

    Soma
