Thread: Unicode character size - performance

  1. #1
    Unregistered User Yarin
    Join Date
    Jul 2007
    Posts
    2,158

    Unicode character size - performance

    When programming for Windows it doesn't take too long to find out that the preferred character size is 2 bytes (UTF-16LE). Generally, when programming for *nix I've only ever used 1-byte characters (that is, often UTF-8, not necessarily just ASCII). But when programming a wxWidgets application for Linux, I discovered pretty quickly that it was using 4-byte characters!

    Of course, when it comes to storage, I can see that UTF-8 is almost always the way to go. But it made me wonder, are there speed considerations to be had? Does a 32+ bit processor perform 8-, 16-, and 32-bit calculations or RAM I/O at different rates? Or does the clock ensure that instructions varying only in operand size, at or below the processor's word size, take the same length of time?
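    I suppose a crude way to measure it would be something like this (not a rigorous benchmark; compiler flags and cache effects will dominate, so take it as a sketch):
    Code:
    #include <stdio.h>
    #include <time.h>

    #define N (1 << 22)   /* 4M elements per array */

    static unsigned char  a8[N];
    static unsigned short a16[N];
    static unsigned int   a32[N];

    /* Time a summing loop over one array. Writing the result to a
       volatile keeps the compiler from deleting the loop. */
    #define TIME_SUM(arr, label)                                  \
        do {                                                      \
            clock_t t0 = clock();                                 \
            unsigned long sum = 0;                                \
            for (long i = 0; i < N; ++i) sum += (arr)[i];         \
            volatile unsigned long sink = sum; (void)sink;        \
            printf("%s: %.3f s\n", (label),                       \
                   (double)(clock() - t0) / CLOCKS_PER_SEC);      \
        } while (0)

    int main(void)
    {
        TIME_SUM(a8,  "8-bit ");
        TIME_SUM(a16, "16-bit");
        TIME_SUM(a32, "32-bit");
        return 0;
    }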

  2. #2
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    I can't speak for other platforms but Windows is (mostly) based on the TCHAR which actually switches size with #define UNICODE, as do most of the API calls.

    In ASCII mode (Windows does not natively do UTF-8) a TCHAR is defined as a char, but with the UNICODE define it is an unsigned short int (or wchar_t on some compilers).

    These are 8-bit bytes and 16-bit words, which are loaded into CPU registers at the same speed...

    Windows API doesn't internally transform character sizes (except for Registry access, which is always 16-bit). Instead it uses separate function calls depending on character size... If you look in the Windows headers you will find (for example) ReadFileA() and ReadFileW(), with a #define ReadFile ReadFileA for ASCII and #define ReadFile ReadFileW for Unicode. The same is done with structs... there's an A and a W version of each.

    So unless there's some deficiency in the coding itself, you should not see a performance hit.
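    For example, here's a minimal sketch of how that plays out in practice (CreateFile is one of the real A/W macro pairs):
    Code:
    #include <windows.h>
    #include <tchar.h>      /* TCHAR, _T() */

    int main(void)
    {
        /* With UNICODE/_UNICODE defined, TCHAR is wchar_t and _T("...")
           makes a wide literal; without them, TCHAR is plain char. */
        const TCHAR *path = _T("C:\\test.txt");

        /* CreateFile is a macro that expands to CreateFileA or
           CreateFileW depending on whether UNICODE is defined. */
        HANDLE h = CreateFile(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h != INVALID_HANDLE_VALUE)
            CloseHandle(h);
        return 0;
    }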
    Last edited by CommonTater; 04-05-2011 at 11:10 AM.

  3. #3
    Officially An Architect brewbuck
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by Yarin
    Of course, when it comes to storage, I can see that UTF-8 is almost always the way to go. But it made me wonder, are there speed considerations to be had? Does a 32+ bit processor perform 8-, 16-, and 32-bit calculations or RAM I/O at different rates? Or does the clock ensure that instructions varying only in operand size, at or below the processor's word size, take the same length of time?
    All bus transactions are the same -- when you load a 16-bit word, the CPU actually loads a 32-bit word then throws away the upper 16 bits of the result.

    UTF-8 is space efficient but not programmer friendly. Even a simple task like calculating the length of a string becomes complicated, because not all characters are encoded to the same width. With a fixed-width encoding you can count the length of a string even if you don't know precisely what the encoding is. With UTF-8 that's out the window.
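    Even just counting the code points means skipping over continuation bytes, something like this (a quick sketch, assuming the input is valid UTF-8):
    Code:
    #include <stddef.h>

    /* Count code points in a NUL-terminated UTF-8 string.  Continuation
       bytes have the form 10xxxxxx, so count only the bytes that don't. */
    size_t utf8_codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; ++s)
            if (((unsigned char)*s & 0xC0) != 0x80)
                ++n;
        return n;
    }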
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  4. #4
    Registered User Codeplug
    Join Date
    Mar 2003
    Posts
    4,981
    >> With a fixed-width encoding you can count the length of a string even if you don't know precisely what the encoding is.
    Not really.

    UTF-16 requires two 16-bit code units to represent code points greater than U+FFFF.
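    The surrogate-pair arithmetic looks like this (U+1F600 is just an arbitrary example of a code point above U+FFFF):
    Code:
    #include <stdio.h>

    int main(void)
    {
        /* Encode U+1F600 as a UTF-16 surrogate pair. */
        unsigned cp = 0x1F600;
        unsigned v  = cp - 0x10000;            /* 20-bit offset   */
        unsigned hi = 0xD800 | (v >> 10);      /* high surrogate  */
        unsigned lo = 0xDC00 | (v & 0x3FF);    /* low surrogate   */
        printf("U+%X -> 0x%X 0x%X\n", cp, hi, lo);  /* 0xD83D 0xDE00 */
        return 0;
    }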

    UTF-32 is always 1 code unit per code point. But that doesn't help with the number of "glyphs" in the string - thanks to combining characters.

    gg

  5. #5
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Codeplug
    >> With a fixed-width encoding you can count the length of a string even if you don't know precisely what the encoding is.
    Not really.

    UTF-16 requires two 16-bit code units to represent code points greater than U+FFFF.
    Yes, of course... because UTF-16 is not a fixed-width encoding.

    As far as I know there are only two fixed-width encodings in current use... ASCII and UTF-32.

  6. #6
    'Allo, 'Allo, Allo
    Join Date
    Apr 2008
    Posts
    639
    Quote Originally Posted by CommonTater
    Windows API doesn't internally transform character sizes
    Every Windows API function that takes strings (even indirectly, such as ShellExecuteEx) thunks from the A version to the W version using heap allocations and MultiByteToWideChar or equivalent. The exception is WinInet, which thunks from W to A. It's not going to cause performance hotspots, but converting between encodings most definitely isn't free.
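    Roughly the kind of work an A-to-W thunk has to do (a simplified sketch; the real thunks are internal to the API):
    Code:
    #include <windows.h>
    #include <stdlib.h>

    /* Convert an ANSI string to a heap-allocated wide string, much like
       an A-to-W thunk does before calling the W function.  The caller
       frees the result with free(). */
    wchar_t *ansi_to_wide(const char *s)
    {
        int n = MultiByteToWideChar(CP_ACP, 0, s, -1, NULL, 0);
        if (n == 0)
            return NULL;
        wchar_t *w = malloc(n * sizeof *w);
        if (w)
            MultiByteToWideChar(CP_ACP, 0, s, -1, w, n);
        return w;
    }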

    If you look in the Windows headers you will find (for example) ReadFileA() and ReadFileW(), with a #define ReadFile ReadFileA for ASCII and #define ReadFile ReadFileW for Unicode.
    Sod's Law, isn't it. Out of all such functions you could have chosen from, ReadFile isn't one that does that.
    Last edited by adeyblue; 04-05-2011 at 01:58 PM.

  7. #7
    Unregistered User Yarin
    Join Date
    Jul 2007
    Posts
    2,158
    Thanks for the responses. Yes, I'm familiar with the A and W versions the Windows API has. In fact, I tend to use them directly in my code whenever I'm writing something that isn't intended to be usable in the other form (it can help clarify things a lot, actually). Also, UTF-32 or its equivalent is, as far as I know, the only fixed-length Unicode encoding. But in spite of this, the performance drop from dealing with it is usually quite minimal. I'm more concerned with the handling of characters in general.

    Quote Originally Posted by brewbuck
    All bus transactions are the same -- when you load a 16-bit word, the CPU actually loads a 32-bit word then throws away the upper 16 bits of the result
    This is what I was wondering. But how about memory? I know RAM access goes through the processor; does it take longer to manipulate RAM data the size of the bus width than it does to manipulate, say, a single byte?

    Windows API doesn't internally transform character sizes (except for Registry access, which is always 16-bit).
    Actually, Windows will directly store value and key names in either encoding (though perhaps not in the latest versions, I haven't checked them yet).

  8. #8
    Registered User Codeplug
    Join Date
    Mar 2003
    Posts
    4,981
    "Fixed width" is a loaded term.

    The way brewbuck first used it, I believe he was implying a 1-to-1 correspondence between the "glyph" and the code unit. In this context, no encoding of Unicode is "fixed width" (due to combining code points, format code points, etc).

    Not all code points represent a glyph.
    Glyphs can be represented by multiple code points.

    "Fixed width" could mean 1-to-1 correspondence between the code unit of the encoding and the encoded code point. In this context only UTF32 is "fixed width". (Unless Unicode ever breaks through the 32 bit threshold).

    gg

  9. #9
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Codeplug
    "Fixed width" is a loaded term.

    The way brewbuck first used it, I believe he was implying a 1-to-1 correspondence between the "glyph" and the code unit. In this context, no encoding of Unicode is "fixed width" (due to combining code points, format code points, etc).

    Not all code points represent a glyph.
    Glyphs can be represented by multiple code points.

    "Fixed width" could mean a 1-to-1 correspondence between the code unit of the encoding and the encoded code point. In this context only UTF-32 is "fixed width". (Unless Unicode ever breaks through the 32-bit threshold.)

    gg
    Some days I just ache for the good old days of 8-bit characters, no UAC bullcrap, and simple strings...

  10. #10
    Master Apprentice phantomotap
    Join Date
    Jan 2008
    Posts
    5,108
    Not all code points represent a glyph.
    Glyphs can be represented by multiple code points.
    Also, some code points represent multiple glyphs, some glyphs are multiple baselines wide, and some glyphs are positionally dependent based on the surrounding glyphs instead of just the code point.

    Internationalization for the win.

    Soma
