Size [in bytes] of unicode string

**gadu** · 06-21-2008

Hello

How can I find the size (in bytes) of a Unicode string?

Thanks

**matsp** · 06-21-2008

Here's one way that works:

Code:

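size_t bytesInWCharStr(const wchar_t *wstr)
{
  const wchar_t *p = wstr;

  /* Walk to the terminating zero. */
  while (*p != 0) p++;

  /* Byte distance between the terminator and the start of the string. */
  return (size_t)((const char *)p - (const char *)wstr);
}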

--
Mats

**Elysia** · 06-21-2008

Why not just wcslen?

Code:

size_t bytesInWCharStr(const wchar_t* wstr)
{
    return wcslen(wstr) * sizeof(wchar_t);
}
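
A minimal sketch of the wcslen approach as two standalone helpers. The names `wide_bytes` and `wide_bytes_with_terminator` are made up for illustration. Note that `sizeof(wchar_t)` is 2 on Windows but commonly 4 elsewhere, so the byte count depends on the platform, and that a buffer holding the string also needs room for the terminating `L'\0'`.

```c
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by the characters of a wide string, excluding the
   terminating L'\0'. */
size_t wide_bytes(const wchar_t *s)
{
    return wcslen(s) * sizeof(wchar_t);
}

/* Bytes needed to store the string, including the terminating L'\0'. */
size_t wide_bytes_with_terminator(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

For example, `wide_bytes(L"abc")` is `3 * sizeof(wchar_t)`, which is 6 where wchar_t is 2 bytes wide.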

**matsp** · 06-21-2008

Originally Posted by Elysia

Why not just wcslen?

Code:

size_t bytesInWCharStr(const wchar_t* wstr)
{
    return wcslen(wstr) * sizeof(wchar_t);
}

Yes, that would work. Although I'm not sure how wcslen() handles multi-word characters, which are part of the Unicode spec. That is why I posted the above (which is pretty much equivalent to wcslen() anyway; I don't believe wcslen() is heavily optimized).

--
Mats

**Elysia** · 06-21-2008

But it's not better than your own implementation...
Although, from what I know, wide characters are wide characters, and not necessarily multi-byte characters. Typically, char is used for that, and may require the use of functions other than strlen and wcslen, since they work on narrow/wide chars, respectively.
And I don't know if I dare say wcslen isn't heavily optimized... it's basically just what you did. Traverse the string until it finds L'\0'.

Real unicode implementation is much more difficult.

**matsp** · 06-21-2008

Originally Posted by Elysia

But it's not better than your own implementation...
Although, from what I know, wide characters are wide characters, and not necessarily multi-byte characters. Typically, char is used for that, and may require the use of functions other than strlen and wcslen, since they work on narrow/wide chars, respectively.
And I don't know if I dare say wcslen isn't heavily optimized... it's basically just what you did. Traverse the string until it finds L'\0'.

Real unicode implementation is much more difficult.

There are more than 65535 characters in the world, and Unicode itself aims to cover all of them. In UTF-16, a code point above U+FFFF is encoded as a surrogate pair: two 16-bit units, a high surrogate followed by a low surrogate, which together say "this character is not from the first 65536". UTF-8 uses a similar idea with lead and continuation bytes, except that only the first 128 code points fit in a single byte, so multi-byte sequences are far more common there. UTF-32 is "linear": all characters fit in 4 bytes.

But I think you are right that wcslen() is not "clever enough" to actually care about those multi-unit characters. It just counts the pair as two independent characters; only the display function that converts a character code into a rasterized glyph will care about the encoding of the characters.
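
To make the difference concrete, here is a small sketch that counts both 16-bit units and code points in a UTF-16 string. It uses `uint16_t` rather than wchar_t so the result does not depend on the platform's wchar_t width; the function names are illustrative, and the code point count assumes well-formed UTF-16 (every low surrogate is preceded by a high surrogate).

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit code units before the terminating 0. This is what
   wcslen() reports on a platform where wchar_t is 16 bits wide. */
size_t utf16_units(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of code points: count every unit except low surrogates
   (0xDC00..0xDFFF), which are only ever the second half of a pair.
   Assumes the input is well-formed UTF-16. */
size_t utf16_code_points(const uint16_t *s)
{
    size_t n = 0;
    for (; *s != 0; s++) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            n++;
    }
    return n;
}
```

For the sequence 0x6C34, 0x007A, 0xD834, 0xDD1E (water, z, G clef) the first function gives 4 units, i.e. 8 bytes, while the second gives 3 code points. The size in bytes always follows the unit count.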

Sometimes I make more work than I should for myself... ;-)

--
Mats

**Elysia** · 06-21-2008

I'm aware of how multi-byte unicode specifications work, but...
I believe there's a reason why Visual Studio has ANSI, Unicode and Multi-byte configurations.
Unicode seems like a fixed-width version of UTF-16 on Windows.
To use real Unicode, you may have to choose multi-byte and use other functions to process the strings. Windows has several functions to deal with multi-byte strings.

**gadu** · 06-21-2008

Hello

I tried the wcslen example, but it did not work.

Code:

#include<windows.h>
#include<stdio.h>
#include<stdlib.h>
#include<conio.h>

int main()
{

LPWSTR Wide;
char str[1024];
printf("Enter string: ");
gets(str);

printf("String Size(ANSI) in bytes: %d",strlen(str));

if(!MultiByteToWideChar(CP_ACP,MB_PRECOMPOSED,str,strlen(str),Wide,wcslen(Wide)))
{
printf("\n\n\n Conversion Error");
getch();
}

printf("\nString size(Unicode) in bytes: %d",wcslen(Wide)* sizeof(wchar_t));


getch();


}

The program gives the size of the ANSI string, but for the Unicode string the size is 0.

Why does that happen?
Thanks

**dwks** · 06-21-2008

gets() is a bad idea. cpwiki.sf.net/gets
getch() is also a bad idea, due to its unportability.
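
One common replacement for gets() is fgets(), which takes the buffer size and so cannot write past the end of the buffer. A minimal sketch follows; `read_line` is a made-up helper name, not a library function.

```c
#include <stdio.h>
#include <string.h>

/* Read one line from `in` into buf (at most size - 1 characters) and
   strip the trailing newline, if any. Returns 1 on success, 0 on end of
   file or read error. */
int read_line(FILE *in, char *buf, size_t size)
{
    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}
```

In the program above, `gets(str);` would become `read_line(stdin, str, sizeof str);`.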

Anyway, I think you encounter the results you do because of this:

Code:

if(!MultiByteToWideChar(CP_ACP,MB_PRECOMPOSED,str,strlen(str),Wide,wcslen(Wide)))

wcslen(Wide) isn't a meaningful value at this point, because Wide is just an uninitialized pointer. Right? (I'm never sure with these Hungarian notation things . . . .)

**Elysia** · 06-21-2008

Learn to indent.
The thing is that wcslen, just as strlen, tries to find the length of the actual string, not the buffer.
Therefore, use sizeof.
However, LPWSTR is a pointer (!), which is not a storage unit, and so you get errors.
http://cpwiki.sourceforge.net/Common...kes_and_errors
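
A short sketch of the array versus pointer distinction, with illustrative names. sizeof gives the size of the whole buffer only while the name still has array type; once the buffer is passed to a function it is just a pointer.

```c
#include <stddef.h>
#include <wchar.h>

/* Capacity of a wide-character array, in elements. This has to be a
   macro: sizeof only sees the whole array while the name still has
   array type. */
#define WIDE_CAPACITY(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Inside a function the parameter is a plain pointer, so sizeof yields
   the size of the pointer itself, not of any buffer it points to. */
size_t sizeof_as_pointer(const wchar_t *p)
{
    return sizeof(p);
}
```

With `wchar_t buf[1024];`, `WIDE_CAPACITY(buf)` is 1024, which is the kind of value the last argument of MultiByteToWideChar expects, while `sizeof_as_pointer(buf)` is only the size of a pointer.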

**Cactus_Hugger** · 06-22-2008

In UTF-16, a single character may occupy more than 2 bytes. (See the Wikipedia article on UTF-16: the G-clef example part way down, and the information on surrogate pairs.) Despite this, for the three-character example string on Wikipedia (the character for water, z, G clef), wcslen() returns 4 on both Windows and FreeBSD. I'm not entirely sure about this, but MSDN notes that wcslen() is the wide-char version of strlen(), so the output should be 4, despite there being 3 characters. My code:

Code:

#include <stdio.h>
#include <wchar.h>

int main()
{
	// Asian char ('water'), 'z', and G clef (4 bytes)
	wchar_t str[] = {0x6C34, 0x007A, 0xD834, 0xDD1E, 0x0000};
	
	printf("Length: %zu\n", wcslen(str));
	return 0;
}

Now, as for the OP's problem. As others have said, LPWSTR is just an obfuscated pointer, and here it is not pointing to anything. We need to allocate memory for it. MultiByteToWideChar makes this easy, thankfully, as we can have it figure out how much memory we need:

Code:

int length;
LPWSTR Wide;
length = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, str, -1, NULL, 0);
Wide = malloc(sizeof(WCHAR) * length);
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, str, -1, Wide, length);

Also, note that I pass -1 instead of strlen(str). If you pass strlen(str), the resulting Wide string will not be null-terminated! (See the MSDN docs.) If you pass -1, MBTWC will calculate the length and null-terminate the result (i.e. the normal, expected behavior).
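
The same two-pass pattern (ask for the required length first, then allocate and convert) also exists in standard C. The sketch below swaps in `mbsrtowcs` for MultiByteToWideChar, so it is a portable analogue and not the Win32 call itself; it converts from the current locale's encoding, and `to_wide` is a hypothetical helper name.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte string (in the current locale's encoding) to a
   newly allocated wide string. Returns NULL on failure; the caller
   frees the result. */
wchar_t *to_wide(const char *str)
{
    mbstate_t state;
    const char *src = str;
    size_t count;
    wchar_t *wide;

    /* Pass 1: with a NULL destination only the required length is
       computed (in wide characters, not counting the terminator). */
    memset(&state, 0, sizeof state);
    count = mbsrtowcs(NULL, &src, 0, &state);
    if (count == (size_t)-1)
        return NULL;            /* invalid multibyte sequence */

    wide = malloc((count + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;

    /* Pass 2: convert for real, with room for the terminator. */
    memset(&state, 0, sizeof state);
    src = str;
    if (mbsrtowcs(wide, &src, count + 1, &state) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;
}
```

After `wchar_t *w = to_wide(str);` succeeds, the size in bytes is `wcslen(w) * sizeof(wchar_t)`, exactly as earlier in the thread, and `free(w)` releases the buffer.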

**Elysia** · 06-22-2008

Older Windows NT systems (prior to Windows 2000) only support UCS-2. In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages, possibly with Chinese Windows versions.

It doesn't say for sure if XP supports UTF-16 or just UCS-2, though.
I know Windows still does not support UTF-8.
Btw, Microsoft's implementation of wcslen is:

Code:

        const wchar_t *eos = wcs;
        while( *eos++ ) ;
        return( (size_t)(eos - wcs - 1) );

So it doesn't take multi-byte into account at all, but simply counts the number of wide characters (wchar_t units, which are 2 bytes each on Windows).

**Codeplug** · 06-22-2008

>> It doesn't say for sure if XP supports UTF-16 or just UCS-2, though.
MSDN states that 2000 and up supports characters outside the "basic multilingual plane" (BMP) - and therefore supports Unicode, using the encoding UTF-16LE.
http://msdn.microsoft.com/en-us/libr...14(VS.85).aspx

Originally Posted by Elysia

Unicode seems like a fixed-width version of UTF-16 on Windows.
To use real Unicode, you may have to choose multi-byte and use other functions to process the strings. Windows has several functions to deal with multi-byte strings.

The support for UTF-16 in Windows (2K and up) is support for "real Unicode". Prior to 2K, "UCS-2" would be the more appropriate term, since those versions only support the BMP.

"MBCS", as windows uses the term, is for support of pre-Unicode character sets - mainly for Asian languages. MSDN even suggests that "new applications" use Unicode instead of MBCS. (Another reason why I advocate *not* using the TCHAR mechanism - just go wide.)
http://msdn.microsoft.com/en-us/libr...54(VS.85).aspx

I think we're all in agreement that wcslen() returns the number of wchar_t's, and not the number of "code points", or "real language characters". If it did anything else, they would have to give it a different name, due to the C standard and all.

gg
