Hello
How can I find the size (in bytes) of a Unicode string?
Thanks
Here's one way that works:
Code:
offset_t bytesInWCharStr(wchar_t *wstr)
{
    offset_t len;
    wchar_t *p = wstr;

    while(*p != 0)
        p++;
    len = (char *)p - (char *)wstr;
    return len;
}
--
Mats
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
But it's not better than your own implementation...
Although, from what I know, wide characters are wide characters, and not necessarily multi-byte characters. Typically, char is used for that, and may require the use of functions other than strlen and wcslen, since they work on narrow/wide chars, respectively.
And I don't know if I dare say wcslen isn't heavily optimized... it's basically just what you did. Traverse the string until it finds L'\0'.
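To make the comparison concrete, here is a minimal sketch of both approaches side by side. The function names wide_string_bytes and wide_string_bytes_manual are made up for this illustration. Both return the size in bytes without the terminating null, and the result depends on sizeof(wchar_t), which is typically 2 on Windows and 4 on many Unix-like systems.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: size in bytes of a null-terminated wide string,
   not counting the terminator. Uses the library routine. */
size_t wide_string_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s) * sizeof(wchar_t);
}

/* Hypothetical helper: same result, found by walking the string by hand
   and measuring the distance in bytes between start and terminator. */
size_t wide_string_bytes_manual(const wchar_t *s)
{
    const wchar_t *p = s;

    assert(s != NULL);
    while (*p != L'\0')
        p++;
    return (size_t)((const char *)p - (const char *)s);
}
```

If the terminator should be included (for example when sizing a buffer), add sizeof(wchar_t) to either result.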
Real unicode implementation is much more difficult.
There are more than 65535 characters in the world, and unicode itself covers all of them. There are special prefix operations that say "the next character is not from the first 65535 characters, but from set number X". This works the same in UTF-8 as it does in UTF-16, except of course there are more UTF-8 "not from this set of 255" than there are "not from this set of 65535". UTF-32 is "linear", all characters fit in 4 bytes.
But I think you are right that wcslen() is not "clever enough" to actually care about those multi-character characters - it just counts the set as two independent characters. Only the display function that converts a character code into a rasterized font will care about the encoding of the characters.
Sometimes I make more work than I should for myself... ;-)
--
Mats
I'm aware of how multi-byte unicode specifications work, but...
I believe there's a reason why Visual Studio has ANSI, Unicode and Multi-byte configurations.
Unicode seems like a fixed-width version of UTF-16 on Windows.
To use real Unicode, you may have to choose multi-byte and use other functions to process the strings. Windows has several functions to deal with multi-byte strings.
Hello
I tried the wcslen example, but it did not work.
Code:
#include<windows.h>
#include<stdio.h>
#include<stdlib.h>
#include<conio.h>

int main()
{
    LPWSTR Wide;
    char str[1024];

    printf("Enter string: ");
    gets(str);
    printf("String Size(ANSI) in bytes: %d",strlen(str));
    if(!MultiByteToWideChar(CP_ACP,MB_PRECOMPOSED,str,strlen(str),Wide,wcslen(Wide)))
    {
        printf("\n\n\n Conversion Error");
        getch();
    }
    printf("\nString size(Unicode) in bytes: %d",wcslen(Wide)* sizeof(wchar_t));
    getch();
}
The program gives the size of an ANSI string, but when the string is Unicode the size is 0.
Why does that happen?
Thanks
gets() is a bad idea. cpwiki.sf.net/gets
getch() is also a bad idea, due to its unportability.
Anyway, I think you encounter the results you do because of this:
Code:
if(!MultiByteToWideChar(CP_ACP,MB_PRECOMPOSED,str,strlen(str),Wide,wcslen(Wide)))
wcslen(Wide) isn't a meaningful value at this point, because Wide is just an uninitialized pointer. Right? (I'm never sure with these Hungarian notation things . . . .)
dwk
Seek and ye shall find. quaere et invenies.
"Simplicity does not precede complexity, but follows it." -- Alan Perlis
"Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
"The only real mistake is the one from which we learn nothing." -- John Powell
Other boards: DaniWeb, TPS
Unofficial Wiki FAQ: cpwiki.sf.net
My website: http://dwks.theprogrammingsite.com/
Projects: codeform, xuni, atlantis, nort, etc.
Learn to indent.
The thing is that wcslen, just as strlen, tries to find the length of the actual string, not the buffer.
Therefore, use sizeof.
However, LPWSTR is a pointer (!), which is not a storage unit, and so you get errors.
http://cpwiki.sourceforge.net/Common...kes_and_errors
In UTF-16, a single character may occupy more than 2 bytes. (UTF-16, see the G-clef example part way down, and the info on surrogate pairs.) Despite this, using the example string on Wikipedia (the string water, z, G clef), wcslen() returns 4 for the three-character example on both Windows and FreeBSD. My code:
Code:
#include <stdio.h>
#include <wchar.h>

int main()
{
    // Asian char ('water'), 'z', and G clef (4 bytes)
    wchar_t str[] = {0x6C34, 0x007A, 0xD834, 0xDD1E, 0x0000};

    printf("Length: %d\n", (int)wcslen(str));
    return 0;
}
(I'm not entirely sure about this, but MSDN notes that wcslen is the wide-char version of strlen(), so the output should be 4, despite there being 3 characters.)
Now, as for the OP's problem. As others have said, LPWSTR is just an obfuscated pointer, and here it is not pointing to anything. We need to allocate memory for it. MultiByteToWideChar makes this easy, thankfully, as we can have it figure out how much memory we need:
Code:
int length;
LPWSTR Wide;

length = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, str, -1, NULL, 0);
Wide = malloc(sizeof(WCHAR) * length);
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, str, -1, Wide, length);
Also, note that I pass -1 instead of strlen(str). If you pass strlen(str), the resulting Wide string will not be null-terminated! (See MSDN docs.) If you pass -1, MBTWC will calculate the length and null-terminate (i.e., normal, expected behavior).
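For anyone not on Windows, the same "ask for the size first, then allocate, then convert" pattern can be sketched with the standard C function mbsrtowcs. This is only a portable analogue, not the Windows API, and the helper name to_wide is made up for this example. It converts according to the current C locale, so non-ASCII input needs a suitable setlocale() call first.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (current C locale) to a
   newly allocated, null-terminated wide string.
   Returns NULL on failure; the caller must free() the result. */
wchar_t *to_wide(const char *src)
{
    mbstate_t state;
    const char *p = src;
    size_t n;
    wchar_t *dst;

    assert(src != NULL);

    /* Pass 1: a NULL destination only counts the wide characters needed
       (the count does not include the terminator). */
    memset(&state, 0, sizeof state);
    n = mbsrtowcs(NULL, &p, 0, &state);
    if (n == (size_t)-1)
        return NULL;

    dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Pass 2: convert for real, leaving room for the terminator. */
    p = src;
    memset(&state, 0, sizeof state);
    if (mbsrtowcs(dst, &p, n + 1, &state) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

The size in bytes of the result (without the terminator) is then wcslen(w) * sizeof(wchar_t), as discussed earlier in the thread.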
long time; /* know C? */
Unprecedented performance: Nothing ever ran this slow before.
Any sufficiently advanced bug is indistinguishable from a feature.
Real Programmers confuse Halloween and Christmas, because dec 25 == oct 31.
The best way to accelerate an IBM is at 9.8 m/s/s.
recursion (re - cur' - zhun) n. 1. (see recursion)
"Older Windows NT systems (prior to Windows 2000) only support UCS-2. In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages, possibly with Chinese Windows versions."
It doesn't say for sure if XP supports UTF-16 or just UCS-2, though.
I know Windows still does not support UTF-8.
Btw, Microsoft's implementation of wcslen is:
Code:
const wchar_t *eos = wcs;

while( *eos++ ) ;

return( (size_t)(eos - wcs - 1) );
So it doesn't take into account multi-byte at all, but simply counts the number of wide characters (characters that are 2 bytes).
>> It doesn't say for sure if XP supports UTF-16 or just UCS-2, though.
MSDN states that 2000 and up supports characters outside the "basic multilingual plane" (BMP) - and therefore supports Unicode, using the encoding UTF-16LE.
http://msdn.microsoft.com/en-us/libr...14(VS.85).aspx
The support for UTF16 in Windows (2K and up) is support for "real Unicode". Prior to 2k, "UCS-2" would be the more appropriate term since they only support the BMP.
"MBCS", as windows uses the term, is for support of pre-Unicode character sets - mainly for Asian languages. MSDN even suggests that "new applications" use Unicode instead of MBCS. (Another reason why I advocate *not* using the TCHAR mechanism - just go wide.)
http://msdn.microsoft.com/en-us/libr...54(VS.85).aspx
I think we're all in agreement that wcslen() returns the number of wchar_t's, and not the number of "code points", or "real language characters". If it did anything else, they would have to give it a different name - due to the C standard and all.
gg
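To illustrate the difference between counting wchar_t's and counting code points, here is a rough sketch that works on explicit 16-bit units (uint16_t), so it behaves the same whether wchar_t is 2 or 4 bytes on your platform. The names utf16_units and utf16_code_points are invented for this example, and the code assumes the input is a 0-terminated UTF-16 sequence. An unpaired surrogate is simply counted as one code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit code units before the terminating 0.
   This is what wcslen() reports when wchar_t is 16 bits wide. */
size_t utf16_units(const uint16_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: number of code points. A high surrogate
   (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF)
   counts as a single code point. */
size_t utf16_code_points(const uint16_t *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != 0) {
        if (*s >= 0xD800 && *s <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;   /* surrogate pair: two units, one code point */
        else
            s += 1;   /* BMP character (or unpaired surrogate) */
        count++;
    }
    return count;
}
```

With the water, z, G clef string from earlier in the thread, utf16_units gives 4 and utf16_code_points gives 3. The size in bytes is still utf16_units(s) * sizeof(uint16_t), since storage depends on code units, not code points.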