Thread: Size [in bytes] of unicode string

  1. #1
    Registered User
    Join Date
    May 2008
    Posts
    39

    Size [in bytes] of unicode string

    Hello

    How can I find the size (in bytes) of a Unicode string?

    Thanks

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Here's one way that works:
    Code:
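    size_t bytesInWCharStr(const wchar_t *wstr)
    {
      /* Walk to the terminating L'\0'. */
      const wchar_t *p = wstr;
      while (*p != 0) p++;
    
      /* Pointer difference in bytes; the terminator is not counted. */
      return (size_t)((const char *)p - (const char *)wstr);
    }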
    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Why not just wcslen?
    Code:
    size_t bytesInWCharStr(const wchar_t* wstr)
    {
        /* Number of wide characters times the size of each one; needs <wchar.h>. */
        return wcslen(wstr) * sizeof(wchar_t);
    }
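    A minimal sketch of the same idea, also showing the storage size including the terminator. The names wide_bytes and wide_storage_bytes are made up for illustration; sizeof(wchar_t) is typically 2 on Windows and 4 on most Unix-like systems, so the results differ by platform.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by the characters of a wide string, excluding the terminator. */
size_t wide_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s) * sizeof(wchar_t);
}

/* Bytes needed to store the string, including the terminating L'\0'. */
size_t wide_storage_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

    For L"abc" the first function gives 3 * sizeof(wchar_t); use the second when sizing a buffer to copy into.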
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Elysia View Post
    Why not just wcslen?
    Code:
    size_t bytesInWCharStr(const wchar_t* wstr)
    {
        return wcslen(wstr) * sizeof(wchar_t);
    }
    Yes, that would work. I'm not sure how wcslen() handles multi-word characters, which are part of the Unicode spec, which is why I posted the above. (It is pretty much equivalent to wcslen() anyway; I don't believe wcslen is heavily optimized.)


  5. #5
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    But it's not better than your own implementation...
    Although, from what I know, wide characters are wide characters, and not necessarily multi-byte characters. Typically, char is used for that, and may require the use of functions other than strlen and wcslen, since they work on narrow/wide chars, respectively.
    And I don't know if I dare say wcslen isn't heavily optimized... it's basically just what you did. Traverse the string until it finds L'\0'.

    Real unicode implementation is much more difficult.

  6. #6
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Elysia View Post
    But it's not better than your own implementation...
    Although, from what I know, wide characters are wide characters, and not necessarily multi-byte characters. Typically, char is used for that, and may require the use of functions other than strlen and wcslen, since they work on narrow/wide chars, respectively.
    And I don't know if I dare say wcslen isn't heavily optimized... it's basically just what you did. Traverse the string until it finds L'\0'.

    Real unicode implementation is much more difficult.
    Unicode defines far more than 65,536 characters, so a single 16-bit unit cannot hold every one of them. In UTF-16, a character outside the first 65,536 (the Basic Multilingual Plane) is stored as a surrogate pair: two 16-bit units that together identify one character. UTF-8 works on the same principle with 8-bit units, except that many more characters need a multi-unit sequence, since only the first 128 fit in one byte. UTF-32 is "linear": every character fits in 4 bytes.

    But I think you are right that wcslen() is not "clever enough" to care about those multi-unit characters. It just counts the pair as two independent characters; only the display function that converts a character code into a rasterized glyph will care about the encoding of the characters.
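    To make the size differences concrete, here is a rough sketch that reports how many bytes one code point takes in each encoding. The function names are hypothetical, and the code assumes a valid code point (it does not reject the surrogate range U+D800 to U+DFFF, which is not a character on its own).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8 (1 to 4). */
size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* UTF-16: 2 bytes inside the BMP, 4 bytes (a surrogate pair) above it. */
size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}

/* UTF-32: always 4 bytes. */
size_t utf32_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return 4;
}
```

    For example, 'z' (U+007A) takes 1, 2 and 4 bytes respectively, while the G clef (U+1D11E) takes 4 bytes in all three.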

    Sometimes I make more work than I should for myself... ;-)


  7. #7
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    I'm aware of how multi-byte unicode specifications work, but...
    I believe there's a reason why Visual Studio has ANSI, Unicode and Multi-byte configurations.
    Unicode seems like a fixed-width version of UTF-16 on Windows.
    To use real Unicode, you may have to choose multi-byte and use other functions to process the strings. Windows has several functions to deal with multi-byte strings.

  8. #8
    Registered User
    Join Date
    May 2008
    Posts
    39
    Hello


    I tried the wcslen example, but it did not work.


    Code:
    #include<windows.h>
    #include<stdio.h>
    #include<stdlib.h>
    #include<conio.h>
    
    int main()
    {
    
    LPWSTR Wide;
    char str[1024];
    printf("Enter string: ");
    gets(str);
    
    printf("String Size(ANSI) in bytes: %d",strlen(str));
    
    if(!MultiByteToWideChar(CP_ACP,MB_PRECOMPOSED,str,strlen(str),Wide,wcslen(Wide)))
    {
    printf("\n\n\n Conversion Error");
    getch();
    }
    
    printf("\nString size(Unicode) in bytes: %d",wcslen(Wide)* sizeof(wchar_t));
    
    
    getch();
    
    
    }


    The program gives the size of an ANSI string, but when the string is Unicode the size is 0.

    Why does that happen?
    Thanks

  9. #9
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    gets() is a bad idea. cpwiki.sf.net/gets
    getch() is also a bad idea, due to its unportability.
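    A common replacement for gets() is fgets(), which takes the buffer capacity and so cannot overrun it. A small sketch follows; read_line is a hypothetical helper name, and a line longer than the buffer is simply left partly unread here.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Reads one line from `in` into buf (capacity cap), dropping the trailing
   newline. Returns 1 on success, 0 on end of file or read error. */
int read_line(FILE *in, char *buf, size_t cap)
{
    assert(in != NULL && buf != NULL && cap > 0 && cap <= INT_MAX);
    if (fgets(buf, (int)cap, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';  /* cut at the newline, if there is one */
    return 1;
}
```

    With char str[1024], the call would be read_line(stdin, str, sizeof str).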

    Anyway, I think you encounter the results you do because of this:
    Code:
    if(!MultiByteToWideChar(CP_ACP,MB_PRECOMPOSED,str,strlen(str),Wide,wcslen(Wide)))
    wcslen(Wide) isn't a meaningful value at this point, because Wide is just an uninitialized pointer. Right? (I'm never sure with these Hungarian notation things . . . .)
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  10. #10
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Learn to indent.
    The thing is that wcslen, just as strlen, tries to find the length of the actual string, not the buffer.
    Therefore, use sizeof.
    However, LPWSTR is a pointer (!), which is not a storage unit, and so you get errors.
    http://cpwiki.sourceforge.net/Common...kes_and_errors
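    To illustrate the array versus pointer difference, here is a sketch with a made-up capacity of 256. Note that MultiByteToWideChar expects the output size in wide characters, not bytes, so the byte count from sizeof has to be divided by the element size.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define WIDE_CAPACITY 256   /* hypothetical capacity, chosen for illustration */

/* Element count of a real array: total bytes divided by bytes per element. */
#define WIDE_COUNT(arr) (sizeof(arr) / sizeof((arr)[0]))

/* sizeof on an array sees the whole buffer. */
size_t array_capacity(void)
{
    wchar_t wide[WIDE_CAPACITY];
    assert(sizeof wide == WIDE_CAPACITY * sizeof(wchar_t));
    return WIDE_COUNT(wide);
}

/* sizeof on a pointer sees only the pointer, whatever it points to. */
size_t pointer_size(void)
{
    wchar_t wide[WIDE_CAPACITY];
    wchar_t *p = wide;
    return sizeof p;
}
```

    array_capacity() yields 256, while pointer_size() yields the size of a pointer (commonly 4 or 8), which says nothing about the buffer.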

  11. #11
    int x = *((int *) NULL); Cactus_Hugger's Avatar
    Join Date
    Jul 2003
    Location
    Banks of the River Styx
    Posts
    902
    In UTF-16, a single character may occupy more than 2 bytes. (See UTF-16: the G clef example partway down, and the info on surrogate pairs.) Despite this, wcslen() returns 4 for the three-character example string on Wikipedia (water, z, G clef) on both Windows and FreeBSD. I'm not entirely sure about this, but MSDN notes that wcslen is the wide-char version of strlen(), so the output should be 4 despite there being 3 characters. My code:
    Code:
    #include <stdio.h>
    #include <wchar.h>
    
    int main()
    {
    	// CJK character 'water', 'z', and G clef (a surrogate pair, 4 bytes)
    	wchar_t str[] = {0x6C34, 0x007A, 0xD834, 0xDD1E, 0x0000};
    	
    	printf("Length: %u\n", (unsigned)wcslen(str));
    	return 0;
    }
    Now, as for the OP's problem. As others have said, LPWSTR is just an obfuscated pointer, and is not pointing to anything. We need to allocate memory for it. MultiByteToWideChar makes this easy, thankfully, as we can have it figure out how much memory we need:
    Code:
    int length;
    LPWSTR Wide;
    length = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, str, -1, NULL, 0);
    Wide = malloc(sizeof(WCHAR) * length);
    MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, str, -1, Wide, length);
    Also, note that I pass -1 instead of strlen(str). If you pass strlen(str), the resulting Wide string will not be null-terminated! (See the MSDN docs.) If you pass -1, MBTWC will calculate the length and null-terminate (i.e., normal, expected behavior).
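    For comparison, the same measure, allocate, convert pattern can be sketched with the C library's mbstowcs() instead of the Windows API. This is a swapped-in technique, not the code above: to_wide is a hypothetical name, the conversion follows the current C locale rather than CP_ACP, and passing NULL to get the required length is behavior documented by POSIX and MSVC rather than spelled out in the C standard. Unlike MultiByteToWideChar with -1, the length mbstowcs() reports does not include the terminator.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Converts a multibyte string to a newly allocated wide string.
   On success stores the byte size (without terminator) in *out_bytes,
   if out_bytes is not NULL. Returns NULL on invalid input or allocation
   failure. The caller frees the result. */
wchar_t *to_wide(const char *s, size_t *out_bytes)
{
    assert(s != NULL);

    /* Pass 1: ask how many wide characters are needed (terminator excluded). */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *wide = malloc((n + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Pass 2: convert; room for n + 1 means the terminator is written too. */
    if (mbstowcs(wide, s, n + 1) == (size_t)-1) {
        free(wide);
        return NULL;
    }

    if (out_bytes != NULL)
        *out_bytes = n * sizeof *wide;
    return wide;
}
```

    Because the buffer is allocated before the conversion, wcslen() on the result is meaningful, which is exactly what was missing in the original program.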
    long time; /* know C? */
    Unprecedented performance: Nothing ever ran this slow before.
    Any sufficiently advanced bug is indistinguishable from a feature.
    Real Programmers confuse Halloween and Christmas, because dec 25 == oct 31.
    The best way to accelerate an IBM is at 9.8 m/s/s.
    recursion (re - cur' - zhun) n. 1. (see recursion)

  12. #12
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Older Windows NT systems (prior to Windows 2000) only support UCS-2.[3] In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages, possibly with Chinese Windows versions.
    It doesn't say for sure if XP supports UTF-16 or just UCS-2, though.
    I know Windows still does not support UTF-8.
    Btw, Microsoft's implementation of wcslen is:

    Code:
            const wchar_t *eos = wcs;
            while( *eos++ ) ;
            return( (size_t)(eos - wcs - 1) );
    So it doesn't take into account multi-byte at all, but simply counts the number of wide characters (characters that are 2 bytes).
    Last edited by Elysia; 06-22-2008 at 04:29 AM.

  13. #13
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> It doesn't say for sure if XP supports UTF-16 or just UCS-2, though.
    MSDN states that 2000 and up supports characters outside the "basic multilingual plane" (BMP) - and therefore supports Unicode, using the encoding UTF-16LE.
    http://msdn.microsoft.com/en-us/libr...14(VS.85).aspx

    Quote Originally Posted by Elysia View Post
    Unicode seems like a fixed-width version of UTF-16 on Windows.
    To use real Unicode, you may have to choose multi-byte and use other functions to process the strings. Windows has several functions to deal with multi-byte strings.
    The support for UTF16 in Windows (2K and up) is support for "real Unicode". Prior to 2k, "UCS-2" would be the more appropriate term since they only support the BMP.

    "MBCS", as windows uses the term, is for support of pre-Unicode character sets - mainly for Asian languages. MSDN even suggests that "new applications" use Unicode instead of MBCS. (Another reason why I advocate *not* using the TCHAR mechanism - just go wide.)
    http://msdn.microsoft.com/en-us/libr...54(VS.85).aspx

    I think we're all in agreement that wcslen() returns the number of wchar_t's, and not the number of "code points" or "real language characters". If it did anything else, they would have to give it a different name, due to the C standard and all.
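    If the number of code points is what you actually want, a rough sketch for UTF-16 is to count every unit except the trailing half of a surrogate pair. The names below are hypothetical, uint16_t is used so the result does not depend on the platform's wchar_t size, and well-formed UTF-16 input is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit units before the terminating 0
   (what wcslen reports where wchar_t is 16 bits wide). */
size_t utf16_units(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of code points: low (trailing) surrogates 0xDC00..0xDFFF are
   skipped, so each surrogate pair is counted once. */
size_t utf16_code_points(const uint16_t *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != 0; s++) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            count++;
    }
    return count;
}
```

    For the water, z, G clef string {0x6C34, 0x007A, 0xD834, 0xDD1E, 0} this gives 4 units (8 bytes) but 3 code points.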

    gg
