Thread: Unicode String Conversion Help

  1. #1
    Registered User
    Join Date
    Mar 2016
    Posts
    26

    Unicode String Conversion Help

    Hello,

    I want to show a Unicode string in a MessageBox from my Win32 application, built with VS 2015.

    It is this string:

    Code:
    std::string text = "öüç";
    Because MessageBox requires an LPWSTR, I convert the string to LPWSTR with the code below.

    Code:
    LPWSTR ConvertToLPWSTR(const std::string& s)
    {
        LPWSTR ws = new wchar_t[s.size() + 1]; // +1 for zero at the end
        copy(s.begin(), s.end(), ws);
        ws[s.size()] = 0; // zero at the end
        return ws;
    }
    The string is converted to LPWSTR, but when the program runs, MessageBox shows weird, unrecognizable characters.

    Code:
    std::string text = "öüç";
    LPWSTR result = ConvertToLPWSTR(text);
    MessageBox(0, result, TEXT("Title"), MB_OK);
    Is there an issue with the algorithm, or am I doing something wrong?

    BTW, plain Latin characters work fine, but Unicode characters in the string cause problems.

    Cheers

  2. #2
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    First, don't use TCHARs/TEXT() - just enable Unicode in your project settings and use L"" string literals and wchar_t as your wide char type.

    I'm also not a big fan of MS typedefs - I prefer "wchar_t*" instead of LPWSTR.

    Next, you should understand the pitfalls that can bite you when using extended character literals in your source code ("öüç"). In short, you'll need to use a wide string literal and save your source file as UTF8-with-BOM. The long version: "Non-English characters with cout".
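
    To make the pitfall concrete, here is a minimal, portable sketch (not code from this thread). It assumes the narrow string holds UTF8 bytes, spelled out as hex escapes so the result does not depend on how the source file is saved. widen_bytes is a hypothetical helper that mimics copying each char into a wchar_t buffer.

```cpp
#include <cstddef>
#include <string>

// Widens each byte on its own, the way an element-wise copy into a
// wchar_t buffer does. No decoding takes place. The unsigned char cast
// only avoids sign extension on platforms where char is signed.
std::wstring widen_bytes(const std::string& s)
{
    std::wstring out;
    for (std::size_t i = 0; i < s.size(); ++i)
        out += static_cast<wchar_t>(static_cast<unsigned char>(s[i]));
    return out;
}

// "öüç" as UTF8 bytes (assumption: the input is UTF8 encoded)
const std::string utf8_text = "\xC3\xB6\xC3\xBC\xC3\xA7";

// The same three characters as a wide literal, using universal character names
const std::wstring wide_text = L"\u00F6\u00FC\u00E7";
```

    The six bytes become six wide characters, none of which is ö, ü or ç. A real conversion has to decode the byte sequence, which is what a code-page-aware function does.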

    Here are some conversion utils you can use:
    Code:
    #include <windows.h>
    #include <string>
    #include <vector>
    
    std::wstring str_to_wstr(const std::string &str, UINT cp = CP_ACP)
    {
        // An empty input is not an error, but MultiByteToWideChar() fails on length 0
        if (str.empty())
            return std::wstring();
    
        const int srclen = static_cast<int>(str.length());
    
        // First call: ask how many wide characters are needed
        int len = MultiByteToWideChar(cp, 0, str.c_str(), srclen, 0, 0);
        if (!len)
            return L"ErrorA2W";
        
        std::vector<wchar_t> wbuff(len + 1);
        // NOTE: this does not NULL terminate the string in wbuff, but this is ok
        //       since it was zero-initialized in the vector constructor
        if (!MultiByteToWideChar(cp, 0, str.c_str(), srclen, &wbuff[0], len))
            return L"ErrorA2W";
    
        return &wbuff[0];
    }//str_to_wstr
    
    std::string wstr_to_str(const std::wstring &wstr, UINT cp = CP_ACP)
    {
        // An empty input is not an error, but WideCharToMultiByte() fails on length 0
        if (wstr.empty())
            return std::string();
    
        const int srclen = static_cast<int>(wstr.length());
    
        // First call: ask how many bytes are needed
        int len = WideCharToMultiByte(cp, 0, wstr.c_str(), srclen, 0, 0, 0, 0);
        if (!len)
            return "ErrorW2A";
    
        std::vector<char> abuff(len + 1);
    
        // NOTE: this does not NULL terminate the string in abuff, but this is ok
        //       since it was zero-initialized in the vector constructor
        if (!WideCharToMultiByte(cp, 0, wstr.c_str(), srclen, &abuff[0], len, 0, 0))
            return "ErrorW2A";
    
        return &abuff[0];
    }//wstr_to_str
    But if you just save your source files as UTF8 and use the following, it should work:
    Code:
    MessageBox(0, L"öüç", L"Title", MB_OK);
    gg

  3. #3
    Registered User
    Join Date
    Mar 2016
    Posts
    26
    Thank you very much for your reply @Codeplug.

    Code:
    std::string text = "öüç";
    LPWSTR result = str_to_wstr(text);
    MessageBox(0, result, L"Title", MB_OK);
    I switched to your function, but I get a compile error.

    Code:
    error C2440: 'initializing' : cannot convert from 'std::wstring' to 'LPWSTR'
    In all cases, MessageBox requires an LPWSTR. What should I do?

    As you said, I could type those characters in by hand, but what I am actually trying to do is show pipe input from stdin, which contains extended characters, in a MessageBox. I store the stdin data in a string variable, so I suppose a conversion is inevitable.

    To simplify things, I am just setting the variable manually in this topic. Once I can convert the string to LPWSTR correctly, I will try the real pipe example.

  4. #4
    Registered User
    Join Date
    Mar 2016
    Posts
    26
    As you predicted, the file was saved as ANSI. I saved the CPP file as "UTF8" and as "UTF8 with BOM". Neither changed the outcome.

  5. #5
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> std::string text = "öüç";
    You don't want to do this on Windows.

    >> str_to_wstr(text);
    str_to_wstr() returns a std::wstring, which has no conversion operator to "wchar_t*" - and even if it did, the pointer would refer to a temporary (not good).

    This should work the same as the prior example (assuming you saved your source file as UTF8):
    Code:
    std::wstring text = L"öüç";
    MessageBox(0, text.c_str(), L"Title", MB_OK);
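
    The lifetime point can be sketched without the Windows API. This is only an illustration: make_text and api_length are hypothetical stand-ins, the latter for any function that takes a "const wchar_t*" the way MessageBox does.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical producer that returns the string by value
std::wstring make_text()
{
    return L"\u00F6\u00FC\u00E7";
}

// Hypothetical stand-in for an API that takes a wide C string
std::size_t api_length(const wchar_t* s)
{
    return std::wcslen(s);
}

std::size_t safe_call()
{
    // The named object owns the buffer for the rest of this scope...
    std::wstring text = make_text();
    // ...so the pointer from c_str() is valid for the whole call.
    return api_length(text.c_str());

    // By contrast, storing make_text().c_str() in a pointer variable and
    // using it in a later statement would read a destroyed buffer.
}
```

    Keeping the std::wstring in a named variable and calling c_str() at the point of use avoids the dangling pointer.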
    gg

  6. #6
    Registered User
    Join Date
    Mar 2016
    Posts
    26
    Thank you very much again, @Codeplug.

    Yes, changing the text variable's type to wstring works.

    But my original pipe code reads stdin as a plain string:

    Code:
    #include "stdafx.h"
    #include "Test.h"
    #include <iostream>
    #include <string>
    
    int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow){
    
        // declared outside the loop so it is still in scope for the MessageBox call
        std::string text = "";
    
        while (1){
            unsigned int length = 0;
    
            // read the 4-byte message length, least significant byte first
            for (int i = 0; i < 4; i++)
            {
                unsigned int read_char = getchar();
                length = length | (read_char << i * 8);
            }
    
            // read pipe message
            for (size_t i = 0; i < length; i++)
            {
                text += getchar();
            }
    
            break;
        }
    
        LPWSTR result = str_to_wstr(text);
        MessageBox(0, result, L"Title", MB_OK);
    
        return 0;
    }
    So am I wrong to read the pipe as a plain string? Is it possible to read the pipe as a wstring?

  7. #7
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Well, the key here is knowing how those characters are encoded. If they are ACP encoded then you would try:
    Code:
    std::wstring textw = str_to_wstr(text);
    MessageBox(0, textw.c_str(), L"Title", MB_OK);
    If they are UTF8 encoded, then pass CP_UTF8 as the second parameter to str_to_wstr().

    gg

  8. #8
    Registered User
    Join Date
    Mar 2016
    Posts
    26
    Thank you so much @Codeplug, it works great when I add CP_UTF8.

    BTW, how can I check which encoding the characters use? Does it depend on the source file encoding or on the user's OS language? Or should I just use trial and error?

    Cheers

  9. #9
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    "Checking" the encoding is usually only done when there exists "something" for you to check. For files on Windows, you can check for a BOM. If there isn't one, assuming ACP encoding is typical. For standard C/C++ you could make your assumption by using the user's locale. Files like XML can have an "encoding=" attribute that can be checked. Otherwise, there just has to be an agreement on what's what. E.g. Windows uses UTF16LE in its APIs.

    There isn't really a 100% reliable "analyse the bytes and determine encoding" method.
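
    As a rough sketch of the BOM check mentioned above (detect_bom and the Bom enum are hypothetical names, and only the three most common marks are covered):

```cpp
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Looks only at the first bytes of the data. A missing BOM says nothing
// about the encoding; the caller still needs a default or an agreement.
// Note: a UTF32LE BOM (FF FE 00 00) would be reported as Utf16LE here.
Bom detect_bom(const std::string& bytes)
{
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return Bom::Utf8;
    if (bytes.size() >= 2 && bytes.compare(0, 2, "\xFF\xFE") == 0)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && bytes.compare(0, 2, "\xFE\xFF") == 0)
        return Bom::Utf16BE;
    return Bom::None;
}
```

    For example, detect_bom() on the contents of a file saved as "UTF8 with BOM" would return Bom::Utf8, while the same file saved as plain UTF8 or ANSI would return Bom::None.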

    gg

  10. #10
    Registered User
    Join Date
    Mar 2016
    Posts
    26
    Thanks again. Much appreciated.

    Now my code works for extended characters. But does the CP_UTF8 argument in your function also support characters like Chinese and Arabic?

    I am curious whether Chinese and Arabic users can send characters in their language through the pipe and see them correctly in the MessageBox. I cannot even type those characters on my keyboard to test, and I am not sure my environment matches theirs.

    Is it my duty to support internationalization, or is that the Windows operating system's job?

    Cheers

  11. #11
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    UTF8 supports all of Unicode.

    "Duty" is strong wording. But if you turn on Unicode in your project settings, and use wchar_t based strings, then you're most of the way there. The last mile would be supporting translations for any static text.

    There are websites you can use to lookup the UTF8 encoding of characters from other languages:
    https://ssl.icu-project.org/icu-bin/...e?ch=3046#here
    Browser Test Page for Unicode Character 'HIRAGANA LETTER U' (U+3046)
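
    For anyone who cannot type such characters, the claim can also be checked in code. The following is a portable sketch of what a UTF8 to UTF16 conversion does, not the Windows implementation. utf8_to_utf16 is a hypothetical helper, and it skips some validation (overlong forms, encoded surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF8 bytes into UTF16 code units, the form the wide Windows
// APIs expect. Throws on truncated or malformed sequences.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (lead < 0x80)               { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06)  { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E)  { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E)  { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF8 lead byte");

        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::runtime_error("code point out of range");

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Code points above U+FFFF need a surrogate pair in UTF16
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

    With this, the bytes E3 81 86 decode to U+3046 (the hiragana letter above), E4 B8 AD to the Chinese character U+4E2D, and D8 B9 to the Arabic letter U+0639, so the same code path handles all three scripts.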

    gg

  12. #12
    Registered User
    Join Date
    Mar 2016
    Posts
    26
    Thanks again. That helped a lot.

  13. #13
    Registered User
    Join Date
    Mar 2016
    Posts
    26
    Oh, I just forgot to ask, @Codeplug:

    I use Unicode and wchar_t, but the Windows pipe uses a simple string. Isn't this a caveat? Can international characters be lost when stdin is read into a simple string?

    Cheers

  14. #14
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    When transporting bytes over a socket/stream, it's up to you on how to decompose and recompose those bytes - which includes the encoding of any strings.

    >> length = length | (read_char << i * 8);
    This is a perfect example. A four byte integer being decomposed/recomposed to/from a stream of bytes in little-endian order (the first byte read ends up in the lowest 8 bits).
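
    A minimal sketch of that framing on both ends, assuming a 4-byte length prefix with the least significant byte first, followed by the payload bytes. make_frame, read_frame and the kMaxFrameBytes cap are hypothetical names and choices, not part of the original code.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Arbitrary safety cap (assumption) so a corrupt prefix cannot trigger a huge allocation
const std::uint32_t kMaxFrameBytes = 1u << 20;

// Sender side: prefix the payload with its length, least significant byte first
std::string make_frame(const std::string& payload)
{
    const std::uint32_t length = static_cast<std::uint32_t>(payload.size());
    std::string frame;
    for (int i = 0; i < 4; ++i)
        frame += static_cast<char>((length >> (i * 8)) & 0xFF);
    frame += payload;
    return frame;
}

// Receiver side: returns false if the stream ends early or the length is too large.
// The payload bytes are passed through untouched; their encoding is up to both ends.
bool read_frame(std::istream& in, std::string& payload)
{
    std::uint32_t length = 0;
    for (int i = 0; i < 4; ++i) {
        const int c = in.get();
        if (c == std::char_traits<char>::eof())
            return false;
        length |= static_cast<std::uint32_t>(c) << (i * 8);
    }
    if (length > kMaxFrameBytes)
        return false;

    payload.assign(length, '\0');
    if (length == 0)
        return true;
    in.read(&payload[0], static_cast<std::streamsize>(length));
    return in.gcount() == static_cast<std::streamsize>(length);
}
```

    Nothing is lost by carrying the text in a std::string, as long as both ends agree on the encoding of the payload. When the real stdin is used on Windows, it may also need to be switched to binary mode (for example with _setmode) so the runtime does not translate any bytes.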

    What is on the other end of the pipe? There are additional issues you can run into when dealing with standard I/O and the Windows console.

    gg

  15. #15
    Registered User
    Join Date
    Mar 2016
    Posts
    26
    Thanks again. I read stdin as a string, then convert it to wchar_t and wstring, but for stdout I also send the result as a new simple string.

    I just wanted to make sure the string to wchar_t conversion does not cause any issue for internationalization. I did not have any issue on my workstation, but I want to make sure localization also works for other users.
