Unicode String Conversion Help

**JessicaS** · 03-18-2016

Hello,

I want to show a Unicode string in a MessageBox from my Win32 application, built with VS 2015.

It is this string:

Code:

std::string text = "öüç";

Because MessageBox requires LPWSTR, I have to convert the string to LPWSTR with the code below.

Code:

LPWSTR ConvertToLPWSTR(const std::string& s)
{
    LPWSTR ws = new wchar_t[s.size() + 1]; // +1 for zero at the end
    copy(s.begin(), s.end(), ws);
    ws[s.size()] = 0; // zero at the end
    return ws;
}

But although the string is converted to LPWSTR, when the program runs, MessageBox shows it as weird, unrecognizable characters.

Code:

std::string text = "öüç";
LPWSTR result = ConvertToLPWSTR(text);
MessageBox(0, result, TEXT("Title"), MB_OK);

Is there an issue with the algorithm, or am I doing something wrong?

BTW, Latin characters work fine, but Unicode characters cause problems in the string.

Cheers

**Codeplug** · 03-18-2016

First, don't use TCHARs/TEXT() - just enable Unicode in your project settings and use L"" string literals and wchar_t as your wide char type.

I'm also not a big fan of MS typedefs - I prefer "wchar_t*" instead of LPWSTR.

Next, you should understand the pitfalls that can bite you when using extended character literals in your source code ("öüç"). In short, you'll need to use a wide string literal and save your source file as UTF8-with-BOM. For the long version, see the thread "Non-English characters with cout".

Here are some conversion utils you can use:

Code:

#include <windows.h>
#include <string>
#include <sstream>
#include <vector>

std::wstring str_to_wstr(const std::string &str, UINT cp = CP_ACP)
{
    // An empty input is valid, but MultiByteToWideChar fails on a zero length
    if (str.empty())
        return std::wstring();

    const int srclen = static_cast<int>(str.length());
    int len = MultiByteToWideChar(cp, 0, str.c_str(), srclen, 0, 0);
    if (!len)
        return L"ErrorA2W";

    std::vector<wchar_t> wbuff(len);
    if (!MultiByteToWideChar(cp, 0, str.c_str(), srclen, &wbuff[0], len))
        return L"ErrorA2W";

    // Build from the explicit length so no NULL terminator is needed
    return std::wstring(wbuff.begin(), wbuff.end());
}//str_to_wstr

std::string wstr_to_str(const std::wstring &wstr, UINT cp = CP_ACP)
{
    // An empty input is valid, but WideCharToMultiByte fails on a zero length
    if (wstr.empty())
        return std::string();

    const int srclen = static_cast<int>(wstr.length());
    int len = WideCharToMultiByte(cp, 0, wstr.c_str(), srclen,
                                  0, 0, 0, 0);
    if (!len)
        return "ErrorW2A";

    std::vector<char> abuff(len);
    if (!WideCharToMultiByte(cp, 0, wstr.c_str(), srclen,
                             &abuff[0], len, 0, 0))
    {
        return "ErrorW2A";
    }//if

    // Build from the explicit length so no NULL terminator is needed
    return std::string(abuff.begin(), abuff.end());
}//wstr_to_str

But if you just save your source files as UTF8 and use the following, it should work:

Code:

MessageBox(0, L"öüç", L"Title", MB_OK);

gg

**JessicaS** · 03-18-2016

Thank you very much for your reply @Codeplug.

Code:

            std::string text = "öüç";
            LPWSTR result = str_to_wstr(text);
            MessageBox(0, result, L"Title", MB_OK);

I switched to your function, but I get a compile error.

Code:

error C2440: 'initializing' : cannot convert from 'std::wstring' to 'LPWSTR'

In all cases, MessageBox requires LPWSTR. What should I do?

As you said, I can type those characters by hand, but I am actually trying to show text read from a pipe (stdin) that contains extended characters. I store the stdin data in a string variable, so a conversion is inevitable, I suppose.

To simplify things, I am just setting the variable manually in this topic. Once I can convert string to LPWSTR correctly, I will try the real pipe example.

**JessicaS** · 03-18-2016

As you predicted, the file was saved as ANSI. I saved the CPP file as "UTF8" and as "UTF8 with BOM". Neither changed the outcome.

**Codeplug** · 03-18-2016

>> std::string text = "öüç";
You don't want to do this on Windows.

>> str_to_wstr(text);
str_to_wstr() returns type std::wstring, which has no conversion operator to "wchar_t*" - and even if it did, the pointer would refer to a temporary (not good).

This should work the same as the prior example (assuming you saved your source file as UTF8):

Code:

std::wstring text = L"öüç";
MessageBox(0, text.c_str(), L"Title", MB_OK);

gg

**JessicaS** · 03-18-2016

Thank you very much again @Codeplug.

Yes, after changing the type of the text variable to wstring, it works.

But my original pipe code reads stdin as a plain string:

Code:

#include "stdafx.h"
#include "Test.h"
#include <iostream>
#include <string>


int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow){


    while (1){
        unsigned int length = 0;


        //Read the first 4 bytes as the message length
        for (int i = 0; i < 4; i++)
        {
            unsigned int read_char = getchar();
            length = length | (read_char << i * 8);
        }


        //read pipe message
        std::string text = "";
        
        for (size_t i = 0; i < length; i++)
        {
            text += getchar();
        }

        break;
    }

    LPWSTR result = str_to_wstr(text);
    MessageBox(0, result, L"Title", MB_OK);

    return 0;
}

So am I wrong to read the pipe as a plain string? Is it possible to read the pipe as a wstring?

**Codeplug** · 03-18-2016

Well, the key here is knowing how those characters are encoded. If they are ACP encoded then you would try:

Code:

std::wstring textw = str_to_wstr(text);
MessageBox(0, textw.c_str(), L"Title", MB_OK);

If they are UTF8 encoded, then pass CP_UTF8 as the second parameter to str_to_wstr().
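To see why the encoding matters, here is a rough, portable sketch of the difference between widening each byte on its own (which is what the copy() attempt at the top of the thread does) and actually decoding UTF8. The names widen_bytes and utf8_to_utf16 are made up for illustration, and std::u16string stands in for the 16-bit wchar_t strings used on Windows. The decoder assumes well-formed input and does very little validation; on Windows you would let MultiByteToWideChar with CP_UTF8 do this work.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: widens every byte on its own, like the copy() attempt.
// A multi-byte UTF8 sequence becomes several unrelated characters.
std::u16string widen_bytes(const std::string& s)
{
    std::u16string out;
    for (unsigned char b : s)
        out.push_back(static_cast<char16_t>(b));
    return out;
}

// Hypothetical helper: minimal UTF8 -> UTF16 decoder.
// Assumes well-formed input; invalid lead bytes and truncated
// sequences are replaced with U+FFFD.
std::u16string utf8_to_utf16(const std::string& s)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t n = 1;
        if (b0 < 0x80)                { cp = b0;        n = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; n = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; n = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; n = 4; }
        else                          { cp = 0xFFFD;    n = 1; }

        if (i + n > s.size()) {       // sequence cut off at the end
            out.push_back(0xFFFD);
            break;
        }
        // Each continuation byte contributes its low 6 bits
        for (std::size_t k = 1; k < n; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        i += n;

        if (cp >= 0x10000) {          // outside the BMP: surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

For the UTF8 bytes of "öüç" (C3 B6 C3 BC C3 A7), widen_bytes produces six code units while utf8_to_utf16 produces the three intended ones, which is the gap that choosing the right code page closes.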

gg

**JessicaS** · 03-18-2016

Thank you so much @Codeplug, it works great when I add CP_UTF8.

BTW, how can I check which encoding the characters use? Does it depend on the source file encoding or on the user's OS language? Or should I just use trial and error?

Cheers

**Codeplug** · 03-18-2016

"Checking" the encoding is usually only done when there exists "something" for you to check. For files on Windows, you can check for a BOM. If there isn't one, assuming ACP encoding is typical. For standard C/C++ you could make your assumption by using the user's locale. Files like XML can have an "encoding=" attribute that can be checked. Otherwise, there just has to be an agreement on what's what. E.g., Windows uses UTF16LE in its APIs.

There isn't really a 100% reliable "analyse the bytes and determine encoding" method.
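As a small illustration of the BOM check, something along these lines would do. The Bom enum and detect_bom are hypothetical names, not part of any Windows API, and the sketch only looks for the UTF8 and UTF16 marks. UTF32 is ignored, so a UTF32LE file would be reported as UTF16LE here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Hypothetical helper: inspects the first bytes of a buffer for a byte order mark.
// A result of Bom::None says nothing about the real encoding; it only means
// there was no mark to check.
Bom detect_bom(const std::string& bytes)
{
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;      // EF BB BF
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;   // FF FE
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;   // FE FF
    return Bom::None;
}
```

When the result is Bom::None you are back to an assumption or an agreement, as described above.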

gg

**JessicaS** · 03-18-2016

Thanks again. Much appreciated.

Now my code works for extended characters. But does the CP_UTF8 argument in your function also support characters like Chinese and Arabic?

I am curious whether Chinese and Arabic users can send characters in their languages through the pipe and see them correctly in the MessageBox. I cannot even type those characters on my keyboard to test, and I am not sure my environment matches theirs.

Is it my duty to support internationalization, or the Windows operating system's?

Cheers

**Codeplug** · 03-18-2016

UTF8 supports all of Unicode.

"Duty" is strong wording

But if you turn on Unicode in your project settings, and use wchar_t based strings, then you're most of the way there. The last mile would be supporting translations for any static text.

There are websites you can use to look up the UTF8 encoding of characters from other languages:
https://ssl.icu-project.org/icu-bin/...e?ch=3046#here
Browser Test Page for Unicode Character 'HIRAGANA LETTER U' (U+3046)
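If you can't type the characters, one option is to build test input from code points. The sketch below is a hypothetical encode_utf8 helper (not a library function) that follows the standard UTF8 bit layout. It assumes the argument is a valid Unicode scalar value, so no surrogates and nothing above U+10FFFF.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF8 bytes.
// Assumes cp is a valid scalar value; no validation is done.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x3046) yields the three bytes E3 81 86 for HIRAGANA LETTER U, which you could write to the pipe and then check in the MessageBox.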

gg

**JessicaS** · 03-18-2016

Thanks again. That helped a lot.

**JessicaS** · 03-18-2016

Oh just forgot to ask @Codeplug

I use Unicode and wchar_t, but the Windows pipe uses plain strings. Isn't this a caveat? Can any international characters be lost when stdin is read as a plain string?

Cheers

**Codeplug** · 03-18-2016

When transporting bytes over a socket/stream, it's up to you how to decompose and recompose those bytes - which includes the encoding of any strings.

>> length = length | (read_char << i * 8);
This is a perfect example. A four byte integer being decomposed/recomposed to/from a stream of bytes in little-endian order (the first byte read lands in the lowest 8 bits).

What is on the other end of the pipe? There are additional issues you can run into when dealing with standard I/O and the Windows console.

gg

**JessicaS** · 03-18-2016

Thanks again. I get stdin as a string, then convert it to wchar_t and wstring, but for stdout I also send a new plain string.

I just wanted to make sure the string to wchar_t conversion does not cause any issues for internationalization. I did not have any issues on my workstation, but I want to make sure localization also works for other users.
