const wchar_t * to UTF-8 const char* p

**Hulag** · 11-25-2006

I need to convert a const std::wstring * to a UTF-8 encoded const char * and I just can't make it work.

I wrote the following code to do that, but it isn't working:

Code:

//wideData is defined as const std::wstring *wideData
size_t count = 0;
char *convertedChar =  new char[wideData->size() + 1];

count = wcstombs(convertedChar, wideData->c_str(), wideData->size());

But while wideData->c_str() looks fine and wideData->size() too, after wcstombs count turns into 4 billion something, and convertedChar is just "". But if I replace the last line with

Code:

count = wcstombs(convertedChar, L"test", wideData->size());

Then count is 4 and convertedChar is "test" so it is working there. What am I doing wrong?

Thanks in advance

**Dave_Sinkula** · 11-25-2006

Originally Posted by Hulag

after wcstombs count turns into 4 billion something.

That sounds like the function is returning a fail code but you aren't checking for it.

Do you have a small snippet for testing? I've been using this:

Code:

#include <cstdlib>   // wcstombs
#include <iostream>
#include <string>
#include <locale>
using namespace std;

size_t foo(const std::wstring *wideData)
{
   size_t count = 0;
   char *convertedChar =  new char[wideData->size() + 1];
   count = wcstombs(convertedChar, wideData->c_str(), wideData->size());
   if ( count != (size_t)-1 )
   {
      convertedChar[count] = '\0';
      cout << convertedChar << '\n';
   }
   delete[] convertedChar;
   return count;
}

int main()
{
   const std::wstring text(L"Hello world");
   if ( foo(&text) == (size_t)-1 )
   {
      cerr << "Houston, we have a problem\n";
   }
} 

/* my output
Hello world
*/

**Hulag** · 11-25-2006

You are right, wcstombs is returning -1. How can I fix that?
I don't have more code to show really, wideData comes from a Unicode text file with a BOM that has the following text:

Code:

<Config>
	<Sound>
		<File Name="music.wav" />
	</Sound>
	<Sound>
		<File Name="music2.wav" />
	</Sound>
	<Sound>
		<File Name="music3.wav" />
	</Sound>
</Config>

And wideData shows that exact same text but wcstombs fails.

**Cat** · 11-25-2006

Hmm. I'd say both of you have some memory allocation problems; there is no guarantee that it's possible to convert the string into only wideData->size() + 1 characters.

In fact the presence of a single non-ASCII character in the input data would guarantee that only part of the string gets converted.

A single Unicode code point can take up to four bytes to encode in UTF-8 (the original scheme allowed up to six, but Unicode stops at U+10FFFF, so the longer forms are never needed).
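To make the size argument concrete, here is a minimal sketch that computes how many bytes UTF-8 needs per code point. The names `utf8_length` and `utf8_buffer_size` are made up for illustration, and the input is assumed to be a sequence of full code points (`char32_t`), which sidesteps the question of whether `wchar_t` is 16 or 32 bits wide on a given platform.

```cpp
#include <cstddef>
#include <string>

// Bytes UTF-8 needs for one Unicode code point.
// Returns 0 for values that cannot be encoded (surrogates, or above U+10FFFF).
std::size_t utf8_length(char32_t cp)
{
    if (cp < 0x80) return 1;                     // ASCII
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogate halves are not characters
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

// Exact buffer size for the whole string, including the terminating '\0'.
// Unencodable values contribute nothing here; a real converter would report them.
std::size_t utf8_buffer_size(const std::u32string &text)
{
    std::size_t total = 1;  // room for '\0'
    for (char32_t cp : text)
        total += utf8_length(cp);
    return total;
}
```

For pure ASCII input the result is `size() + 1`, which is why the original allocation happens to work for "Hello world" but not in general.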

**dwks** · 11-25-2006

Maybe this?

If a wide-character code is encountered that does not correspond to a valid character (of one or more bytes each), wcstombs() shall return (size_t)-1.

From http://www.opengroup.org/onlinepubs/.../wcstombs.html

But that doesn't make sense.

**Hulag** · 11-25-2006

Well, even if I change the amount of allocated memory to something like the following, it still fails. Also, there is no weird character in wideData (there might be, but right now there isn't).

Code:

char *convertedChar = new char[2000000]; //Still doesn't work with this.

**dwks** · 11-25-2006

there is no guarantee that it's possible to convert the string into only wideData->size() + 1 characters.

What size would you use then? wideData->size()*sizeof(wchar_t)+1?

**Dave_Sinkula** · 11-25-2006

Originally Posted by Hulag

I don't have more code to show really,

Well could you post it anyway? That way I won't have to invent your wifstream setup etc.

Originally Posted by Hulag

wideData comes from a Unicode text file with a BOM that has the following text:

Perhaps you could attach the file so that the encoding will remain the same. My options for saving don't include a "UNICODE", but I've got plenty of flavors of UTF.

**Hulag** · 11-25-2006

Originally Posted by Dave_Sinkula

Well could you post it anyway? That way I won't have to invent your wifstream setup etc.

I can't, the code is huge and doesn't belong to me. Sorry

Anyway, is it really that important? The value of the wstring is fine, the debugger shows that the text is exactly the same as the text in the file.

Originally Posted by Dave_Sinkula

Perhaps you could attach the file so that the encoding will remain the same. My options for saving don't include a "UNICODE", but I've got plenty of flavors of UTF.

Here I attached it.

**Dave_Sinkula** · 11-25-2006

Originally Posted by Hulag

I can't, the code is huge and doesn't belong to me. Sorry

I'm talking about creating a minimal snippet that demonstrates the issue exactly, not posting 99% irrelevant code.

Originally Posted by Hulag

Anyway, is it really that important?

My attempts at assistance end if I can't reproduce your issue(s). So that's your call.

Thanks for posting the attachment. To verify, since I am in uncertain waters, the source file is UTF-16, right?

**Hulag** · 11-25-2006

Well, I managed to fix the problem. The problem was the byte order mark at the beginning of the file. I had to advance the pointer to the text by one character to skip it, and then wcstombs worked. I hope people who have the same problem find this thread and can solve it.
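A minimal sketch of that fix, assuming the BOM arrives in the wstring as a single leading U+FEFF character. `skip_bom` is a hypothetical helper name, not something from the original code.

```cpp
#include <string>

// Returns a pointer to the first character after a leading byte order mark
// (U+FEFF), or to the start of the text if there is none.
// The pointer stays valid only as long as `text` is alive and unmodified.
const wchar_t *skip_bom(const std::wstring &text)
{
    const wchar_t *p = text.c_str();
    if (!text.empty() && text[0] == L'\xFEFF')
        ++p;  // step over the BOM
    return p;
}
```

The result can be passed to wcstombs in place of `wideData->c_str()`. Checking for the BOM first, rather than always adding one, keeps the code correct for files that were saved without it.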

Thanks for the help guys

**Cat** · 11-25-2006

Originally Posted by dwks

What size would you use then? wideData->size()*sizeof(wchar_t)+1?

Worst case scenario is wideData->size()*4 + 1 assuming you're converting into UTF-8.

The wcstombs function itself can tell you the size of the needed buffer if you pass NULL as the destination argument. The nice thing about that is it works for any encoding.

If it can't convert the BOM though, I doubt it's converting into UTF-8 as the BOM is valid there too. Of course you should trash the BOM too anyway.
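The two-pass approach could look roughly like this. It is a sketch that assumes the platform returns the required length for a NULL destination, as POSIX documents; `to_multibyte` is an illustrative name. The output encoding is whatever the current locale uses, so this yields UTF-8 only if a UTF-8 locale is active.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Converts `wide` to the current locale's multibyte encoding.
// Returns false if wcstombs reports a character it cannot convert.
bool to_multibyte(const std::wstring &wide, std::string &out)
{
    const std::size_t failed = static_cast<std::size_t>(-1);

    // Pass 1: NULL destination, so nothing is stored; the return value is
    // the number of bytes needed, not counting the terminating '\0'.
    std::size_t needed = std::wcstombs(NULL, wide.c_str(), 0);
    if (needed == failed)
        return false;

    // Pass 2: convert into a buffer that is known to be large enough.
    std::vector<char> buffer(needed + 1);
    std::size_t written = std::wcstombs(&buffer[0], wide.c_str(), buffer.size());
    if (written == failed)
        return false;

    out.assign(&buffer[0], written);
    return true;
}
```

Because the size comes from the conversion routine itself, there is no need to guess a multiplier such as 4, and the error check from the first pass catches unconvertible input before any memory is allocated.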

**CornedBee** · 11-25-2006

How do you know the function's converting to UTF-8, anyway? Did you set the appropriate locale?
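One way to check is to set a locale explicitly and probe the conversion. The sketch below is hedged: the locale names are common ones but are assumptions, since availability varies by system, and both function names are made up for illustration. A program that never calls setlocale runs in the "C" locale, where wcstombs typically handles only a basic character set, which would explain a failure on U+FEFF.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>

// Tries a few common UTF-8 locale names (assumed; not guaranteed to exist).
// Returns the name that was accepted, or NULL if none could be set.
const char *select_utf8_locale()
{
    static const char *const candidates[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (std::size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i)
    {
        // LC_CTYPE alone governs wcstombs; other categories stay untouched.
        if (std::setlocale(LC_CTYPE, candidates[i]) != NULL)
            return candidates[i];
    }
    return NULL;
}

// Probe: under a UTF-8 locale, U+00E9 must come out as the bytes C3 A9.
bool converts_to_utf8()
{
    char buf[8] = { 0 };
    std::size_t n = std::wcstombs(buf, L"\xE9", sizeof buf);
    return n == 2
        && static_cast<unsigned char>(buf[0]) == 0xC3
        && static_cast<unsigned char>(buf[1]) == 0xA9;
}
```

Calling `select_utf8_locale()` once at startup and then `converts_to_utf8()` tells you whether later wcstombs calls will really produce UTF-8. If no candidate is accepted, a conversion routine that does not depend on the locale is the safer route.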
