Thread: const wchar_t * to UTF-8 const char* p

  1. #1
    Registered User
    Join Date
    Oct 2003
    Posts
    97

    const wchar_t * to UTF-8 const char* p

    I need to convert a const std::wstring * to a UTF-8 encoded const char * and I just can't make it work.

    I wrote the code below to do that, but it isn't working:
    Code:
    //wideData is defined as const std::wstring *wideData
    size_t count = 0;
    char *convertedChar =  new char[wideData->size() + 1];
    
    count = wcstombs(convertedChar, wideData->c_str(), wideData->size());
    Both wideData->c_str() and wideData->size() look fine, but after wcstombs count turns into 4 billion something, and convertedChar is just "". If I replace the last line with
    Code:
    count = wcstombs(convertedChar, L"test", wideData->size());
    Then count is 4 and convertedChar is "test" so it is working there. What am I doing wrong?

    Thanks in advance

  2. #2
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by Hulag
    after wcstombs count turns into 4 billion something.
    That sounds like the function is returning its failure code, (size_t)-1, which is 4294967295 when size_t is 32 bits, but you aren't checking for it.

    Do you have a small snippet for testing? I've been using this:
    Code:
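    #include <cstdlib>   // wcstombs
    #include <iostream>
    #include <string>
    #include <locale>
    using namespace std;
    
    // Converts *wideData with wcstombs and prints the result.
    // Returns the byte count, or (size_t)-1 if the conversion failed.
    size_t foo(const std::wstring *wideData)
    {
       size_t count = 0;
       char *convertedChar =  new char[wideData->size() + 1];
       count = wcstombs(convertedChar, wideData->c_str(), wideData->size());
       if ( count != (size_t)-1 )   // (size_t)-1 means a character could not be converted
       {
          convertedChar[count] = '\0';   // wcstombs does not terminate a full buffer
          cout << convertedChar << '\n';
       }
       delete[] convertedChar;
       return count;
    }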
    
    int main()
    {
       const std::wstring text(L"Hello world");
       if ( foo(&text) == (size_t)-1 )
       {
          cerr << "Houston, we have a problem\n";
       }
    } 
    
    /* my output
    Hello world
    */
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  3. #3
    Registered User
    Join Date
    Oct 2003
    Posts
    97
    You are right, wcstombs is returning -1. How can I fix that?
    I don't have more code to show really, wideData comes from an Unicode BOM text file that has the following text:
    Code:
    <Config>
    	<Sound>
    		<File Name="music.wav" />
    	</Sound>
    	<Sound>
    		<File Name="music2.wav" />
    	</Sound>
    	<Sound>
    		<File Name="music3.wav" />
    	</Sound>
    </Config>
    And wideData shows that exact same text but wcstombs fails.

  4. #4
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Hmm. I'd say both of you have some memory allocation problems; there is no guarantee that it's possible to convert the string into only wideData->size() + 1 characters.

    In fact the presence of a single non-ASCII character in the input data would guarantee that only part of the string gets converted.

    A single Unicode code point can take up to four bytes (chars) to encode in UTF-8 (the original scheme allowed up to six, but Unicode stops at U+10FFFF, so the longer forms are never needed).
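
    To make the byte counts concrete, here is a minimal sketch of how many bytes UTF-8 needs for one code point. The function name utf8_length is made up for illustration; it is not part of any library.

```cpp
#include <cstddef>

// Number of bytes UTF-8 needs for one Unicode code point.
// Returns 0 for values that cannot be encoded: the surrogate range
// U+D800..U+DFFF and anything above U+10FFFF.
// (utf8_length is a hypothetical helper name, used only for this sketch.)
std::size_t utf8_length(unsigned long cp)
{
   if (cp <= 0x7F)                   return 1;  // ASCII
   if (cp <= 0x7FF)                  return 2;
   if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogate halves
   if (cp <= 0xFFFF)                 return 3;  // rest of the BMP, including the BOM U+FEFF
   if (cp <= 0x10FFFF)               return 4;
   return 0;
}
```

    So a buffer of size() + 1 chars is only enough when every character is plain ASCII.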

  5. #5
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Maybe this?
    If a wide-character code is encountered that does not correspond to a valid character (of one or more bytes each), wcstombs() shall return (size_t)-1.
    From http://www.opengroup.org/onlinepubs/.../wcstombs.html

    But that doesn't make sense.
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  6. #6
    Registered User
    Join Date
    Oct 2003
    Posts
    97
    Well, even when I change the amount of allocated memory to something like the following, it still fails. Also, there is no weird character in wideData (there might be in the future, but right now there isn't).
    Code:
    char *convertedChar = new char[2000000]; //Still doesn't work with this.

  7. #7
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    there is no guarantee that it's possible to convert the string into only wideData->size() + 1 characters.
    What size would you use then? wideData->size()*sizeof(wchar_t)+1?

  8. #8
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by Hulag
    I don't have more code to show really,
    Well could you post it anyway? That way your wifstream setup etc I won't have to invent.
    Quote Originally Posted by Hulag
    wideData comes from an Unicode BOM text file that has the following text:
    Perhaps you could attach the file so that the encoding will remain the same. My options for saving don't include a "UNICODE", but I've got plenty of flavors of UTF.

  9. #9
    Registered User
    Join Date
    Oct 2003
    Posts
    97
    Quote Originally Posted by Dave_Sinkula
    Well could you post it anyway? That way your wifstream setup etc I won't have to invent.
    I can't, the code is huge and doesn't belong to me. Sorry. Anyway, is it really that important? The value of the wstring is fine; the debugger shows that the text is exactly the same as the text in the file.

    Quote Originally Posted by Dave_Sinkula
    Perhaps you could attach the file so that the encoding will remain the same. My options for saving don't include a "UNICODE", but I've got plenty of flavors of UTF.
    Here I attached it.

  10. #10
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by Hulag
    I can't, the code is huge and doesn't belong to me. Sorry
    I'm talking about creating a minimal snippet that demonstrates the issue exactly, not posting 99% irrelevant code.
    Quote Originally Posted by Hulag
    Anyway, is it really that important?
    My attempts at assistance end if I can't reproduce your issue(s). So that's your call.

    Thanks for posting the attachment. To verify, since I am in uncertain waters, the source file is UTF-16, right?

  11. #11
    Registered User
    Join Date
    Oct 2003
    Posts
    97
    Well, I managed to fix the problem. The cause was the byte order mark at the beginning of the file. I had to advance the pointer to the text by one to skip it, and then wcstombs worked. I hope people who have the same problem find this thread and can solve it.

    Thanks for the help guys
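
    For anyone landing here later, a rough sketch of that fix, assuming the BOM arrives as the single wide character U+FEFF at the start of the string. skip_bom is a hypothetical name, not something from the original project.

```cpp
#include <string>

// Return a pointer to the text, stepping over a leading byte order mark
// (U+FEFF) if there is one. The wstring itself is not modified, and the
// pointer is only valid while the wstring is alive and unchanged.
// (skip_bom is a hypothetical helper name, used only for this sketch.)
const wchar_t *skip_bom(const std::wstring &wide)
{
   const wchar_t bom = static_cast<wchar_t>(0xFEFF);
   const wchar_t *p = wide.c_str();
   if (!wide.empty() && wide[0] == bom)
      ++p;   // skip exactly one character, the BOM
   return p;
}
```

    The result can then be passed to wcstombs in place of wideData->c_str().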

  12. #12
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Quote Originally Posted by dwks
    What size would you use then? wideData->size()*sizeof(wchar_t)+1?
    Worst case scenario is wideData->size()*4 + 1 assuming you're converting into UTF-8.

    The wcstombs function itself can tell you the size of the needed buffer if you pass NULL as the destination argument. The nice thing about that is it works for any encoding.

    If it can't convert the BOM though, I doubt it's converting into UTF-8 as the BOM is valid there too. Of course you should trash the BOM too anyway.
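
    A minimal sketch of that two-pass approach, in case it helps. The helper name narrow is made up, and the NULL-destination behavior is what POSIX documents for wcstombs; I'm assuming your C library does the same.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wide string to the multibyte encoding of the current C locale.
// Returns false if wcstombs reports a character it cannot convert.
// (narrow is a hypothetical helper name, used only for this sketch.)
bool narrow(const std::wstring &wide, std::string &out)
{
   // Pass 1: with a null destination, wcstombs only counts the bytes
   // needed, not including the terminating null (POSIX behavior; assumed here).
   std::size_t needed = std::wcstombs(NULL, wide.c_str(), 0);
   if (needed == (std::size_t)-1)
      return false;

   // Pass 2: convert into a buffer that also has room for the terminator.
   std::vector<char> buf(needed + 1);
   std::size_t written = std::wcstombs(&buf[0], wide.c_str(), buf.size());
   if (written == (std::size_t)-1)
      return false;

   out.assign(&buf[0], written);
   return true;
}
```

    Which encoding comes out still depends on the current locale, so this alone does not guarantee UTF-8.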
    Last edited by Cat; 11-25-2006 at 03:58 PM.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  13. #13
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    How do you know the function's converting to UTF-8, anyway? Did you set the appropriate locale?
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law
