Thread: Getting ASCII equivalent of UNICODE string

  1. #1
    the hat of redundancy hat nvoigt's Avatar
    Join Date
    Aug 2001
    Location
    Hannover, Germany
    Posts
    3,130

    Getting ASCII equivalent of UNICODE string

    I would like to get a matching ASCII (7bit) string for a UNICODE string.

    Example: I would like to have "Lÿdia" transformed into "Lydia".

    I'm aware that this would be a good guess at best and irreversible. However, I only have ASCII and I have to fit UNICODE in there the best looking way possible.

    I tried narrowing and widening it again with the classic locale, which is what I need in the end, but it doesn't transform unknown characters; it just replaces them with blanks. No surprise, really. Is there any way to transform those characters, or do I have to create a huge wchar_t-to-char lookup table myself?


    Code:
    #include <string>
    #include <locale>
    #include <iostream>
    #include <cstdlib>   // system()
    
    std::wstring widen( const std::string& s, const std::locale& loc = std::locale() )
    {
    	std::wstring out;
    	
    	out.reserve( s.size() );
    
    	const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(loc);
    
    	for( std::string::size_type i = 0 ; i < s.size() ; ++i )
    	{
    		out.push_back( f.widen( s[i] ) );
    	}
    
    	return out;
    }
    
    std::string narrow( const std::wstring& s, const std::locale& loc = std::locale() )
    {
    	std::string out;
    	
    	out.reserve( s.size() );
    
    	const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(loc);
    
    	for( std::wstring::size_type i = 0 ; i < s.size() ; ++i )
    	{
    		// Standard narrow() takes a fallback char for characters with no narrow equivalent
    		out.push_back( f.narrow( s[i], ' ' ) );
    	}
    
    	return out;
    }
    
    std::wstring asciify( const std::wstring& text )
    {
    	std::string norm = narrow( text, std::locale::classic() );
    
    	return widen( norm, std::locale::classic() );
    }
    
    
    int main()
    {
    	std::wstring s = L"Lÿdia";
    
    	std::wcout << asciify( s ) << std::endl;
    	
    	system( "pause" );
    
    	return 0;
    }
    The solution doesn't have to be standard C++ only, MFC would be ok, too. But stl would be cool. And no, I don't have boost.
    hth
    -nv

    She was so Blonde, she spent 20 minutes looking at the orange juice can because it said "Concentrate."

    When in doubt, read the FAQ.
    Then ask a smart question.

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by nvoigt View Post
    I would like to get a matching ASCII (7bit) string for a UNICODE string.

    Example: I would like to have "Lÿdia" transformed into "Lydia".
    That's not "matching" or even close to it. Unicode has room for over 1 million characters, and ASCII has 128, of which only 95 are printable.

    I'm aware that this would be a good guess at best and irreversible. However, I only have ASCII and I have to fit UNICODE in there the best looking way possible.
    They look like "just letters" to you. But wait until you start ........ing off people who can't represent their own name properly.

  3. #3
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Umlaut y (ÿ, U+00FF) falls within the Latin-1 range, so the high byte of the 2-byte wide character is zero. You could grab the low byte of the wide character and then use a lookup table to convert it to 7-bit ASCII. It's a hack.
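
A minimal sketch of that low-byte lookup, assuming the input stays within ASCII plus the accented-letter block U+00C0..U+00FF. The name `fold_char` and the table contents are illustrative choices, not something posted in this thread:

```cpp
#include <string>

// Hypothetical helper: map one wide character to an ASCII look-alike.
// Only ASCII and U+00C0..U+00FF are handled; everything else becomes '?'.
char fold_char(wchar_t c)
{
    // One entry per code point, U+00C0 through U+00FF, in order.
    // '?' marks the two non-letters in that range (U+00D7, U+00F7).
    static const char table[] =
        "AAAAAAACEEEEIIII"    // U+00C0..U+00CF
        "DNOOOOO?OUUUUYTs"    // U+00D0..U+00DF
        "aaaaaaaceeeeiiii"    // U+00E0..U+00EF
        "dnooooo?ouuuuyty";   // U+00F0..U+00FF

    // Go through an unsigned type so a signed wchar_t cannot index negatively.
    const unsigned long u = static_cast<unsigned long>(c);
    if (u < 0x80) return static_cast<char>(u);           // already ASCII
    if (u >= 0xC0 && u <= 0xFF) return table[u - 0xC0];  // accented letters
    return '?';                                          // not covered here
}

// Apply fold_char to every character of a string.
std::string fold_latin1(const std::wstring& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
        out.push_back(fold_char(text[i]));
    return out;
}
```

With this table, `fold_latin1(L"L\u00FFdia")` gives `"Lydia"`. A one-char-per-entry table cannot expand a character into two letters, so ß ends up as a single `s` here.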

    Todd

  4. #4
    the hat of redundancy hat nvoigt's Avatar
    Join Date
    Aug 2001
    Location
    Hannover, Germany
    Posts
    3,130
    Quote Originally Posted by brewbuck View Post
    That's not "matching" or even close to it. Unicode has room for over 1 million characters, and ASCII has 128, of which only 95 are printable.
    I'm well aware of this. I didn't choose the target format to be 7-bit ASCII. If you have another word instead of "matching" I'd be more than happy, because googling something I don't even have a word for really sucks.

    They look like "just letters" to you. But wait until you start ........ing off people who can't represent their own name properly.
    I will take their anger over being spelt "Lydia" any day because that is the closest representation within the limits set to me. I don't think my boss will accept any excuses if we sent a letter to "L dia" or "Ldia" instead because this is something we could do better even in freakin' 70s ASCII.

    If we have Chinese people whose names are like "small-tree, house-on-the-side, triangle" I'm out of luck. But right now, I'd be happy if "L dia" had a letter instead of a blank.

  5. #5
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by nvoigt View Post
    I'm well aware of this. I didn't choose the target format to be 7-bit ASCII. If you have another word instead of "matching" I'd be more than happy, because googling something I don't even have a word for really sucks.
    You might call it matching the closest glyph you have, but it's not matching the character.

    If we have Chinese people whose names are like "small-tree, house-on-the-side, triangle" I'm out of luck. But right now, I'd be happy if "L dia" had a letter instead of a blank.
    If you've got a large portion of people with non-ASCII characters in their names then why are you restricted to using ASCII?

  6. #6
    the hat of redundancy hat nvoigt's Avatar
    Join Date
    Aug 2001
    Location
    Hannover, Germany
    Posts
    3,130
    We have only European addresses that were entered using a European character set, so right now, I'd be very surprised to find Chinese characters. It's mostly European national character quirks, like äöü or n's with various forms of dashes, dots and hyphens on top or Danish A's with small globes... the standard stuff. It's saved as UNICODE in the database. Some legacy applications cannot handle UNICODE. So I need a representation that comes close.

    As a side effect, being able to normalize all those characters to a single base character would really help in name matching. While n with a circumflex, n with a circumgraph, n with breve and n with a tilde on top might be totally different characters, I would not trust an OCR software to pick the right one off a soaked postcard.
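
One way to picture that name-matching side effect: fold every n variant to the same base letter, lower-case the rest, and compare the resulting keys. This is only a sketch. `match_key` is a made-up name, and the table covers just a few precomposed n variants for illustration:

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: build a comparison key in which accented variants
// of 'n' all collapse to plain 'n' and ASCII letters are lower-cased.
std::string match_key(const std::wstring& name)
{
    // Illustrative subset only: N/n with tilde, acute, cedilla and caron.
    static const std::map<wchar_t, char> base = {
        { L'\u00D1', 'n' }, { L'\u00F1', 'n' },
        { L'\u0143', 'n' }, { L'\u0144', 'n' },
        { L'\u0145', 'n' }, { L'\u0146', 'n' },
        { L'\u0147', 'n' }, { L'\u0148', 'n' },
    };

    std::string key;
    key.reserve(name.size());
    for (wchar_t c : name) {
        const unsigned long u = static_cast<unsigned long>(c);
        if (u < 0x80) {
            // Plain ASCII: lower-case it so "NUNEZ" and "Nunez" agree.
            key.push_back(static_cast<char>(
                std::tolower(static_cast<unsigned char>(u))));
        } else {
            // Known variant: use its base letter; anything else: '?'.
            const std::map<wchar_t, char>::const_iterator it = base.find(c);
            key.push_back(it != base.end() ? it->second : '?');
        }
    }
    return key;
}
```

Two spellings are treated as the same name when their keys are equal, so an OCR result with the wrong diacritic on the n still matches the stored record.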

  7. #7
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by nvoigt View Post
    We have only European addresses that were entered using a European character set, so right now, I'd be very surprised to find Chinese characters. It's mostly European national character quirks, like äöü or n's with various forms of dashes, dots and hyphens on top or Danish A's with small globes... the standard stuff. It's saved as UNICODE in the database. Some legacy applications cannot handle UNICODE. So I need a representation that comes close.
    My question is, why do you have to transform it from Unicode in the first place? You mentioned printing. What exactly are you doing?

  8. #8
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    You have to create the lookup table yourself. Oh, and it can't be a wchar_t to char table. I don't need to tell you that ß will be prominent in European addresses, and it should be replaced by ss.
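
A rough sketch of such a table, mapping each wide character to a string so that one character can expand into several. Everything here is an assumption for illustration: the function name `to_ascii`, the `'?'` fallback, and the handful of entries, which are nowhere near a complete table:

```cpp
#include <map>
#include <string>

// Hypothetical helper: replace non-ASCII characters by ASCII strings.
// Characters missing from the table become `fallback`.
std::string to_ascii(const std::wstring& text, char fallback = '?')
{
    // Illustrative entries only; a real table would be much longer.
    static const std::map<wchar_t, std::string> table = {
        { L'\u00DF', "ss" },   // sharp s
        { L'\u00C6', "AE" },   // AE ligature
        { L'\u00E6', "ae" },   // ae ligature
        { L'\u00F1', "n"  },   // n with tilde
        { L'\u00FF', "y"  },   // y with diaeresis
    };

    std::string out;
    out.reserve(text.size());  // lower bound; expansions may grow it
    for (wchar_t c : text) {
        const unsigned long u = static_cast<unsigned long>(c);
        if (u < 0x80) {
            out.push_back(static_cast<char>(u));  // already 7-bit ASCII
            continue;
        }
        const std::map<wchar_t, std::string>::const_iterator it = table.find(c);
        if (it != table.end())
            out += it->second;        // may append more than one char
        else
            out.push_back(fallback);  // unknown character
    }
    return out;
}
```

Returning `std::string` rather than writing into a buffer of the input's size matters here, because the result of `to_ascii(L"Stra\u00DFe")` is one character longer than its input.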
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  9. #9
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    I was able to get "unac" running under Windows, after rewriting the front end for wchar_t strings...

    Original source: http://www.senga.org/download/unac/
    Modified source attached (rename to .zip)

    Code:
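    #include <cstdlib>   // wcstombs
    #include <cwchar>    // wcslen
    #include <iostream>
    using namespace std;
    
    #include "unac.h"
    
    int unac_string(const wchar_t *in, size_t in_length,
                    wchar_t *out, size_t &out_length)
    {
        if (!out || !out_length)
            return -1;
    
        // A length of (size_t)-1 means "in" is null-terminated
        if (in_length == static_cast<size_t>(-1))
            in_length = wcslen(in);
    
        // Capacity of "out" in wchar_t units, including the terminator
        size_t out_size = out_length;
        out_length = 0;
    
        for (size_t i = 0; i < in_length; ++i) 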
        {
            wchar_t *p;
            int l;
            wchar_t c = in[i];
    
            // Lookup the tables for decomposition information
            unac_char_utf16(c, p, l);
    
            // Make sure there is enough space to hold the decomposition
            if (out_length + l + 1 >= out_size) 
                return -2;
    
            if (l > 0) 
            {
                // If there is a decomposition, insert it in the output string.
                for (int k = 0; k < l; ++k) 
                    out[out_length++] = p[k];
            } 
            else 
            {
                // If there is no decomposition leave it unchanged
                out[out_length++] = in[i];
            }//else
        }//for
    
        out[out_length] = '\0';
        return 0;
    }//unac_string
    
    
    int main()
    {
        const wchar_t *p = //L"Lÿdia ñ";
    
            L"\"La mort d'Olivier Bécaille\" -- Émile Zola;\n"
            L"\"Das Vermächtnis des alten Pilgers\" von Rainer M. Schröder (Österreich, ÖVP);\n"
            L"\"Smyccový koncert As dur\" -- Antonín Dvorák.";
    
        wchar_t out_buff[1024];
        size_t out_len = sizeof(out_buff)/sizeof(*out_buff);
        
        int res = unac_string(p, -1, out_buff, out_len);
    
        cout << res << endl;
        cout << out_len << endl;
    
        char buff[1024];
        wcstombs(buff, out_buff, sizeof(buff));
    
        cout << buff << endl;
    
        // check buff for any non-7bit characters
        
        return 0;
    }//main
    However, this won't convert ß to ss. For something like that there's the ICU library: http://icu-project.org/
    You can play with the different transformations it can do here: http://demo.icu-project.org/icu-bin/translit

    gg

  10. #10
    the hat of redundancy hat nvoigt's Avatar
    Join Date
    Aug 2001
    Location
    Hannover, Germany
    Posts
    3,130
    Great, thanks, I'll give it a try


    @brewbuck
    The whole application (4-tier beast) cannot handle Unicode; only the database backend can. We will provide Unicode support "as soon as we have the time". Like... never. I've been preaching this for 5 years now. I guess we will only find the time when we lose money due to Asian clients.

    @CornedBee
    German umlauts are already checked before this. Are there any other non-ASCII characters that can be represented by a combination of ASCII characters?

  11. #11
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    I can think of æ, at least.

  12. #12
    Internet Superhero
    Join Date
    Sep 2006
    Location
    Denmark
    Posts
    964
    Quote Originally Posted by CornedBee View Post
    I can think of æ, at least.
    æ - ø - å?
    How I need a drink, alcoholic in nature, after the heavy lectures involving quantum mechanics.

  13. #13
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    In Swedish, common transliterations are: ä -> ae, å -> aa, ö -> oe. Obviously the same goes for the upper-case versions.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  14. #14
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    I found this website:
    http://ahinea.com/en/tech/accented-translate.html

    And I found a couple more exceptions that the unac tables don't handle:
    Code:
    00df - ß -> ss
    0152 - Œ -> Oe
    0153 - œ -> oe
    The tables do handle Æ and IJ however, which is interesting since they don't seem like "accented" characters to me.

    Also, only a handful of characters were handled in that last group of characters from that website:
    Code:
    tr/\x{00d0}\x{0110}\x{00f0}\x{0111}\x{0126}\x{0127}/DDddHh/; # ÐĐðđĦħ
    tr/\x{0131}\x{0138}\x{013f}\x{0141}\x{0140}\x{0142}/ikLLll/; # ıĸĿŁŀł
    tr/\x{014a}\x{0149}\x{014b}\x{00d8}\x{00f8}\x{017f}/NnnOos/; # ŊʼnŋØøſ
    tr/\x{00de}\x{0166}\x{00fe}\x{0167}/TTtt/;                   # ÞŦþŧ
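
For anyone staying in C++, the four Perl `tr` lines above could be ported roughly like this. It is only a sketch: `translate` and `fold_special` are made-up names, and the two parallel strings are copied from the `tr` lines, position by position:

```cpp
#include <string>

// Hypothetical tr-like helper: every character found in `from` is replaced
// by the character at the same position in `to`; others are left alone.
std::wstring translate(std::wstring text,
                       const std::wstring& from, const std::wstring& to)
{
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        const std::wstring::size_type pos = from.find(text[i]);
        if (pos != std::wstring::npos && pos < to.size())
            text[i] = to[pos];
    }
    return text;
}

// The four tr/// lines from the post, merged into one pair of strings.
std::wstring fold_special(const std::wstring& text)
{
    static const std::wstring from =
        L"\u00D0\u0110\u00F0\u0111\u0126\u0127"   // -> DDddHh
        L"\u0131\u0138\u013F\u0141\u0140\u0142"   // -> ikLLll
        L"\u014A\u0149\u014B\u00D8\u00F8\u017F"   // -> NnnOos
        L"\u00DE\u0166\u00FE\u0167";              // -> TTtt
    static const std::wstring to =
        L"DDddHh"
        L"ikLLll"
        L"NnnOos"
        L"TTtt";
    return translate(text, from, to);
}
```

This only covers one-to-one replacements; expansions such as ß -> ss or œ -> oe still need a character-to-string table.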
    gg

  15. #15
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    Wouldn't it be easier to just verify that the characters are ASCII right at the source and, if the user enters something else, give them an error? That way the user can do the character translation for you.
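
Such an input check could be as small as the sketch below. The name `is_ascii` is an assumption, and how the error is reported to the user is left out:

```cpp
#include <algorithm>
#include <string>

// Hypothetical validation helper: true if every character is 7-bit ASCII.
bool is_ascii(const std::wstring& text)
{
    return std::all_of(text.begin(), text.end(), [](wchar_t c) {
        // Compare through an unsigned type so a signed wchar_t is handled too.
        return static_cast<unsigned long>(c) <= 0x7F;
    });
}
```

An input form would reject the entry whenever `is_ascii` returns false, for example for `L"L\u00FFdia"`, and ask the user to retype it.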

    Last Post: 02-16-2002, 02:35 AM