Thread: ASCII checking and UNICODE conversion

  1. #1
    Registered User
    Join Date
    Jan 2009
    Posts
    5

    ASCII checking and UNICODE conversion

    Greetings everybody,
    I need some advice for a simple program that I'm doing.
    I have to take input from stdin, check for any non-ASCII character and replace it with a 7-bit ASCII character.
    I'm using fgetc and storing its result in an int instead of a char, so that I can check whether the value is above 127. If it is, I proceed with the conversion.
    Now, considering that different environments mean different encodings, e.g. è is 138 in extended ASCII (the console's OEM code page) and 232 in ANSI, I thought about converting the single char to UNICODE, then converting it again to what it should be in 7-bit ASCII, e.g. è would become e. I read about wchar_t, and came up with this (just a short summary of the full code):
    Code:
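    int convert(int *c); //declared before use

    int func1()
    {
     int i;
     while ((i = fgetc(stdin)) != EOF)
     {
      //some code....
      if (i > 127)
       convert(&i);
     }
     return 1;
    }
    int convert(int *c) //takes int*, matching the int read by fgetc
    {
     wchar_t z = *c; //only widens the byte value, no decoding happens here
     if (z == 0x00E8) //'è' in UNICODE
      *c = 'e';
     //else... various checkings
     return 1;
    }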
    Then, just to be sure, I thought about printing z to check its value. I tried these:
    Code:
    printf(L"%lc", z);
    printf(L"%ls", z);
    printf("%lc", z);
    printf("%ls", z);
    and the same with wprintf(), but nothing was displayed on screen.
    Do you think there is something wrong with what I'm doing, either in the code or in the thought process?
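
    The replacement step described above (è becomes e) can be sketched as a lookup from a Unicode code point to a 7-bit character. This is only an illustration: the name ascii_fallback and the choice of '?' for unmapped characters are assumptions, not part of the original program, and the table covers just a handful of Latin-1 letters.

```c
/* Hypothetical helper: map a Unicode code point to a 7-bit ASCII
   replacement. Only a few accented Latin-1 vowels are listed;
   anything not in the table becomes '?'. */
char ascii_fallback(unsigned long cp)
{
    if (cp < 0x80)
        return (char)cp;                                /* already ASCII */

    switch (cp) {
    case 0x00E0: case 0x00E1: case 0x00E2: return 'a';  /* à á â */
    case 0x00E8: case 0x00E9: case 0x00EA: return 'e';  /* è é ê */
    case 0x00EC: case 0x00ED: case 0x00EE: return 'i';  /* ì í î */
    case 0x00F2: case 0x00F3: case 0x00F4: return 'o';  /* ò ó ô */
    case 0x00F9: case 0x00FA: case 0x00FB: return 'u';  /* ù ú û */
    default:                               return '?';  /* no sensible ASCII form */
    }
}
```

    The function only makes sense once the input byte has been decoded into a real code point; it says nothing about which encoding the byte came from.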

  2. #2
    Sweet
    Join Date
    Aug 2002
    Location
    Tucson, Arizona
    Posts
    1,820
    Shouldn't you be using the wide character functions to do your data input?

    Look for wchar.h
    Woop?

  3. #3
    Registered User
    Join Date
    Jan 2009
    Posts
    5
    I tried to use wchar_t as the data type. Now it prints on screen, but it doesn't work as I expected:

    I thought by converting to UNICODE first, then I'd be able to determine what character I'm actually reading. Now it seems that this:
    Code:
    wchar_t a = fgetc(stdin);
    wprintf(L"CHAR: %lc", a);
    gives a different result on the ASCII console than on the ANSI one, even though the program says the character is the same.
    Example: I write è. Both the ANSI and the ASCII console say that the character passed was è, but only the ASCII console understands that I'm writing 0x00E8; the ANSI one doesn't. Am I missing something here?
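
    One way to see why the two consoles disagree: assigning the result of fgetc to a wchar_t only widens the byte value, so the number still depends on the encoding the byte was typed in. A minimal sketch of an explicit decoding step follows. It assumes, purely for illustration, that one console uses ISO-8859-1 and the other CP437 (the thread does not say which code pages are involved); the names source_encoding and byte_to_codepoint are hypothetical, and the CP437 table is deliberately partial.

```c
enum source_encoding { SRC_LATIN1, SRC_CP437 };

/* Hypothetical helper: turn one input byte into a Unicode code point,
   given the encoding the byte was written in. Returns 0xFFFD
   (REPLACEMENT CHARACTER) for bytes this partial table does not cover. */
unsigned long byte_to_codepoint(unsigned char b, enum source_encoding enc)
{
    if (b < 0x80)
        return b;               /* ASCII is the same in both encodings */

    if (enc == SRC_LATIN1)
        return b;               /* ISO-8859-1 bytes equal their code points */

    switch (b) {                /* a few CP437 entries only */
    case 0x82: return 0x00E9;   /* é */
    case 0x85: return 0x00E0;   /* à */
    case 0x8A: return 0x00E8;   /* è */
    case 0x8D: return 0x00EC;   /* ì */
    case 0x95: return 0x00F2;   /* ò */
    case 0x97: return 0x00F9;   /* ù */
    default:   return 0xFFFD;   /* not in this sketch's table */
    }
}
```

    With this, byte 0xE8 read as ISO-8859-1 and byte 0x8A read as CP437 both come out as 0x00E8, which is the "same character" the program is expected to recognise.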

  4. #4
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    If you're trying to do this in console mode on windows you should know that a console window can be unicode (utf16le) or ansi (ascii + code page) *but not both*. You cannot simultaneously display ascii and unicode in the same console window... It won't do it.

    Also, you should know that Unicode is not convertible to ansi. When represented as multibyte characters, some Unicode code points expand to as many as 4 separate bytes in utf8, which is totally undisplayable in ascii.

    In all truth there is *no such thing* as Unicode to ansi conversion. You can represent ansi and ascii characters as unicode... but not the other way around.
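
    To make the multibyte expansion concrete, here is a minimal sketch of a UTF-8 encoder for a single code point, following the 1 to 4 byte layout of current UTF-8 (RFC 3629). The function name utf8_encode is an assumption made for this example and is not taken from the project mentioned in this post.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out, which must have
   room for at least 4 bytes. Returns the number of bytes written, or
   0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                     /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

    For è (U+00E8) this writes the two bytes 0xC3 0xA8, neither of which is a 7-bit ASCII value.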

    I just finished a project in which I was required to convert Unicode text files in several different formats and languages to utf8 for internet access (via ftp) and it's anything but simple. Here's what it takes to simply open a unicode text file... converting it to utf8 takes about the same amount of code again. But it will never be a plain ascii file...
    Code:
    // open and translate file
    BOOL M3ULaunch(PWCHAR FileName)
      { PBYTE  rf;      // raw file data
        DWORD  br;      // bytes read
        // load the raw file
        { HANDLE pl;    // playlist file handle 
          DWORD  fs;    // file size
          // get path to file
          wcsncpy(FilePath,FileName,MAX_PATH);
          PathRemoveFileSpec(FilePath);
          wcscat(FilePath,L"\\");
          // open the file
          pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
          if (pl == INVALID_HANDLE_VALUE)
            Exception(GetLastError());
          fs = GetFileSize(pl,NULL);        
          rf = calloc(fs + 2, sizeof(BYTE));
          if (! ReadFile(pl, rf, fs, &br, NULL))
            Exception(GetLastError());
          CloseHandle(pl);  
          if (br != fs)
            Exception(0xE0640007); } 
        try                                   
         { DWORD bom = *(DWORD*)rf;
           if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000))  // utf32le bom  
             Exception(0xE0640002);                         // utf32be bom  
           else if ((bom & 0xFFFF) == 0xFFFE)               // utf16be bom
             { FlipEndian(rf,br);
               CopyWChar((PWCHAR) rf + 1); }
           else if ((bom & 0xFFFF) == 0xFEFF)               // utf16le bom
             CopyWChar((PWCHAR) rf + 1);  
           else if ((bom & 0xFFFFFF) == 0xBFBBEF)           // utf8 bom
             CopyMByte(rf + 3, br - 3);
           else                                             // no known bom, probe the file
             { if (! memchr(rf, 0x00, br))                  // 8 bit text has no nulls
                 CopyMByte(rf,br);                          // ansi / utf8 no bom
               else 
                { PBYTE lf = memchr(rf,0x0A,br);            // lf is always present as 1 byte.
                  if (!lf) 
                    Exception(0xE0640003);
                  if ((!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) ||    //utf32be no bom
                       (!(*(DWORD*)lf & 0xFFFFFF00)))           //utf32le no bom
                     Exception(0xE0640002);    
                  if ((lf - rf) & 1)                        // big endian? (lf at odd offset)
                    FlipEndian(rf,br);                      // utf16be no bom  
                  CopyWChar((PWCHAR) rf);  } } }            // utf16le no bom
         finally  
          { free(rf); }
        return 1; }
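
    The BOM test in the listing above reads the first four bytes through a DWORD cast, which relies on a little-endian machine and a buffer of at least four bytes. A portable variant can compare the bytes one at a time. This is a sketch only: the enum text_encoding and the function detect_bom are hypothetical names, and it covers just the BOM checks, not the "no BOM, probe the file" branch.

```c
#include <stddef.h>

enum text_encoding {
    ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
};

/* Hypothetical helper: identify a byte order mark at the start of a
   buffer of n bytes. UTF-32 is tested before UTF-16 because the
   UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). */
enum text_encoding detect_bom(const unsigned char *p, size_t n)
{
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return ENC_UTF32LE;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return ENC_UTF32BE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return ENC_UTF8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ENC_UTF16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ENC_UTF16BE;
    return ENC_UNKNOWN;         /* no BOM: the caller has to probe the data */
}
```

    One ambiguity remains, as in the original listing: a UTF-16LE file whose first character after the BOM is U+0000 is reported as UTF-32LE.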

  5. #5
    Registered User
    Join Date
    Sep 2004
    Location
    California
    Posts
    3,268
    Just because you put the result into a wchar_t doesn't make it Unicode. Unicode is just a character set; it can't even be represented in bytes without using an encoding. I think you're getting in a little over your head here. The fact that you're asking to convert non-ASCII characters to ASCII shows how much you need to learn. For instance, how would you convert the ﳍ character to ASCII?
    bit∙hub [bit-huhb] n. A source and destination for information.
