ASCII checking and UNICODE conversion

**TheComedian** · 03-22-2011

Greetings everybody,
I need some advice on a simple program I'm writing.
I have to read input from stdin, check for any non-ASCII character, and replace it with a 7-bit ASCII character.
I'm using fgetc and storing its result in an int instead of a char, so I can check whether the value is above 127. If it is, I proceed with the conversion.
Now, since different environments use different encodings (e.g. è is 138 in the "ASCII" console code page and 232 in ANSI), I thought about converting each character to UNICODE first and then to what it should be in 7-bit ASCII, so è would become e. I read about wchar_t and came up with this (just a short summary of the full code):

Code:

int convert(int *c);

int func1()
{
 int i;
 while ((i = fgetc(stdin)) != EOF)
 {
  //some code....
  if (i > 127)
   convert(&i);
 }
 return 1;
}
int convert(int *c)
{
 wchar_t z = *c;
 if (z == 0x00E8) //'è' in UNICODE
  *c = 'e';
 else...//various checkings
 return 1;
}

Then, just to be sure, I thought about printing z to check its value. I tried these:

Code:

printf(L"%lc", z);
printf(L"%ls", z);
printf("%lc", z);
printf("%ls", z);

I also tried the same calls with wprintf(), but nothing was displayed on screen.
Do you think there is something wrong with what I'm doing, either in the code or in my thought process?

**prog-bman** · 03-22-2011

Shouldn't you be using the wide character functions for your data input?

Have a look at wchar.h.
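For illustration, here is a minimal sketch of what reading through the wide character functions could look like. It assumes the program calls `setlocale(LC_ALL, "")` at startup, that the locale's encoding matches the input, and that `wchar_t` values are Unicode code points on the platform (true for glibc, and for the characters shown here on Windows). The names `fold_to_ascii` and `filter_stream` and the small mapping table are made up for this example and are not from the original posts.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: map a decoded character to a 7-bit replacement.
   Only a few accented Latin letters are listed; anything else above 127
   becomes '?'. */
char fold_to_ascii(wint_t wc)
{
    if (wc < 128)
        return (char)wc;
    switch (wc) {
    case 0x00E0: case 0x00E1: case 0x00E2: return 'a';
    case 0x00E8: case 0x00E9: case 0x00EA: return 'e';
    case 0x00EC: case 0x00ED:              return 'i';
    case 0x00F2: case 0x00F3:              return 'o';
    case 0x00F9: case 0x00FA:              return 'u';
    default:                               return '?';
    }
}

/* Read `in` through the locale's decoder (fgetwc) and write plain ASCII
   to `out`. The loop stops at end of input or at the first byte sequence
   the locale cannot decode. */
void filter_stream(FILE *in, FILE *out)
{
    wint_t wc;

    assert(in != NULL && out != NULL);
    while ((wc = fgetwc(in)) != WEOF)
        fputc(fold_to_ascii(wc), out);
}
```

A caller would include `<locale.h>`, run `setlocale(LC_ALL, "")` once, and then call `filter_stream(stdin, stdout)`. The decoding is done by `fgetwc`, not by the assignment to a wide type.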

**TheComedian** · 03-22-2011

I tried using wchar_t as the data type. Now it prints on screen, but it doesn't work as I expected:

I thought that by converting to UNICODE first I'd be able to determine which character I'm actually reading. Now it seems that this:

Code:

wchar_t a = fgetc(stdin);
wprintf(L"CHAR: %lc", a);

gives a different result in the ASCII console than in the ANSI one, even though the program prints the same character in both.
Example: I type è. Both the ANSI and the ASCII console show that the character passed was è, but only the ASCII console understands that I'm writing 0x00E8; the ANSI one doesn't. Am I missing something here?

**~~CommonTater~~** · 03-22-2011

If you're trying to do this in console mode on Windows, you should know that a console window can be Unicode (utf16le) or ANSI (ASCII + code page) *but not both*. You cannot simultaneously display ANSI and Unicode in the same console window... It won't do it.

Also you should know that Unicode is not convertible to ANSI. When represented as multibyte characters, a single Unicode code point can expand to as many as 4 separate bytes in utf8, which is totally undisplayable in ASCII.

In all truth there is *no such thing* as Unicode to ansi conversion. You can represent ansi and ascii characters as unicode... but not the other way around.
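As a rough illustration of the multibyte expansion, the sketch below encodes one code point as UTF-8 following the byte layout in RFC 3629, which allows 1 to 4 bytes per code point. The function name `utf8_encode` is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if `cp` is a
   surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                         /* plain ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                      /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                        /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, è (U+00E8) becomes the two bytes C3 A8, and every byte of a multibyte sequence is above 127, so none of them is valid 7-bit ASCII on its own.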

I just finished a project in which I was required to convert Unicode text files in several different formats and languages to utf8 for internet access (via ftp) and it's anything but simple. Here's what it takes to simply open a unicode text file... converting it to utf8 takes about the same amount of code again. But it will never be a plain ascii file...

Code:

// open and translate file
BOOL M3ULaunch(PWCHAR FileName)
  { PBYTE  rf;      // raw file data
    DWORD  br;      // bytes read
    // load the raw file
    { HANDLE pl;    // playlist file handle 
      DWORD  fs;    // file size
      // get path to file
      wcsncpy(FilePath,FileName,MAX_PATH);
      PathRemoveFileSpec(FilePath);
      wcscat(FilePath,L"\\");
      // open the file
      pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
      if (pl == INVALID_HANDLE_VALUE)
        Exception(GetLastError());
      fs = GetFileSize(pl,NULL);        
      rf = calloc(fs + 2, sizeof(BYTE));
      if (! ReadFile(pl, rf, fs, &br, NULL))
        Exception(GetLastError());
      CloseHandle(pl);  
      if (br != fs)
        Exception(0xE0640007); } 
    try                                   
     { DWORD bom = *(DWORD*)rf;
       if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000))  // utf32le bom  
         Exception(0xE0640002);                         // utf32be bom  
       else if ((bom & 0xFFFF) == 0xFFFE)               // utf16be bom
         { FlipEndian(rf,br);
           CopyWChar((PWCHAR) rf + 1); }
       else if ((bom & 0xFFFF) == 0xFEFF)               // utf16le bom
         CopyWChar((PWCHAR) rf + 1);  
       else if ((bom & 0xFFFFFF) == 0xBFBBEF)           // utf8 bom
         CopyMByte(rf + 3, br - 3);
       else                                             // no known bom, probe the file
         { if (! memchr(rf, 0x00, br))                  // 8 bit text has no nulls
             CopyMByte(rf,br);                          // ansi / utf8 no bom
           else 
            { PBYTE lf = memchr(rf,0x0A,br);            // lf is always present as 1 byte.
              if (!lf) 
                Exception(0xE0640003);
              if ((!(*(DWORD*)(lf - 3) & 0x00FFFFFF)) ||    //utf32be no bom
                   (!(*(DWORD*)lf & 0xFFFFFF00)))           //utf32le no bom
                 Exception(0xE0640002);    
              if ((lf - rf) & 1)                        // big endian? (lf at odd offset)
                FlipEndian(rf,br);                      // utf16be no bom  
              CopyWChar((PWCHAR) rf);  } } }            // utf16le no bom
     finally  
      { free(rf); }
    return 1; }
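For readers not on Windows, the BOM test at the top of that routine can be sketched in portable C. This is only an approximation of the idea, not the code above: it compares raw bytes instead of casting to a DWORD, so it does not depend on the host's byte order, and it skips the no-BOM probing entirely. The names `encoding_t` and `detect_bom` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    ENC_UNKNOWN,   /* no recognized BOM */
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE,
    ENC_UTF32LE,
    ENC_UTF32BE
} encoding_t;

/* Identify the encoding from a byte order mark at the start of `buf`.
   UTF-32LE must be tested before UTF-16LE because its BOM (FF FE 00 00)
   begins with the UTF-16LE BOM (FF FE). */
encoding_t detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return ENC_UTF32LE;
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return ENC_UTF32BE;
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return ENC_UTF8;
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return ENC_UTF16LE;
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return ENC_UTF16BE;
    return ENC_UNKNOWN;
}
```

A file with no BOM comes back as `ENC_UNKNOWN`, and the caller still has to guess or be told the encoding, which is the hard part the routine above handles by probing for nulls and line feeds.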

**bithub** · 03-22-2011

Just because you put the result into a wchar_t doesn't make it Unicode. Unicode is just a character set; it can't even be represented as bytes without using an encoding. I think you're getting in a little over your head here. The fact that you're asking to convert non-ASCII characters to ASCII shows how much you need to learn. For instance, how would you convert the ﳍ character to ASCII?
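To make the point about encodings concrete, here is a small sketch showing that a byte only becomes a character once you say which encoding it is in. It assumes the two encodings involved are Windows-1252 (the usual "ANSI" code page in Western Europe) and code page 437 (a common console code page), and it lists only a handful of entries from each table. The function name `decode_byte` is invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode a single byte to a Unicode code point under a named legacy
   encoding. Returns U+FFFD (the replacement character) for bytes this
   deliberately tiny table does not cover. */
uint32_t decode_byte(unsigned char b, const char *encoding)
{
    assert(encoding != NULL);
    if (b < 0x80)
        return b;                             /* ASCII is the same in both */
    if (strcmp(encoding, "CP1252") == 0) {
        if (b == 0xE8) return 0x00E8;         /* è */
        if (b == 0xE9) return 0x00E9;         /* é */
    } else if (strcmp(encoding, "CP437") == 0) {
        if (b == 0x8A) return 0x00E8;         /* è */
        if (b == 0x82) return 0x00E9;         /* é */
        if (b == 0xE8) return 0x03A6;         /* Φ, not è */
    }
    return 0xFFFD;
}
```

The same è arrives as byte 0xE8 in one environment and 0x8A in the other, and byte 0xE8 means a different letter under code page 437. Copying the byte into a wchar_t skips this lookup, which is why the comparison against 0x00E8 only appears to work in one console.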
