Massive Unicode Grumble....

**~~CommonTater~~** · 05-18-2011

Boy oh boy what I wouldn't like to do to the fool who invented 1,813 different ways of storing "plain text" files....

Frustrated to no end with BOMs and LE and BE and CR/LF and LF/CR... and on and on and on...

Really... what on earth were they thinking?

**Elysia** · 05-19-2011

Use a library. Let them eat the complexity. Abstractions are good.

**whiteflags** · 05-19-2011

Or invent a time machine. This could have been prevented by protecting the Tower of Babel.

**~~CommonTater~~** · 05-19-2011

Originally Posted by Elysia

Use a library. Let them eat the complexity. Abstractions are good.

I don't shy away from complexity.
Actually I prefer to (and did) write my own library for it... nearly 100 lines of code, just to open a .txt file... My Lord!

Code:

// convert mbyte to utf16le for parser
void CopyMByte(PBYTE Buf, DWORD Bytes)
  { PWCHAR ut = calloc(Bytes + 1,sizeof(WCHAR));     // unicode buffer
    try
      { if (MultiByteToWideChar(CP_UTF8,0,(PCHAR)Buf,Bytes,ut,Bytes) < 1)   // last arg is a WCHAR count, not a byte count
          Exception(0xE0640006);
        CopyWChar( ut ); }    
    finally
      { free (ut); } }



// convert UTF-16 byte order
void FlipEndian(PBYTE Buf, DWORD Bytes)
  { BYTE t; // temp for swaps
    for (DWORD i = 0; i + 1 < Bytes; i += 2)   // unsigned index; stop before a trailing odd byte
      { t = Buf[i];
        Buf[i] = Buf[i + 1];
        Buf[i + 1] = t; } }



// open and translate file
BOOL FileLaunch(PWCHAR FileName)
  { PBYTE  rf;      // raw file data
    DWORD  br;      // bytes read
    // load the raw file
    { HANDLE pl;    // playlist file handle 
      DWORD  fs;    // file size
      // get path to file
      wcsncpy(FilePath,FileName,MAX_PATH);
      PathRemoveFileSpec(FilePath);
      wcscat(FilePath,L"\\");
      // open the file
      pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
      if (pl == INVALID_HANDLE_VALUE)
        Exception(GetLastError());
      fs = GetFileSize(pl,NULL);        
      rf = calloc(fs + 4, sizeof(BYTE));   // zero padding: wide terminator plus room for the 4-byte bom read
      if (! ReadFile(pl, rf, fs, &br, NULL))
        Exception(GetLastError());
      CloseHandle(pl);  
      if (br != fs)
        Exception(0xE0640007); } 
    try                                   
     { DWORD bom = *(DWORD*)rf;
       if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000))  // utf32le or utf32be bom
         Exception(0xE0640002);                         // utf32 is rejected
       else if ((bom & 0xFFFF) == 0xFFFE)               // utf16be bom
         { FlipEndian(rf,br);
           CopyWChar((PWCHAR) rf + 1); }
       else if ((bom & 0xFFFF) == 0xFEFF)               // utf16le bom
         CopyWChar((PWCHAR) rf + 1);  
       else if ((bom & 0xFFFFFF) == 0xBFBBEF)           // utf8 bom
         CopyMByte(rf + 3, br - 3);
       else                                             // no known bom, probe the file
         { if (! memchr(rf, 0x00, br))                  // 8 bit text has no nulls
             CopyMByte(rf,br);                          // ansi / utf8 no bom
           else 
            { PBYTE lf = memchr(rf,0x0A,br);            // lf is always present as 1 byte.
              if (!lf) 
                Exception(0xE0640003);
              if (((lf - rf >= 3) && !(*(DWORD*)(lf - 3) & 0x00FFFFFF)) ||  // utf32be no bom (never read before rf)
                   (!(*(DWORD*)lf & 0xFFFFFF00)))                           // utf32le no bom
                 Exception(0xE0640002);    
              if ((lf - rf) & 1)                        // big endian? (lf at odd offset)
                FlipEndian(rf,br);                      // utf16be no bom  
              CopyWChar((PWCHAR) rf);  } } }            // utf16le no bom
     finally  
      { free(rf); }
    return 1; }

So much for ... fp = fopen("myfile.txt","r");
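For comparison, here is a minimal sketch of the same BOM test done byte by byte instead of through a `DWORD` cast, so it does not depend on host endianness or alignment. The names `BomKind` and `detect_bom` are made up for illustration and are not part of the library posted above.

```c
#include <stddef.h>

/* Hypothetical names for illustration only. */
typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE
} BomKind;

/* Identify a byte order mark by comparing individual bytes.
   On return, *skip holds the BOM length in bytes (0 if none found). */
BomKind detect_bom(const unsigned char *buf, size_t len, size_t *skip)
{
    *skip = 0;
    /* UTF-32 is tested first: FF FE 00 00 also starts with the UTF-16LE BOM. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00) { *skip = 4; return BOM_UTF32LE; }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF) { *skip = 4; return BOM_UTF32BE; }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *skip = 3; return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) { *skip = 2; return BOM_UTF16LE; }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) { *skip = 2; return BOM_UTF16BE; }
    return BOM_NONE;
}
```

As with the `DWORD` version, a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE; that ambiguity is built into the BOM scheme itself.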

**Yarin** · 05-19-2011

Originally Posted by CommonTater

I don't shy away from complexity.
Actually I prefer to (and did) write my own library for it... nearly 100 lines of code, just to open a .txt file... My Lord!

...

So much for ... fp = fopen("myfile.txt","r");

And you even write really compact code at that.
I'm curious though, unless this is professional or something, this seems like overkill for most projects. Not very many people use things other than ASCII or UTF-8 in plain text files. And even then, in what cases do you really need to be able to open UTF-32 BE as well as ASCII?

**~~CommonTater~~** · 05-19-2011

This entire headache started when I was asked to convert a large number of playlist files that all referred to the same FTP file structure, but were written on several different machines, using differing media players and text editors... When checking the files I discovered that only about half of them were simple ASCII... some were UTF8 (which was the target format)... but the rest were a dog's breakfast of UTF16BE and UTF16LE with both Windows (CR/LF) and Posix (LF) line endings... Just to make it worse, some had Byte Order Marks, most didn't. The site operator wanted to standardize on UTF8 because it's the "new" internet standard and because he figured it would save him a fair chunk of disk space. My job was first to batch convert what he already had then automate the conversion for new uploads. Some of these playlists have 10,000 files listed in them.
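The last step of a job like this, writing UTF-16 text out as UTF-8, can be sketched in portable C roughly as follows. This is an illustration only, not the code used for the conversion described here: `utf8_encode` and `utf16_to_utf8` are hypothetical names, and the input is assumed to be in host byte order already (swapped first if it came in as UTF16BE).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8.
   Returns the number of bytes written (1-4), or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert n host-order UTF-16 code units to UTF-8.
   Returns bytes written, or (size_t)-1 on an unpaired surrogate
   or if dst (capacity cap bytes) is too small. */
size_t utf16_to_utf8(const uint16_t *src, size_t n, unsigned char *dst, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            /* high surrogate */
            if (i + 1 >= n || src[i + 1] < 0xDC00 || src[i + 1] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(src[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     /* stray low surrogate */
            return (size_t)-1;
        }
        unsigned char tmp[4];
        size_t k = utf8_encode(cp, tmp);
        if (k == 0 || o + k > cap)
            return (size_t)-1;
        for (size_t j = 0; j < k; j++)
            dst[o++] = tmp[j];
    }
    return o;
}
```

For mostly ASCII playlists each character goes from two bytes to one, which is where the disk space saving comes from.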

That's where that FileLaunch function came from...
It will open any text format ... ASCII, ANSI, UTF16LE/BOM, UTF16LE, UTF16BE/BOM, UTF16BE, UTF8/BOM or UTF8... it also handles line endings such as CR/LF, LF/CR, CR, LF, NULL...
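The line-ending side is not shown in the code posted above, so here is a rough sketch of one way to do it once the text is in a wide buffer. It assumes every break should become a single LF, the function name `normalize_newlines` is hypothetical, and the NULL-as-terminator case is not handled.

```c
#include <stddef.h>
#include <wchar.h>

/* Collapse CR LF, LF CR, lone CR and lone LF to a single LF, in place.
   Returns the new length in characters (never larger than n). */
size_t normalize_newlines(wchar_t *s, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        wchar_t c = s[i];
        if (c == L'\r' || c == L'\n') {
            /* A directly following partner of the opposite kind
               belongs to the same line break, so skip it. */
            if (i + 1 < n && s[i + 1] != c &&
                (s[i + 1] == L'\r' || s[i + 1] == L'\n'))
                i++;
            s[out++] = L'\n';
        } else {
            s[out++] = c;
        }
    }
    return out;
}
```

Two identical characters in a row (LF LF or CR CR) are kept as two breaks, so blank lines survive.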

But it did not pay my increased Excedrin bill.

How often do you encounter this? Well, if you are working with files that you did not create... it could happen at any time. In fact, it will happen more and more as time goes on and we move away from the "English is the language of computing" mindset. Everything is going to get more complex...
I recently had the joy (?) of watching one of my programs working with Cyrillic file names...

I really do wish they would settle down and say "From now on text is represented as 32bit values"... and get back to the "one standard" rule... This business of having to crack upwards of 20 possibilities in every program where I load a file is just ridiculous.

I'm just waiting for the day an instructor gives his class a text file to process in an exercise and saves the thing as UTF16BE/BOM for a course based on Windows.

**phantomotap** · 05-19-2011

O_o

If you add a tiny bit of "bug fix" code to this Windows API you'll have a cleaner, more robust checking mechanism.

[Edit]
Only offered because the code is already tied to Microsoft "Windows" API. If it wasn't for that I'd just suggest spending some more time with that code.
[/Edit]

Soma

IsTextUnicode Function (Windows)

**Elysia** · 05-19-2011

Your code is unreadable due to poor indentation (or should I say code style?).
And where does finally come from?
And finally, if you would just look around, you would find a good Unicode handling library. There's even one for C++ that makes opening files a cinch!
You don't need to reinvent the wheel every time.

**CornedBee** · 05-19-2011

Originally Posted by CommonTater

I really do wish they would settle down and say "From now on text is represented as 32bit values"... and get back to the "one standard" rule... This business of having to crack upwards of 20 possibilities in every program where I load a file is just ridiculous.

Well, you know how it works. When you try to replace N technologies with a new one, you end up with N+1 technologies.

**phantomotap** · 05-19-2011

Wouldn't help much anyway if you work on the glyph/shaping/layout side of things.

Soma

**~~CommonTater~~** · 05-19-2011

Originally Posted by Elysia

Your code is unreadable due to poor indentation (or should I say code style?).
And where does finally come from?
And finally, if you would just look around, you would find a good Unicode handling library. There's even one for C++ that makes opening files a cinch!
You don't need to reinvent the wheel every time.

We've had this discussion before... you don't like the way I set up my code... I do. (Get used to it.)
try and finally are implemented by another library I wrote which does SEH on Pelles C. (Open Source on PellesC forums)

Why would I look for another library? I already wrote one... It's all nicely tucked away as a .lib file I can use any time I need it.

Moreover, it's far more successful than the IsTextUnicode() call in WinApi... theirs got about half the files I was working with wrong... mine hasn't missed yet and it goes the extra bit of actually translating the text to wchar_t (UTF16LE) for me.

I dunno, Elysia, you seem bent upon using 3rd party libraries for everything. I've never used one yet (except to mess around with). I actually enjoy the challenges of writing my own .lib files...

**~~CommonTater~~** · 05-19-2011

Originally Posted by CornedBee

Well, you know how it works. When you try to replace N technologies with a new one, you end up with N+1 technologies.

Ain't that the truth.

**Yarin** · 05-19-2011

Originally Posted by CommonTater

... Elysia, you seem bent upon using 3rd party libraries for everything. I've never used one yet (except to mess around with). ...

O.o I find that hard to believe

**~~CommonTater~~** · 05-19-2011

Originally Posted by Yarin

O.o I find that hard to believe

LOL ... and how would you have me convince you?

Really... I use Pelles C and Windows API... everything else comes directly from me.

**whiteflags** · 05-20-2011

But Windows API is a third party library: The first party includes your stuff, the second party would be ISO C stuff, and the third party is Windows stuff. "Never" makes the statement false more often than not, just like you learned in school.
