Thread: Massive Unicode Grumble....

  1. #1
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547

    Massive Unicode Grumble....

    Boy oh boy what I wouldn't like to do to the fool who invented 1,813 different ways of storing "plain text" files....

    Frustrated to no end with BOMs and LE and BE and CR/LF and LF/CR... and on and on and on...

    Really... what on earth were they thinking?

  2. #2
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Use a library. Let them eat the complexity. Abstractions are good.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  3. #3
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    Or invent a time machine. This could have been prevented by protecting the Tower of Babel.

  4. #4
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Elysia View Post
    Use a library. Let them eat the complexity. Abstractions are good.
    I don't shy away from complexity.
    Actually I prefer to (and did) write my own library for it... nearly 100 lines of code, just to open a .txt file... My Lord!

    Code:
    // convert mbyte to utf16le for parser
    void CopyMByte(PBYTE Buf, DWORD Bytes)
      { PWCHAR ut = calloc(Bytes + 1,sizeof(WCHAR));     // unicode buffer
        try
          { if (MultiByteToWideChar(CP_UTF8,0,(PCHAR)Buf,Bytes,ut,Bytes) < 1)   // last arg counts WCHARs, not bytes
              Exception(0xE0640006);
            CopyWChar( ut ); }    
        finally
          { free (ut); } }
    
    
    
    // convert UTF-16 byte order
    void FlipEndian(PBYTE Buf, DWORD Bytes)
      { BYTE t; // temp for swaps
        for (DWORD i = 0; i + 1 < Bytes; i += 2)     // stop before a dangling odd byte
          { t = Buf[i];
            Buf[i] = Buf[i + 1];
            Buf[i + 1] = t; } }
    
    
    
    // open and translate file
    BOOL FileLaunch(PWCHAR FileName)
      { PBYTE  rf;      // raw file data
        DWORD  br;      // bytes read
        // load the raw file
        { HANDLE pl;    // playlist file handle 
          DWORD  fs;    // file size
          // get path to file
          wcsncpy(FilePath,FileName,MAX_PATH);
          PathRemoveFileSpec(FilePath);
          wcscat(FilePath,L"\\");
          // open the file
          pl = CreateFile(FileName,GENERIC_READ,0,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
          if (pl == INVALID_HANDLE_VALUE)
            Exception(GetLastError());
          fs = GetFileSize(pl,NULL);        
          rf = calloc(fs + 4, sizeof(BYTE));  // zero padding keeps the DWORD reads inside the buffer
          if (! ReadFile(pl, rf, fs, &br, NULL))
            Exception(GetLastError());
          CloseHandle(pl);  
          if (br != fs)
            Exception(0xE0640007); } 
        try                                   
         { DWORD bom = *(DWORD*)rf;
           if ((bom == 0x0000FEFF) || (bom == 0xFFFE0000))  // utf32le bom  
             Exception(0xE0640002);                         // utf32be bom  
           else if ((bom & 0xFFFF) == 0xFFFE)               // utf16be bom
             { FlipEndian(rf,br);
               CopyWChar((PWCHAR) rf + 1); }
           else if ((bom & 0xFFFF) == 0xFEFF)               // utf16le bom
             CopyWChar((PWCHAR) rf + 1);  
           else if ((bom & 0xFFFFFF) == 0xBFBBEF)           // utf8 bom
             CopyMByte(rf + 3, br - 3);
           else                                             // no known bom, probe the file
             { if (! memchr(rf, 0x00, br))                  // 8 bit text has no nulls
                 CopyMByte(rf,br);                          // ansi / utf8 no bom
               else 
                { PBYTE lf = memchr(rf,0x0A,br);            // lf is always present as 1 byte.
                  if (!lf) 
                    Exception(0xE0640003);
                  if (((lf - rf >= 3) && !(*(DWORD*)(lf - 3) & 0x00FFFFFF)) ||    //utf32be no bom (don't read before rf)
                       (!(*(DWORD*)lf & 0xFFFFFF00)))           //utf32le no bom
                     Exception(0xE0640002);    
                  if ((lf - rf) & 1)                        // big endian? (lf at odd offset)
                    FlipEndian(rf,br);                      // utf16be no bom  
                  CopyWChar((PWCHAR) rf);  } } }            // utf16le no bom
         finally  
          { free(rf); }
        return 1; }

    So much for ... fp = fopen("myfile.txt","r");
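
    For readers without the Windows headers, the BOM branch of the code above can be sketched in portable C. This is only an illustration: the Encoding enum and detect_bom() are made-up names for this sketch, not part of the library posted here, and it compares bytes one at a time instead of reading a DWORD, so it does not depend on the machine's byte order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; none of them come from the code above. */
typedef enum {
    ENC_UNKNOWN,   /* no BOM found; the caller has to probe the content */
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE,
    ENC_UTF32LE,
    ENC_UTF32BE
} Encoding;

/* Inspect the first bytes of buf and report which BOM, if any, is present.
   *bom_len receives the number of bytes the BOM occupies (0 if none). */
static Encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    assert(buf != NULL && bom_len != NULL);
    *bom_len = 0;

    /* UTF-32 must be tested before UTF-16: FF FE 00 00 starts with FF FE. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00) {
        *bom_len = 4;
        return ENC_UTF32LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return ENC_UTF32BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    return ENC_UNKNOWN;
}
```

    One known ambiguity remains: a UTF-16LE file whose first character after the BOM is U+0000 is byte-for-byte identical to a UTF-32LE BOM, so the sketch (like the DWORD test above) reports it as UTF-32LE.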
    Last edited by CommonTater; 05-19-2011 at 09:48 AM.

  5. #5
    Unregistered User Yarin's Avatar
    Join Date
    Jul 2007
    Posts
    2,158
    Quote Originally Posted by CommonTater View Post
    I don't shy away from complexity.
    Actually I prefer to (and did) write my own library for it... nearly 100 lines of code, just to open a .txt file... My Lord!

    ...

    So much for ... fp = fopen("myfile.txt","r");
    And you even write really compact code at that.
    I'm curious though: unless this is professional or something, this seems like overkill for most projects. Not very many people use anything other than ASCII or UTF-8 in plain text files. And even then, in what cases do you really need to be able to open UTF-32 BE as well as ASCII?

  6. #6
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    This entire headache started when I was asked to convert a large number of playlist files that all referred to the same FTP file structure, but were written on several different machines, using differing media players and text editors... When checking the files I discovered that only about half of them were simple ASCII... some were UTF8 (which was the target format)... but the rest were a dog's breakfast of UTF16BE and UTF16LE with both Windows (CR/LF) and Posix (LF) line endings... Just to make it worse, some had Byte Order Marks, most didn't. The site operator wanted to standardize on UTF8 because it's the "new" internet standard and because he figured it would save him a fair chunk of disk space. My job was first to batch convert what he already had then automate the conversion for new uploads. Some of these playlists have 10,000 files listed in them.
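
    The output half of a job like that, going from UTF-16LE to the UTF8 target, is the mechanical part. A minimal portable sketch, assuming the input is already known to be UTF-16LE with any BOM stripped; utf8_encode() and utf16le_to_utf8() are hypothetical names for illustration (on Windows, WideCharToMultiByte does the same work):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes required).
   Returns the number of bytes written, or 0 if cp is not encodable. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                               /* unpaired surrogate */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert in_len bytes of UTF-16LE to UTF-8.
   Returns bytes written, or (size_t)-1 on malformed input or a full buffer. */
static size_t utf16le_to_utf8(const unsigned char *in, size_t in_len,
                              unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    if (in_len % 2 != 0)
        return (size_t)-1;                      /* odd byte count */
    while (i < in_len) {
        unsigned char tmp[4];
        size_t n, k;
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {     /* high surrogate: needs a partner */
            uint32_t lo;
            if (in_len - i < 2)
                return (size_t)-1;
            lo = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;
        }
        n = utf8_encode(cp, tmp);
        if (n == 0 || o + n > out_cap)
            return (size_t)-1;
        for (k = 0; k < n; k++)
            out[o++] = tmp[k];
    }
    return o;
}
```

    An output buffer of 3 bytes per input code unit is always enough. The hard part was the input side, figuring out what each file actually was.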

    That's where that FileLaunch function came from...
    It will open any text format ... ASCII, ANSI, UTF16LE/BOM, UTF16LE, UTF16BE/BOM, UTF16BE, UTF8/BOM or UTF8... it also handles line endings such as CR/LF, LF/CR, CR, LF, NULL...
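
    The line-ending handling is not shown in the posted code, so here is one way it could look for 8-bit text. This is a sketch under assumptions, not the library's actual routine: normalize_newlines() is a hypothetical name, it only covers CR/LF, LF/CR, CR and LF (not NULL-separated lines), and it pairs up a CR and LF greedily whenever they sit next to each other.

```c
#include <assert.h>
#include <stddef.h>

/* Rewrite buf in place so that every line ending (CR/LF, LF/CR, CR or LF)
   becomes a single LF. Returns the new length, which is never larger than len. */
static size_t normalize_newlines(char *buf, size_t len)
{
    size_t r = 0;   /* read position  */
    size_t w = 0;   /* write position */
    assert(buf != NULL || len == 0);
    while (r < len) {
        char c = buf[r++];
        if (c == '\r' || c == '\n') {
            /* Swallow the partner of a two-byte ending: CR LF or LF CR. */
            if (r < len && (buf[r] == '\r' || buf[r] == '\n') && buf[r] != c)
                r++;
            buf[w++] = '\n';
        } else {
            buf[w++] = c;
        }
    }
    return w;
}
```

    For UTF-16 data the same loop would have to walk 16-bit units instead of bytes, since 0x0A and 0x0D can also appear as one half of an unrelated character.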

    But it did not pay my increased Excedrin bill.

    How often do you encounter this? Well, if you are working with files that you did not create... it could happen at any time. In fact, it will happen more and more as time goes on and we move away from the "English is the language of computing" mindset. Everything is going to get more complex...
    I recently had the joy (?) of watching one of my programs working with Cyrillic file names...

    I really do wish they would settle down and say "From now on text is represented as 32bit values"... and get back to the "one standard" rule... This business of having to crack upwards of 20 possibilities in every program where I load a file is just ridiculous.
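
    The catch with "32bit values everywhere" is storage, which is the same reason the site operator wanted UTF8. A rough sketch of the arithmetic, with EncodedSizes and encoded_sizes() as made-up names for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte totals for the same text in each encoding (hypothetical helper type). */
typedef struct {
    size_t utf8;
    size_t utf16;
    size_t utf32;
} EncodedSizes;

/* Add up how many bytes `count` code points need in UTF-8, UTF-16 and UTF-32. */
static EncodedSizes encoded_sizes(const uint32_t *cps, size_t count)
{
    EncodedSizes s = {0, 0, 0};
    size_t i;
    for (i = 0; i < count; i++) {
        uint32_t cp = cps[i];
        assert(cp <= 0x10FFFF);                 /* outside Unicode otherwise */
        s.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        s.utf16 += cp < 0x10000 ? 2 : 4;        /* surrogate pair above the BMP */
        s.utf32 += 4;                           /* always one 32-bit value */
    }
    return s;
}
```

    For a mostly ASCII playlist, UTF-32 comes out at four times the size of UTF-8 and twice the size of UTF-16. For CJK text the ranking changes: UTF-16 needs 2 bytes per character where UTF-8 needs 3.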

    I'm just waiting for the day an instructor gives his class a text file to process in an exercise and saves the thing as UTF16BE/BOM for a course based on Windows.
    Last edited by CommonTater; 05-19-2011 at 12:36 PM.

  7. #7
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    If you add a tiny bit of "bug fix" code to this Windows API you'll have a cleaner, more robust checking mechanism.

    [Edit]
    Only offered because the code is already tied to Microsoft "Windows" API. If it wasn't for that I'd just suggest spending some more time with that code.
    [/Edit]

    Soma

    IsTextUnicode Function (Windows)
    Last edited by phantomotap; 05-19-2011 at 02:28 PM.

  8. #8
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Your code is unreadable due to poor indentation (or should I say code style?).
    And where does finally come from?
    And finally, if you would just look around, you would find a good Unicode handling library. There's even one for C++ that makes opening files a cinch!
    You don't need to reinvent the wheel every time.

  9. #9
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Quote Originally Posted by CommonTater View Post
    I really do wish they would settle down and say "From now on text is represented as 32bit values"... and get back to the "one standard" rule... This business of having to crack upwards of 20 possibilities in every program where I load a file is just ridiculous.
    Well, you know how it works. When you try to replace N technologies with a new one, you end up with N+1 technologies.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  10. #10
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Wouldn't help much anyway if you work on the glyph/shaping/layout side of things.

    Soma

  11. #11
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Elysia View Post
    Your code is unreadable due to poor indentation (or should I say code style?).
    And where does finally come from?
    And finally, if you would just look around, you would find a good Unicode handling library. There's even one for C++ that makes opening files a cinch!
    You don't need to reinvent the wheel every time.
    We've had this discussion before... you don't like the way I set up my code... I do. (Get used to it.)
    try and finally are implemented by another library I wrote which does SEH on Pelles C. (Open Source on PellesC forums)

    Why would I look for another library? I already wrote one... It's all nicely tucked away as a .lib file I can use any time I need it.

    Moreover, it's far more successful than the IsTextUnicode() call in WinApi... theirs got about half the files I was working with wrong... mine hasn't missed yet and it goes the extra bit of actually translating the text to wchar_t (UTF16LE) for me.

    I dunno, Elysia, you seem bent upon using 3rd party libraries for everything. I've never used one yet (except to mess around with). I actually enjoy the challenges of writing my own .lib files...
    Last edited by CommonTater; 05-19-2011 at 05:44 PM.

  12. #12
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by CornedBee View Post
    Well, you know how it works. When you try to replace N technologies with a new one, you end up with N+1 technologies.
    Ain't that the truth.

  13. #13
    Unregistered User Yarin's Avatar
    Join Date
    Jul 2007
    Posts
    2,158
    Quote Originally Posted by CommonTater View Post
    ... Elysia, you seem bent upon using 3rd party libraries for everything. I've never used one yet (except to mess around with). ...
    O.o I find that hard to believe

  14. #14
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Yarin View Post
    O.o I find that hard to believe
    LOL ... and how would you have me convince you?

    Really... I use Pelles C and Windows API... everything else comes directly from me.

  15. #15
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    But Windows API is a third party library: The first party includes your stuff, the second party would be ISO C stuff, and the third party is Windows stuff. "Never" makes the statement false more often than not, just like you learned in school.
