Thread: Exception Handling to continue execution

  1. #1
    Registered User
    Join Date
    Feb 2009
    Posts
    19

    Exception Handling to continue execution

    Hi,

    I am trying to parse an XML file with the libxml2 parser. While parsing the file, I get the following error:

    parser error : Input is not proper UTF-8, indicate encoding !

    Bytes: 0xC7 0x41 0x49 0x53 <Name>AMITIÉS FRAN╟AISES ANTWERPEN</Name>


    I would like to handle this error so that the code skips the offending node and proceeds to the next node for processing without exiting the application.

    Is this possible in C programming?

  2. #2
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    In C sure, if you write your own parser.

    But you specified that you are using Libxml2. So you should look at the docs for that to figure out how to do that.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  3. #3
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Maybe it is easier just to fix the file, so that it is proper XML before parsing. I'm not sure the libxml2 parser is supposed to partially parse broken XML.
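    Fixing the file starts with finding where it stops being well-formed UTF-8. The following is a minimal sketch in plain C that uses no libxml2 calls; the function name `utf8_first_invalid` is made up for illustration, and the caller is assumed to have read the file into a buffer already. It returns the offset of the first byte that cannot be part of a well-formed UTF-8 sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of libxml2): returns the offset of the first
 * byte that breaks well-formed UTF-8, or -1 if the whole buffer is fine. */
long utf8_first_invalid(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;        /* number of continuation bytes expected */
        unsigned long min;   /* smallest code point allowed at this length */
        unsigned long cp;    /* decoded code point */

        if (b < 0x80) { i++; continue; }            /* plain ASCII */
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0)     { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return (long)i;  /* stray continuation byte, 0xC0/0xC1, or above 0xF4 */

        if (len - i <= extra) return (long)i;       /* sequence cut off by end of data */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return (long)i; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;
        i += extra + 1;
    }
    return -1;
}
```

    Run over the bytes F R A N 0xC7 A I S, this reports offset 4: 0xC7 announces a two-byte sequence, but the next byte, 0x41 ('A'), is not a continuation byte. Those are the same bytes the parser error shows.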
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  4. #4
    Registered User
    Join Date
    Feb 2009
    Posts
    19
    Any suggestions on how to fix the file? I cannot make out what encoding is used in the XML file. Is there a way to convert the XML file to a standard encoding, say UTF-8, without knowing what the actual encoding is?

    This is imperative as I cannot ascertain the encoding from the XML source.

  5. #5
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Quote Originally Posted by dunxton View Post
    Any suggestions on how to fix the file? I cannot make out what encoding is used in the XML file. Is there a way to convert the XML file to a standard encoding, say UTF-8, without knowing what the actual encoding is?

    This is imperative as I cannot ascertain the encoding from the XML source.
    Have you tried opening the file in TextPad, then using Save As... and choosing the UTF-8 encoding?

    Or have you tried adding this declaration at the beginning of the XML file,

    Code:
    <?xml version="1.0" encoding="UTF-8"?>
    with the correct version and encoding?

  6. #6
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    That encoding looks like 8-bit extended ASCII (byte values 0-255), like from the old DOS days. The box-drawing bar before the "AIS" is the bad character: it is not valid UTF-8. See here: http://www.asciitable.com/
    0xC7 is decimal 199, and that matches the table at that link. (Scroll down to see the high codes.)

    edit: Try this code page: 437. http://www.microsoft.com/globaldev/r...e/oem/437.mspx
    Last edited by Dino; 02-02-2009 at 07:28 AM.
    Mainframe assembler programmer by trade. C coder when I can.

  7. #7
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    dunxton, why are you having a hard time accepting the fact that it's INVALID XML? Since it's invalid, you can treat it however you want. The best you can do is ask the "3rd party" how they are incorrectly encoding the text so that you can make a best guess as to what the original text was supposed to be. Perhaps it really is just ISO-8859-1 and they are only using the wrong encoding declaration - who knows.

    Or better yet, tell them to fix it!
    http://www.codeguru.com/forum/showthread.php?t=469816

    gg

  8. #8
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Here's an example of how easy this would be to correct with a Java program:
    http://bytes.com/groups/net-c/229845...p437-text-file

  9. #9
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Changing between code pages and Unicode is one thing (which, yes, is easier in Java). But the only way to "correct" the XML is to know how the author incorrectly encoded it. If dunxton (kmajay) is lucky, the author simply used a UTF-8 encoding declaration when they should have used ISO-8859-1.

    ISO-8859-1 code page in Windows (28591): http://www.microsoft.com/globaldev/r...iso/28591.mspx
    Latin Capital Letter C With Cedilla in UTF-8 and other code pages: http://www.tachyonsoft.com/uc0000.htm#U00C7

    gg

  10. #10
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Why does the XML have to be corrected? Why wouldn't this work?
    Code:
    <?xml version="1.0" encoding="cp437"?>
    (I guess in answer to my own question "why might it not work", might be because the parser does not recognize that encoding.)

  11. #11
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Oooooooh, I get it now. That is supposed to be a Ç (C with cedilla). cp437 is not what you want; ISO-8859-1, as Codeplug suggested, probably is. So, change the header to read
    Code:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    which is a pretty standard code page, and the parser should have no issue with it.

    Without encoding="..." specified (and without a byte-order mark), libxml2 assumes UTF-8, which is the default the XML specification prescribes.
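    If editing the declaration is not an option, the same guess (every byte in the file is ISO-8859-1) can be applied by converting the bytes to UTF-8 before handing them to the parser. This is only a sketch in plain C: the function name `latin1_to_utf8` is hypothetical, and it relies on the fact that ISO-8859-1 byte values equal the Unicode code points U+0000 to U+00FF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: re-encodes ISO-8859-1 bytes as UTF-8.
 * Returns the number of bytes written to `out`, or (size_t)-1 if `out_cap`
 * is too small. The output never needs more than 2 * in_len bytes. */
size_t latin1_to_utf8(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            /* ASCII is identical in both encodings. */
            if (o + 1 > out_cap) return (size_t)-1;
            out[o++] = b;
        } else {
            /* U+0080..U+00FF become a two-byte sequence: 110xxxxx 10xxxxxx. */
            if (o + 2 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (b >> 6));
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return o;
}
```

    With this, the lone byte 0xC7 becomes the pair 0xC3 0x87, the UTF-8 form of Ç. One caveat: if some characters in the file are already stored as UTF-8, this blanket conversion would encode them a second time and garble them, so it only fits a file that is ISO-8859-1 throughout.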

  12. #12
    Registered User
    Join Date
    Feb 2009
    Posts
    19
    Thanks guys for the help.

    Codeplug, I fully acknowledge that it is invalid XML. But I am in no position to go back to the vendor of the XML file for correction, so I was looking for alternative options. But then, as you said, there is probably no option other than to guess the encoding and try.

    But what I don't understand is:

    The input string is "AMITIÉS FRANÇAISES ANTWERPEN". This string has two special characters, É and Ç, and each is stored as 2 bytes in UTF-8 and as 1 byte in ISO-8859-1. However, when I parse it, the error occurs only at Ç and not at É. If the parser fails to parse Ç as UTF-8, then it should fail to parse É as well. Why is this not happening?
    Last edited by dunxton; 02-03-2009 at 06:12 AM.

  13. #13
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Sounds like a question for the author of the libxml2 library.

  14. #14
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> But I am in no position to go back to the Vendor of the XML file for correction.
    What would stop you? Do they not exist anymore?

    >> no other option other than guess
    Yep. Or you could ask the vendor what they're doing. If that truly isn't an option, then with a few more data points you could at least make an educated guess.

    >> when I parse it, the error is happening only with Ç and not with É
    Open up the file in a hex editor. What byte(s) do you see for the É character? Here are the possibilities: http://www.tachyonsoft.com/uc0000.htm#U00C9
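    If no hex editor is at hand, a few lines of C give the same view. The sketch below (the function name `show_high_bytes` is made up for illustration) copies ASCII through unchanged and prints every byte above 0x7F as `<XX>`, so the encoding of each accented letter can be read off directly.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes a readable form of `in` to `out`, showing each
 * non-ASCII byte as <XX>. Returns the length of the resulting string, or -1
 * if `cap` is too small to hold it plus the terminating NUL. */
int show_high_bytes(const unsigned char *in, size_t len, char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            if (o + 2 > cap) return -1;       /* one character plus NUL */
            out[o++] = (char)in[i];
        } else {
            if (o + 5 > cap) return -1;       /* "<XX>" plus NUL */
            snprintf(out + o, cap - o, "<%02X>", (unsigned)in[i]);
            o += 4;
        }
    }
    if (o + 1 > cap) return -1;               /* room for NUL on empty input */
    out[o] = '\0';
    return (int)o;
}
```

    For É the candidates are `<C3><89>` (UTF-8), `<C9>` (ISO-8859-1), or `<90>` (cp437). A lone 0xC9 or 0x90 is not well-formed UTF-8 either, and É comes before Ç in the string, so a parser that only complains at Ç suggests that É is stored as `<C3><89>` and that the file mixes encodings. That is an inference from the error message, not something verified against the actual file.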

    gg
