Thread: xml file parsing in C

  1. #1
    Registered User
    Join Date
    Apr 2003
    Posts
    21

    Hi,
    Is it possible to parse an XML file in C so that I can meet these requirements:
    1) Replace all "<" and ">" signs inside the body of a tag with a space, e.g.:
    Example 1:
    Code:
     <foo> blabla < bla </foo>
    becomes
    Code:
     <foo> blabla bla </foo>
    Example 2:
    Code:
     <foo>> blablabla </foo>
    becomes
    Code:
     <foo> blablabla </foo>
    2) Remove all extra spaces at the end of every line of the XML file
    3) Replace all special characters (Unicode or hexadecimal characters) with a space


    The XML file is not well-formed if there are stray "<" and ">" signs scattered through it,
    so I do not think an XML parser would be appropriate here. (How would the parser react when it encounters a "<" that does not correspond to the beginning of a tag?)

    Do you have an idea of how I can write a program to deal with these requirements?
    The technical environment is Unix, ksh, and C (gcc).

    I am thinking of using the "sed" command instead. With it I can get rid of the extra spaces and replace the special characters, but I still do not know how to deal with the extra ">" and "<" signs.

    Thanks for your help.

  2. #2
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Do you have any more bad examples?

    For example,
    if > is preceded by anything other than a letter, delete it
    if < is followed by anything other than a letter, delete it.
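
    A minimal C sketch of that rule, for illustration only. The function name blank_stray_brackets is made up here, and two things are assumptions rather than part of the rule as posted: the stray sign is replaced by a space instead of deleted (to match requirement 1), and "</" is also accepted as a tag start, since otherwise every closing tag would lose its "<".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Blank out, in place, every '<' or '>' that does not look like part of a tag.
 * Rule from the post: '<' must be followed by a letter, '>' must be preceded
 * by a letter. Assumption (not from the post): "</" also counts as a tag start. */
void blank_stray_brackets(char *line)
{
    assert(line != NULL);
    for (size_t i = 0; line[i] != '\0'; i++) {
        unsigned char prev = (i > 0) ? (unsigned char)line[i - 1] : 0;
        unsigned char next = (unsigned char)line[i + 1]; /* may be the '\0' */

        if (line[i] == '<' && !isalpha(next) && next != '/')
            line[i] = ' ';
        else if (line[i] == '>' && !isalpha(prev))
            line[i] = ' ';
    }
}
```

    It handles both examples from the first post, but it keeps a stray sign that happens to touch a letter (as in "a>b"), and it would blank the ">" of a tag whose name ends in a digit, such as <h1>.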
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    Thanks for replying, Salem.
    Unfortunately there are many more bad examples, so your solution would not work.

    Does anyone else have an idea?

  4. #4
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Well, it's hard to come up with ideas if you don't post more examples to work with.

  5. #5
    aoeuhtns
    Join Date
    Jul 2005
    Posts
    581
    The first rule of parsing XML is, don't parse it yourself. Find an XML parsing library.

    A well-formed XML document cannot contain a literal < in its text (outside CDATA sections and comments), so your program should just throw an error on these. What if you find &lt; and &gt; in the text? Do you want to replace these with spaces?
    There are 10 types of people in this world, those who cringed when reading the beginning of this sentence and those who salivated to how superior they are for understanding something as simple as binary.

  6. #6
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    I agree.

    If it's that broken, perhaps you should be looking at what broke the file so horribly in the first place rather than trying to fix it here.

  7. #7
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    Well, this is what I have done so far: I have written a ksh script that calls
    sed several times, once for each requirement listed above:

    2) sed -e 's/ *$//' old_file > file_with_no_space
    3) sed -e 's/$E5/ /' -e 's/$F6/ /' (etc ...) file_with_no_space > file_with_no_special_characters

    1) This is where I am still looking for a way to do it. How can I replace only the extra < and > that I find
    on every line?

    I thought about an algorithm: since there is only one tag per line, each line should contain
    the character '<' exactly twice and the character '>' exactly twice.
    If I find the character '<' more than twice, I have to remove every occurrence that is not the first
    or the last one on the line.
    Same thing with the '>' character.
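
    That idea can be sketched in C roughly as follows. This is only an illustration under the stated assumption of one element per line; the function names are made up, and the extra signs are replaced by a space (as in requirement 1) rather than removed.

```c
#include <assert.h>
#include <string.h>

/* Replace every occurrence of ch between its first and last occurrence
 * on the line by a space. The first and last occurrences are kept. */
static void keep_first_and_last(char *line, char ch)
{
    char *first = strchr(line, ch);
    char *last = strrchr(line, ch);

    if (first == NULL)
        return; /* ch does not occur at all */
    for (char *p = first + 1; p < last; p++)
        if (*p == ch)
            *p = ' ';
}

/* Apply the rule to both bracket characters, modifying the line in place. */
void blank_extra_brackets(char *line)
{
    assert(line != NULL);
    keep_first_and_last(line, '<');
    keep_first_and_last(line, '>');
}
```

    Each call is a few linear scans over the line, so the cost grows in proportion to the file size. The rule goes wrong when a stray sign sits before the opening tag or after the closing tag, or when a line is missing one of its tags, because the stray sign is then taken for the first or last one.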

    How would you do that in C or with the 'sed' command? Or possibly in combination with the 'awk' and 'grep' commands?
    Also, how fast is it, knowing the file size varies from 5 mn to 300 mn ?

    thanks for your help

  8. #8
    {Jaxom,Imriel,Liam}'s Dad Kennedy
    Join Date
    Aug 2006
    Location
    Alabama
    Posts
    1,065
    Quote Originally Posted by lonbgeach
    Also, how fast is it, knowing the file size varies from 5 mn to 300 mn ?
    Okay, I'm showing my ignorance here, but how much is a "mn"? I've never heard of this before. B, KB, MB, GB, TB, DB but mn????

  9. #9
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    Quote Originally Posted by Kennedy
    Okay, I'm showing my ignorance here, but how much is a "mn"? I've never heard of this before. B, KB, MB, GB, TB, DB but mn????
    oops, I meant MB of course

  10. #10
    Registered User Tonto
    Join Date
    Jun 2005
    Location
    New York
    Posts
    1,465
    No, you meant mega nuggets.

  11. #11
    Registered User
    Join Date
    Nov 2006
    Posts
    65
    Can you use a language that has native support for regular expressions, rather than calling sed all the time? Anyway, here are some regular expressions that may help (written in Perl, but I think the regex syntax is relatively standard across languages):
    Code:
    if ($input_xml_sample =~ /\<([A-Za-z0-9_\-]{1,40})\>(.*?)\<\/\1\>/s) {
      # \1 refers to the contents matched by the 1st () group
      $tag_name = $1;                                   # contents of the 1st () group
      $tag_contents = $2;                               # contents of the 2nd () group
      $tag_contents =~ tr/\<\>/ /;                      # change < and > to spaces
      $tag_contents =~ s/\&lt\;/ /g;                    # get rid of &lt;
      $tag_contents =~ s/\&gt\;/ /g;                    # same for &gt;
      $tag_contents =~ s/\&\#[0-9]{1,5}\;/ /g;          # decimal character references, e.g. &#453;
      $tag_contents =~ s/\&\#x[0-9a-fA-F]{1,5}\;/ /g;   # hexadecimal character references, e.g. &#xA4;
      $tag_contents =~ s/^ +//;                         # strip leading spaces
      $tag_contents =~ s/ +$//;                         # strip trailing spaces
    }
    These regexes are just examples, and they assume that at least the tags are in the right format, i.e. the opening and closing tag names must match and contain only letters, digits, underscores and hyphens, with no spaces.
    Last edited by coder8137; 12-12-2006 at 09:26 PM.

  12. #12
    aoeuhtns
    Join Date
    Jul 2005
    Posts
    581
    That, of course, is broken. You're expecting tags not to have any attributes? What about CDATA sections? Tag names must be less than 41 characters long? No colons, unicode characters, in tag names?

  13. #13
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    coder8137: thanks for the effort!
    I will try to use that code.
    Do you have any suggestion on how to use it to check every line? I mean, I need to put that code inside a loop.
    Thanks for your help, much appreciated.

  14. #14
    Registered User
    Join Date
    Nov 2006
    Posts
    65
    Quote:
    That, of course, is broken. You're expecting tags not to have any attributes? What about CDATA sections? Tag names must be less than 41 characters long? No colons, unicode characters, in tag names?
    It was a regular expression example based on the data in the top post, not a full XML parser. I expected whoever wants to use it to apply common sense and adjust it to whatever their needs may be. For example, if you want attributes, an example could be...
    Code:
    $input_xml_sample =~ /\<([A-Za-z0-9_\:\-]{1,40})[^\>]{0,120}\>(.*?)\<\/\1\>/s
    Now the tag name only matches the first word (plus colons). Presumably there will then be a space and then the attributes. If you need colons, add them to the list (as done here). If you don't like 40, pick a higher number. The reason I put in limits is that he stated that his file is not proper XML to begin with. He also mentioned speed; if you omit the limits and just use *s and +s, your regex will perform a lot worse on bad XML.

    Regarding Unicode: I don't see any Unicode in the example at the top, and he mentioned speed. There is no point in introducing Unicode when it is not needed, as this will slow things down dramatically. [A-Za-z0-9] will match 62 characters. The Unicode equivalent will probably match 1,000 times more characters.

    Lastly, regarding CDATA: parsing an XML file takes more than one regex. If you want to use Unicode, the first step is probably to get all the character encodings right in whatever programming language you choose. You would then probably be best served by removing all comments from the XML. Then you might remove all CDATA. Then you might do whatever else you have to do. And then you can parse it.

  15. #15
    Registered User
    Join Date
    Nov 2006
    Posts
    65
    Quote:
    Do you have any suggestion on how to use it to check every line ? I mean, I need to put that code inside a loop.
    Depends on your file. You said that there is only one tag per line, right? So, if you use Perl, you can simply read in one line at a time and apply the regex. Add an else clause to the if statement to handle the lines that don't match. See the above regarding comments and CDATA, which need to be cleared first. If you want to use C and sed, you can again read one line at a time, but getting $1 and $2 from sed will probably be harder. To help you any more, I'd need to know what language you will use and/or see a file sample.
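
    For the C route, reading one line at a time could look something like the sketch below. It is only an illustration: the names clean_line, clean_stream and MAX_LINE are made up, lines are assumed to fit in the buffer, and "special character" is assumed to mean any non-ASCII byte, which may not match the real data. It covers requirements 2 and 3; a per-line fix for the stray brackets would be called at the marked spot.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 4096 /* hypothetical limit; longer lines would be split */

/* Clean one line in place: replace non-ASCII bytes by spaces (assumed
 * meaning of requirement 3), then cut trailing spaces and the newline
 * (requirement 2). */
void clean_line(char *line)
{
    size_t len;

    assert(line != NULL);
    for (char *p = line; *p != '\0'; p++)
        if ((unsigned char)*p >= 0x80)
            *p = ' ';

    len = strlen(line);
    while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\n'))
        len--;
    line[len] = '\0';
}

/* Copy in to out line by line, cleaning each line on the way.
 * Returns the number of lines written. */
long clean_stream(FILE *in, FILE *out)
{
    char line[MAX_LINE];
    long count = 0;

    assert(in != NULL && out != NULL);
    while (fgets(line, sizeof line, in) != NULL) {
        clean_line(line);
        /* a per-line fix for stray '<' and '>' would be called here */
        fprintf(out, "%s\n", line);
        count++;
    }
    return count;
}
```

    Calling clean_stream(stdin, stdout) from main would let the program sit in a ksh pipeline in place of the sed calls. Note that a multi-byte UTF-8 character turns into several spaces, one per byte, and that the file is processed in a single pass without being loaded into memory.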
