xml file parsing in C


  1. #16
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    The tags do not have attributes and there is no CDATA, so it's all good.
    The idea is good, but the following code does not work with ksh.

    Code:
    <bla> body </bla>
    We can also have an empty tag, like this:
    Code:
     </bla>
    How do you loop over an XML file
    and store the opening tag in $1 and the body of the tag in $2 for processing?

    Code:
     if($input_xml_sample =~ /\<([A-Za-z0-9_\-]{1,40})\>(.*?)\<\/\1\>/s ) {
      # \1 refers to contents matched within the 1st () group
      $tag_name = $1;                                                   #get contents from first () group
      $tag_contents = $2;                                              # get contents from 2nd () group
      $tag_contents =~ tr/\<\>/ /;                                    # change < and > to spaces
    }
    thanks for helping

  2. #17
    CSharpener vart
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,484
    An empty tag should look like this:
    Code:
    <bla/>
    The first 90% of a project takes 90% of the time,
    the last 10% takes the other 90% of the time.

  3. #18
    Registered User
    Join Date
    Nov 2006
    Posts
    65
    I don't think you can do this with ksh (because I don't think it lets you refer back to what was previously matched in the same pattern, i.e. \1). But since you were calling sed all the time before anyway, you could just call an awk or perl script that does everything in one go. I don't know much about awk, but I pieced together a perl script from the examples I posted earlier. A lot of Unix installations come with perl. Calling it with ./my_script.pl should be enough.
    Anyway, you didn't confirm that you only have one tag pair per line, but I assumed so based on what you said earlier. If you use it, make sure to test it on your files. The script will harshly replace extra <s and >s inside the tag contents with spaces, or drop them altogether if they are the first or last thing in the contents.

    This script doesn't print to STDOUT but straight to a file; change that if you need to. It will
    warn or die if something is completely wrong (i.e. a line where nothing matched), but not if some data is discarded because there is more than one tag pair per line.
    Code:
    #! /usr/bin/perl
    # or try /usr/local/bin/perl if above doesn't work
    use warnings;
    use strict;
    open(INPUT_FILE, '<my_xml_file') or die "Can't open input file: $!";
    open(OUTPUT_FILE, '>my_xml_output_file') or die "Can't open output file: $!";
      #the above will truncate output file, if it exists
    my ($input_xml_sample, $tag_name, $tag_contents);
    my $line_num = 1;
    
    while($input_xml_sample = <INPUT_FILE>) {
      # reading one line at a time. This assumes strictly one tag pair per line and no more!
      # Anything not part of the tag or the contents will be discarded
      # Example: "  <some_tag:abc>  Some random ><useless> text &gt; all   </some_tag:abc> "
      # Bad:     " <just_an_open_tag> on this line    "
      # Bad:     " some text <just_a:closer />   " (Only tag name will make it to file)
    
      if($input_xml_sample =~ /\<([A-Za-z0-9_\-\:]{1,40})\>(.*?)\<\/\1\>/s ) {
        # \1 refers to contents matched within the 1st () group
        # does not handle tag attributes. If tags contain anything other than
        # A-Za-z0-9_-: or are longer than 40 characters, then edit the group above
        $tag_name     = $1;                              #get contents from first () group
        $tag_contents = $2;                              # get contents from 2nd () group
        $tag_contents =~ tr/\<\>/ /;                     # change < and > to spaces
        $tag_contents =~ s/\&lt\;/ /g;                   # get rid of &lt;
        $tag_contents =~ s/\&gt\;/ /g;
        $tag_contents =~ s/\&\#[0-9]{1,5}\;/ /g;         # matches decimal specials, e.g. &#453; Change as needed!!
        $tag_contents =~ s/\&\#x[0-9a-fA-F]{1,5}\;/ /g;  # matches hexadecimal specials, e.g. &#xA4;
        $tag_contents =~ s/^ +//;                        # get rid of leading spaces
        $tag_contents =~ s/ +$//;                        # trailing spaces
        print OUTPUT_FILE "<$tag_name>$tag_contents</$tag_name>\n";
      }
      elsif($input_xml_sample  =~ /\<([A-Za-z0-9_\-\:]{1,40}) {0,10}\/\>/ ) {
        # empty tag, i.e. "<tag />"
        print OUTPUT_FILE "<$1 />\n";
      }
      else {
        if($input_xml_sample ne "\n") {
          warn("Couldn't use line $line_num at all!");
        }
      }
      $line_num++;
    }
    close(INPUT_FILE);
    close(OUTPUT_FILE);

  4. #19
    and the hat of wrongness Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,558
    > since there is only one tag / line
    See, more vital info

    How about this
    Code:
    #!/usr/bin/perl -w
    use strict;
    
    while ( <> ) {
      chomp;
      if ( /<(\w+)>(.*?)<\/\1>/ ) {
        my ($tag,$content) = ($1,$2);
        $content =~ s/[<>]//g;
        print "<$tag>$content</$tag>\n";
      } else {
        print "$_\n";
      }
    }
    
    Example session (each input line is followed by the script's output):
    $ ./foo.pl
    <foo> blabla < bla </foo>
    <foo> blabla  bla </foo>
    <foo>> blablabla </foo>
    <foo> blablabla </foo>
    <foo/>
    <foo/>
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
    I support http://www.ukip.org/ as the first necessary step to a free Europe.

  5. #20
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    coder8137 :
    Thanks very much! I am using your solution and learning Perl at the same time!
    That is great, it works.
    I modified it so that it allows tags with dots in their names, like
    Code:
     <aa.bb>
    and lone opening tags, because although there is at most one tag pair per line, you can also find lines
    with only an opening tag
    Code:
     <aa>
    and of course closing tags
    Code:
     </aa>
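    For readers following along, the modification described above might look roughly like the sketch below. It is not the poster's actual code: classify_line and $name are names made up for this illustration, and the character class is simply the one from the script in post #18 with a dot added.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tag names: the character class from the script earlier in the thread,
# with '.' added so that names like aa.bb are accepted.
my $name = qr/[A-Za-z0-9_.:\-]{1,40}/;

# Hypothetical helper: returns (kind, tag name, contents) for one line.
sub classify_line {
    my ($line) = @_;
    if (my ($tag, $body) = $line =~ /<($name)>(.*?)<\/\1>/s) {
        return ('pair', $tag, $body);       # <aa.bb>text</aa.bb>
    }
    if (my ($tag) = $line =~ /<($name) {0,10}\/>/) {
        return ('empty', $tag, '');         # <aa />
    }
    if (my ($tag) = $line =~ /<\/($name)>/) {
        return ('close', $tag, '');         # </aa>
    }
    if (my ($tag) = $line =~ /<($name)>/) {
        return ('open', $tag, '');          # <aa>
    }
    return ('other', '', '');               # XML declaration, comments, blank lines
}

my ($kind, $tag, $body) = classify_line("<Code.MOF>2</Code.MOF>\n");
print "$kind $tag $body\n";                 # prints "pair Code.MOF 2"
```

    The pair test has to come first, otherwise a complete line such as <aa>x</aa> would already be caught by the closing-tag or opening-tag test.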
    Now, how do you pass the file as an argument instead of having it hard-coded inside the script?
    Also, I want to return error codes in case a step fails. How do I do that?

    Here is a sample of the XML file I have to deal with:
    Code:
    <?xml version="1.0" encoding="ISO-8859-1"?>                                                         
    <!--Date traitement cyclades-->                                                                     
    <!--05/12/2006-->                                                                                   
    <lot>                                                                                               
    <facture>                                                                                           
    <Bloc.pointeur.ligne>                                                                               
    <Nombre.exemplaire.facture>01</Nombre.exemplaire.facture>                                           
    <Numero.client.fpl>117225</Numero.client.fpl>                                                       
    <Rang.client.fpl>02</Rang.client.fpl>                                                               
    <Numero.periode.fpl>0602</Numero.periode.fpl>                                                       
    <Numero.sequence.fpl>01</Numero.sequence.fpl>                                                       
    <Code.recto.verso>1</Code.recto.verso>                                                              
    <Code.MOF>2</Code.MOF>                                                                              
    <Code.MSE>0</Code.MSE>                                                                              
    <Numero.secteur>056                                     </Numero.secteur>                           
    <Code.CED>E</Code.CED>                                                                              
    <Critere.tri.secondaire>056                 </Critere.tri.secondaire>                               
    <Critere.tri.secondaire>11042               </Critere.tri.secondaire>                               
    <Critere.tri.secondaire>1                   </Critere.tri.secondaire>                               
    <Critere.tri.secondaire>0                   </Critere.tri.secondaire>                               
    <Critere.tri.secondaire>117225              </Critere.tri.secondaire>                               
    <Numero.logo.fpl>110</Numero.logo.fpl>                                                              
    <Structure.cle>3</Structure.cle>                                                                    
    <Cle.banalisee>110420204201</Cle.banalisee>                                                         
    <Adresse.1er.enreg>00000001</Adresse.1er.enreg>                                                     
    <Adresse.dernier.enreg>00000001</Adresse.dernier.enreg>                                             
    <Code.exploitation.fpl>98</Code.exploitation.fpl>                                                   
    <Date.exigibilite>17012007</Date.exigibilite>                                                       
    <Code.tache>1</Code.tache>                                                                          
    <Code.anomalie.F18>N</Code.anomalie.F18>                                                            
    <Modele.impression>4</Modele.impression>                                                            
    </Bloc.pointeur.ligne>                                                                              
    <Bloc.entete.cachet.logo.siret>                                                                     
    <Desig.agence>MAIRIE DE:                              </Desig.agence>                               
    <Desig.agence>BLOMAC                                  </Desig.agence>                               
    <Adresse.agence>6 RUERIE                      </Adresse.agence>                           
    <Adresse.agence>117MAC                          </Adresse.agence>                           
    <Tel.fax>                                                                                           
    <Libelle.tel.fax>   FAX 04.8.32  </Libelle.tel.fax>                                          
    <Telephone.fax>TEL5  </Telephone.fax>                                                 
    </Tel.fax>
    thanks a lot, this is a great active and helpful community !

  6. #21
    Registered User
    Join Date
    Nov 2006
    Posts
    65
    > How about this
    Nice; a lot easier to integrate with. Mine was more of a standalone script. Just note that \w doesn't match the magic colon, which I left out the first time around too, but it is quite common in real XML, as Rashakil Fol pointed out, hehe.

  7. #22
    Registered User
    Join Date
    Nov 2006
    Posts
    65
    > Now, how do you pass a file as argument instead of having it hard-coded inside the file ?
    If you use Salem's script, pass the data to STDIN. If you use mine, just change the open line to read:
    Code:
    open(INPUT_FILE, "<$ARGV[0]") or die "Can't open input file: $!";
    And then call it with:
    Code:
    ./my_script.pl some_file_name
    CAUTION: The argument is not checked before the file is opened, so this code is not designed for untrusted input. In other words, it's vulnerable to abuse. You could use a regular expression or two to validate the input, if this bothers you.
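    A minimal sketch of such a check, under the assumption that only ordinary path characters should be accepted. The helper names is_safe_name and open_input are invented for this example, and the allowed character set is a guess that would need adjusting to the real file names. The three-argument form of open is used so that the name is always treated as a literal file name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical check: accept only names made of letters, digits,
# '_', '.', '-' and '/'. Adjust the class to your own file names.
sub is_safe_name {
    my ($name) = @_;
    return (defined($name) && $name =~ /\A[A-Za-z0-9_.\/\-]+\z/) ? 1 : 0;
}

# Hypothetical wrapper: validate first, then open for reading.
sub open_input {
    my ($name) = @_;
    die "Usage: $0 input_file\n" unless is_safe_name($name);
    open(my $fh, '<', $name) or die "Can't open input file '$name': $!\n";
    return $fh;
}

# Only runs when a file name was given on the command line.
if (@ARGV) {
    my $fh = open_input($ARGV[0]);
    print while <$fh>;      # stand-in for the real per-line processing
    close($fh);
}
```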

    You can get error / return values by either capturing STDOUT or STDERR and checking it. Are you going to call from C?

  8. #23
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    I am not going to call it from C, but it will be part of a chain of batch jobs.
    That is why I need to deal with return error codes.
    How do you do that in Perl?

    Also, one requirement is to delete all extra spaces at the end of the lines, but the spaces between the tags should of course remain.
    I thought that keeping just this line would work, but it does not; it also removes the spaces between the tags:
    Code:
         $tag_contents =~ s/ +$//;                        # trailing spaces
    I removed the line
    Code:
         $tag_contents =~ s/^ +//;                        # get rid of leading spaces

    In other words :
    Code:
     <bla> Once upon a time </bla>
    should remain as is and not be modified to
    Code:
     <bla>Onceuponatime</bla>
    I just want to get rid of the extra spaces after the closing tag, on each line.
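    One way to express that requirement on its own, as a sketch rather than a change to the script above: apply the substitution to the whole line instead of to $tag_contents, so that only whitespace after the last visible character is touched. strip_trailing is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: removes spaces, tabs and the line ending from the
# end of a line; everything up to the last non-blank character is kept.
sub strip_trailing {
    my ($line) = @_;
    $line =~ s/[ \t\r\n]+\z//;
    return $line;
}

print strip_trailing(" <bla> Once upon a time </bla>      \n"), "\n";
# prints " <bla> Once upon a time </bla>"
```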

    Thanks !

  9. #24
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    ok, i got it, i removed the wrong line. hehe

    Please help me regarding the error codes.
    Thanks

  10. #25
    Registered User
    Join Date
    Nov 2006
    Posts
    65
    For an easier way to do error handling, you can use the exit() function in perl. exit(0) for success, exit(1) for failure... like C. (You'd have to replace warn()) I don't know much about ksh, but generally echo $? or similar, should return the exit value of last command on a shell.
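    As a rough sketch of that idea: count the lines that could not be used and hand the result to exit() at the end. process_lines is a hypothetical helper, the pattern is a simplified stand-in for the one in the script above, and the choice of 1 for bad lines and 2 for an unreadable file is only a convention assumed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copies usable lines from $in to $out and
# returns the exit status the script should finish with.
sub process_lines {
    my ($in, $out) = @_;
    my $bad = 0;
    while (my $line = <$in>) {
        if ($line =~ /<([\w.:\-]+)>(.*?)<\/\1>/) {
            print {$out} "<$1>$2</$1>\n";
        }
        elsif ($line =~ /\S/) {             # blank lines are not an error
            warn "Couldn't use line $. at all!\n";
            $bad++;
        }
    }
    return $bad ? 1 : 0;    # 0 = success, 1 = at least one unusable line
}

# Only runs when a file name was given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0])
        or do { warn "Can't open input file: $!\n"; exit(2) };
    exit(process_lines($in, \*STDOUT));
}
```

    Called as ./my_script.pl input.xml > output.xml, the status should then be available to the next step of the batch chain in $?.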

  11. #26
    Registered User
    Join Date
    Apr 2006
    Posts
    58

    re: xml file parsing in C

    You may also want to check out the Expat XML Parser (written in C). It may give you some ideas or you may be able to use some of the code.

    http://expat.sourceforge.net/

    Sam

  12. #27
    Registered User
    Join Date
    Apr 2003
    Posts
    21
    I got more details regarding the special characters:
    Code:
     
    <Cle.controle.TOP.2>b&#229;</Cle.controle.TOP.2>                                                         
    <Numero.organisme>&#246;3M</Numero.organisme>                                                            
    
    <Cle.controle.TOP.2>b&#230;</Cle.controle.TOP.2>                                                         
    <Numero.organisme>&#164;0O</Numero.organisme>
    I think there are basically 4 special characters:
    Code:
     &#229;  &#246; &#230; &#164;
    Which characters are these? Unicode? ASCII?
    I want to suppress them (replace each with a space).
    Is there a general way to deal with these special characters instead of listing them all in my code?

    Code:
    $tag_contents =~ s/\&#246;\&#229;\&#230;\&#164;\/ /g;
    I have not tested this line yet by the way.
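    Two notes on the above, with a small sketch. In XML, a reference like &#229; is a numeric character reference, and its number is a Unicode code point; all four values here are above 127, so none of them is ASCII. Also, the substitution as written would look for the four references directly one after another, and the final \/ escapes the closing delimiter. An alternation is one way to match any single one of them. blank_listed_refs is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Show which character each reference stands for.
for my $code (229, 246, 230, 164) {
    printf "&#%d; is U+%04X (%s)\n", $code, $code, chr($code);
}

# Hypothetical helper: blank out exactly these four references.
sub blank_listed_refs {
    my ($text) = @_;
    $text =~ s/&#(?:229|246|230|164);/ /g;
    return $text;
}

print blank_listed_refs('<Numero.organisme>&#246;3M</Numero.organisme>'), "\n";
# prints "<Numero.organisme> 3M</Numero.organisme>"
```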

    Thanks in advance for your help

  13. #28
    and the hat of wrongness Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,558
    Perhaps your special characters would be better dealt with by
    Code:
    <?xml version="1.0" encoding="UTF-8"?>
    Though that depends on what comes next in the set of things to do, and whether it can really cope with extended characters of some sort.

    > $tag_contents =~ s/\&#246;\&#229;\&#230;\&#164;\/ /g;
    A character range would catch more, say
    Code:
    $tag_contents =~ s/[\x80-\xff]/ /g;
    and is a lot less reliant on the interpretation of 'funny' characters.

    You should be able to run my script with
    ./foo.pl input.xml > output.xml
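    One thing to keep in mind with the character range: it matches raw bytes from 0x80 to 0xFF, whereas the sample lines in post #27 contain the ASCII text &#229; and so on, which such a range does not touch. A sketch that handles both forms might look like this; blank_non_ascii is a hypothetical helper, and the digit limits are arbitrary choices made here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replaces every non-ASCII character with a space,
# whether it is written as a numeric reference or as a raw byte.
sub blank_non_ascii {
    my ($text) = @_;
    # $1 = whole reference, $2 = hexadecimal digits, $3 = decimal digits.
    # References to plain ASCII (code point 127 or below) are left alone.
    $text =~ s{(&\#(?:x([0-9A-Fa-f]{1,6})|([0-9]{1,7}));)}
              { my $cp = defined($2) ? hex($2) : $3; $cp > 127 ? ' ' : $1 }ge;
    $text =~ s/[\x80-\xff]/ /g;     # raw bytes, as in the range above
    return $text;
}

print blank_non_ascii('<Cle.controle.TOP.2>b&#229;</Cle.controle.TOP.2>'), "\n";
# prints "<Cle.controle.TOP.2>b </Cle.controle.TOP.2>"
```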

  14. #29
    Registered User
    Join Date
    Oct 2004
    Posts
    151
    System: Debian Sid and FreeBSD 7.0. Both with GCC 4.3.

    Useful resources:
    comp.lang.c FAQ | C++ FQA Lite

  15. #30
    and the hat of wrongness Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,558
    The problem with the existing tools is that they're good at reading valid XML files.
    They're not so good at fixing XML files which are broken.
