xml file parsing in C

**lonbgeach** · 12-12-2006

hi,
Is it possible to parse an XML file in C so that I can fulfill these requirements:
1) Replace all "<" and ">" signs inside the body of a tag with a space, e.g.:
Example 1:

Code:

 <foo> blabla < bla </foo>

becomes

Code:

 <foo> blabla bla </foo>

Example 2:

Code:

 <foo>> blablabla </foo>

becomes

Code:

 <foo> blablabla </foo>

2) Remove all extra spaces at the end of every line of the XML file
3) Replace all special characters (Unicode or hexadecimal characters) with a space

The XML file is not well-formed, since there are stray "<" and ">" signs all over the place, so I do not think a parser would be appropriate here. (How would a parser react when it encounters a "<" that does not start a tag?)

Do you have an idea how I can write a program to deal with these requirements?
The technical environment is: Unix, KSH, and C (gcc).

I am thinking of using the "sed" command instead. With it I can get rid of the extra spaces and replace the special characters, but I still do not know how to deal with the extra ">" and "<" signs.

Thanks for your help.

**Salem** · 12-12-2006

Do you have any more bad examples?

For example,
if > is preceded by anything other than a letter, delete it
if < is followed by anything other than a letter, delete it.
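
A minimal C sketch of these two rules, for illustration only. The function name `drop_stray_brackets` is made up here, and there is one adjustment that is not in the post: a `/` is also accepted after `<`, because otherwise the `<` of every closing tag such as `</foo>` would be deleted too.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Applies the two rules in place (hypothetical helper, not from the thread):
 *   - a '>' is dropped unless the character before it is a letter
 *   - a '<' is dropped unless the character after it is a letter
 * Assumption added for this sketch: '/' is also allowed after '<',
 * so that closing tags like </foo> survive. */
void drop_stray_brackets(char *s)
{
    size_t len, r, w = 0;

    assert(s != NULL);
    len = strlen(s);

    for (r = 0; r < len; r++) {
        /* Look at the neighbours in the original text; writes never
         * run ahead of reads, so s[r - 1] and s[r + 1] are still intact. */
        unsigned char prev = (r > 0) ? (unsigned char)s[r - 1] : 0;
        unsigned char next = (unsigned char)s[r + 1];

        if (s[r] == '>' && !isalpha(prev))
            continue;                       /* stray '>' */
        if (s[r] == '<' && !isalpha(next) && next != '/')
            continue;                       /* stray '<' */

        s[w++] = s[r];
    }
    s[w] = '\0';
}
```

On the two examples from the first post this removes the stray sign and leaves the real tags alone. It is only a heuristic: text such as `a<b` still slips through, because there the `<` is followed by a letter.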

**lonbgeach** · 12-12-2006

thanks Salem for replying.
Unfortunately there are many more bad examples, so your solution would not work.

Does anyone else have an idea ?

**Salem** · 12-12-2006

Well it's hard to come up with ideas if you don't post more examples to work with.

**Rashakil Fol** · 12-12-2006

The first rule of parsing XML is, don't parse it yourself. Find an XML parsing library.

It is impossible for a well-formed XML document to contain a literal < in its text, so your program should just throw an error on these. What if you find &lt; and &gt; in the text? Do you want to replace these with spaces?

**Salem** · 12-12-2006

I agree.

If it's that broken, perhaps you should be looking at what broke the file so horribly in the first place rather than trying to fix it here.

**lonbgeach** · 12-12-2006

Well, this is what I have done so far: I have written a ksh script that calls
sed once for each requirement listed above:

2) sed -e 's/ *$//' old_file > file_with_no_space
3) sed -e 's/$E5/ /' -e 's/$F6/ /' (etc ...) file_with_no_space > file_with_no_special_characters

1) This is where I am still looking for a way to do it. How can I replace only the extra < and > that I find
on every line?

I thought about an algorithm: since there is only one tag per line, a line can contain the character '<' at most twice
and the character '>' at most twice.
If I find the character '<' more than twice, I have to remove the occurrences that are not in the first position
or the last position.
Same thing with the '>' character.

How would you do that in C or with the 'sed' command? Or possibly in conjunction with the 'awk' and 'grep' commands?
Also, how fast is it, knowing the file size varies from 5 mn to 300 mn ?

thanks for your help
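
The algorithm described in this post could be sketched in C roughly as follows. This is only an illustration under the post's own assumption (exactly one element per line, so the first and last `<` and the first and last `>` belong to the real tags). The function names `blank_inner` and `blank_extra_brackets` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Replaces every occurrence of `bracket` in `line` with a space,
 * except the first and the last one (hypothetical helper). */
static void blank_inner(char *line, char bracket)
{
    char *first = strchr(line, bracket);
    char *last = strrchr(line, bracket);
    char *p;

    if (first == NULL)
        return;                 /* no such bracket on this line */

    /* Only positions strictly between the first and last occurrence
     * are touched; if there are one or two occurrences, nothing changes. */
    for (p = first + 1; p < last; p++) {
        if (*p == bracket)
            *p = ' ';
    }
}

/* Keeps the outermost '<' and '>' pairs and blanks everything in between.
 * Assumes one element per line, as stated in the post. */
void blank_extra_brackets(char *line)
{
    assert(line != NULL);
    blank_inner(line, '<');
    blank_inner(line, '>');
}
```

Each line is scanned a fixed number of times, so the cost grows linearly with the file size. The sketch breaks down as soon as the assumption does, for example when a stray `<` appears before the opening tag or a stray `>` after the closing tag.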

**Kennedy** · 12-12-2006

Originally Posted by lonbgeach

Also, how fast is it, knowing the file size varies from 5 mn to 300 mn ?

Okay, I'm showing my ignorance here, but how much is a "mn"? I've never heard of this before. B, KB, MB, GB, TB, DB but mn????

**lonbgeach** · 12-12-2006

Originally Posted by Kennedy

Okay, I'm showing my ignorance here, but how much is a "mn"? I've never heard of this before. B, KB, MB, GB, TB, DB but mn????

oops, I meant MB of course

**Tonto** · 12-12-2006

No, you meant mega nuggets.

**coder8137** · 12-12-2006

Can you use a language which has native support for regular expressions, rather than calling sed all the time? Anyway, here are some regular expressions that may help (written in Perl, but I think the regex syntax is relatively standard across languages):

Code:

if ($input_xml_sample =~ /\<([A-Za-z0-9_\-]{1,40})\>(.*?)\<\/\1\>/s) {
  # \1 refers to the contents matched by the 1st () group
  $tag_name = $1;                                   # get contents of the 1st () group
  $tag_contents = $2;                               # get contents of the 2nd () group
  $tag_contents =~ tr/\<\>/ /;                      # change < and > to spaces
  $tag_contents =~ s/\&lt\;/ /g;                    # get rid of &lt; (repeat for &gt;)
  $tag_contents =~ s/\&\#[0-9]{1,5}\;/ /g;          # matches decimal specials, e.g. &#453;
  $tag_contents =~ s/\&\#x[0-9a-fA-F]{1,5}\;/ /g;   # matches hexadecimal specials, e.g. &#xA4;
  $tag_contents =~ s/^ +//;                         # get rid of leading spaces
  $tag_contents =~ s/ +$//;                         # get rid of trailing spaces
}

These regexes are just examples, and they assume that at least the tags are in the right format, i.e. the opening and closing tag names must match and may only contain alphanumerics, underscores and hyphens, with no spaces.
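
Since the original question asked about C, here is a rough C counterpart to the entity-related substitutions above, written without a regex library. It is a sketch, not a port of the Perl: the names `char_ref_len` and `blank_char_refs` are invented for this example, and it only recognises `&lt;`, `&gt;`, and numeric references with 1 to 5 digits, mirroring the limits used in the regexes.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns the length of the reference starting at s, or 0 if s does not
 * start with &lt; &gt; &#DDDDD; or &#xHHHHH; (1 to 5 digits). */
static size_t char_ref_len(const char *s)
{
    size_t i = 2, digits = 0;
    int hex = 0;

    if (strncmp(s, "&lt;", 4) == 0 || strncmp(s, "&gt;", 4) == 0)
        return 4;
    if (strncmp(s, "&#", 2) != 0)
        return 0;
    if (s[i] == 'x') {
        hex = 1;
        i++;
    }
    while (digits < 5 &&
           (hex ? isxdigit((unsigned char)s[i])
                : isdigit((unsigned char)s[i]))) {
        i++;
        digits++;
    }
    if (digits == 0 || s[i] != ';')
        return 0;               /* not a complete reference */
    return i + 1;               /* include the ';' */
}

/* Replaces each recognised reference in s with a single space, in place. */
void blank_char_refs(char *s)
{
    size_t r = 0, w = 0;

    assert(s != NULL);
    while (s[r] != '\0') {
        size_t n = char_ref_len(s + r);
        if (n > 0) {
            s[w++] = ' ';       /* the whole reference becomes one space */
            r += n;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
}
```

Other named entities such as `&amp;` are deliberately left alone, as are references with more than five digits.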

**Rashakil Fol** · 12-12-2006

That, of course, is broken. You're expecting tags not to have any attributes? What about CDATA sections? Tag names must be less than 41 characters long? No colons, unicode characters, in tag names?

**lonbgeach** · 12-13-2006

coder8137 : thanks for the effort !
I will try to use that code.
Do you have any suggestion on how to use it to check every line ? I mean, I need to put that code inside a loop.
Thanks for your help, much appreciated.

**coder8137** · 12-13-2006

Originally Posted by Rashakil Fol

That, of course, is broken. You're expecting tags not to have any attributes? What about CDATA sections? Tag names must be less than 41 characters long? No colons, unicode characters, in tag names?

It was a regular expression example based on the data in the top post, not a full XML parser. I expected whoever wants to use it to apply common sense and adjust it to whatever their needs may be. For example, if you want attributes, an example could be...

Code:

$input_xml_sample =~ /\<([A-Za-z0-9_\:\-]{1,40})[^\>]{0,120}\>(.*?)\<\/\1\>/s

Now the tag name only matches the 1st word (plus colons). Presumably, there will then be a space and then the attributes. If you need other characters in tag names, add them to the list. If you don't like 40, pick a higher number. The reason I put in limits is that he stated that his file is not proper XML to begin with. He mentioned speed also; if you omit the limits and just use *s and +s, your regex will perform a lot worse on bad XML.

Regarding Unicode: I don't see any Unicode in the example at the top, but he mentioned speed. There is no point in introducing Unicode when it is not needed, as this will slow things down dramatically. [A-Za-z0-9] will match 62 characters. The Unicode equivalent will probably match 1,000 times more characters.

Lastly, regarding CDATA: parsing an XML file takes more than one regex. If you want to use Unicode, the 1st step is probably to get all the character encodings right in whatever programming language you choose. You would then probably be served best by removing all comments from the XML. Then you might remove all CDATA. Then you might do whatever else you have to do. And then you can parse it.

**coder8137** · 12-13-2006

Originally Posted by lonbgeach

Do you have any suggestion on how to use it to check every line ? I mean, I need to put that code inside a loop.

Depends on your file. You said that there is only one tag per line, right? So, if you use perl, you can simply read in one line at a time and use the regex. Add an else clause to the if statement, to work out when it doesn't match. Read the above regarding Comments and CDATA, which need to be cleared 1st. If you want to use C and sed, you can again read one line, but getting $1 and $2 from sed will probably be harder. To help you any more, I'd need to know what language you will use and/or a file sample.
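
For the C route, a line-at-a-time loop with an "else" branch for lines that do not match might look like the sketch below. Everything named here is an assumption made for illustration: `clean_stream`, `looks_like_element`, the `MAX_LINE` limit, and the very loose test for "one element per line" (the line starts with `<` and ends with `>` once surrounding blanks are ignored). It also strips trailing spaces, tabs and carriage returns, which covers requirement 2 from the first post.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 8192   /* assumed upper bound on the length of one line */

/* Loose check: after leading blanks, the line starts with '<' and ends with '>'.
 * Expects trailing whitespace to have been removed already. */
static int looks_like_element(const char *line)
{
    size_t len;

    line += strspn(line, " \t");
    len = strlen(line);
    return len >= 2 && line[0] == '<' && line[len - 1] == '>';
}

/* Copies `in` to `out` one line at a time with trailing whitespace removed.
 * Lines that fail the check are reported to `log` (may be NULL) and copied
 * unchanged otherwise. Returns the number of such lines. */
long clean_stream(FILE *in, FILE *out, FILE *log)
{
    char buf[MAX_LINE];
    long lineno = 0, rejected = 0;

    assert(in != NULL && out != NULL);

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);

        lineno++;
        /* Requirement 2: drop the newline and any trailing blanks. */
        while (len > 0 && strchr(" \t\r\n", buf[len - 1]) != NULL)
            buf[--len] = '\0';

        if (looks_like_element(buf)) {
            /* Per-line fix-ups for the tag body would be called here. */
        } else {
            rejected++;
            if (log != NULL)
                fprintf(log, "line %ld: not a single element\n", lineno);
        }
        fputs(buf, out);
        fputc('\n', out);
    }
    return rejected;
}
```

A call such as `clean_stream(stdin, stdout, stderr)` turns this into a filter that can sit in the existing ksh script. The file is read once and only one line is held in memory at a time. Lines longer than `MAX_LINE` would be split by `fgets`, so the limit has to be raised or the reading code changed if such lines can occur.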
