Can you use a language that has native support for regular expressions, rather than calling sed all the time? Anyway, here are some regular expressions that may help (written in Perl, but I think the regex syntax is relatively standard across languages):
Code:
if($input_xml_sample =~ /\<([A-Za-z0-9_\-]{1,40})\>(.*?)\<\/\1\>/s ) {
# \1 refers to contents matched within the 1st () group
$tag_name = $1; #get contents from first () group
$tag_contents = $2; # get contents from 2nd () group
$tag_contents =~ tr/\<\>/ /; # change < and > to spaces
$tag_contents =~ s/\&lt\;/ /g; # get rid of &lt; (repeat for &gt;)
$tag_contents =~ s/\&\#[0-9]{1,5}\;/ /g; # matches decimal specials, e.g. &#453;
$tag_contents =~ s/\&\#x[0-9a-fA-F]{1,5}\;/ /g; # matches hexadecimal specials, e.g. &#xA4;
$tag_contents =~ s/^ +//; # get rid of leading spaces
$tag_contents =~ s/ +$//; # get rid of trailing spaces
}
These regexes are only examples, and they assume that the tags are at least well formed: the opening and closing tag names must match, and a name may contain only alphanumerics, underscores and hyphens, with no spaces.
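To show how that assumption plays out, here is a rough sketch that applies the same patterns to every tag in a string instead of just the first one. The sub name extract_tags and the sample string are made up for illustration, and it combines the &lt; and &gt; substitutions into one pattern. Treat it as a starting point, not a real XML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the snippet above): returns a list of
# [tag_name, cleaned_contents] pairs, one for each <tag>...</tag> element
# whose opening and closing names match.
sub extract_tags {
    my ($xml) = @_;
    my @pairs;
    # /g keeps scanning from the end of the previous match; /s lets . span newlines
    while ($xml =~ /<([A-Za-z0-9_\-]{1,40})>(.*?)<\/\1>/sg) {
        my ($name, $contents) = ($1, $2);   # copy before running more regexes
        $contents =~ tr/<>/ /;                     # stray < and > become spaces
        $contents =~ s/&(?:lt|gt);/ /g;            # &lt; and &gt; entities
        $contents =~ s/&#[0-9]{1,5};/ /g;          # decimal character references
        $contents =~ s/&#x[0-9a-fA-F]{1,5};/ /g;   # hexadecimal character references
        $contents =~ s/^ +//;                      # leading spaces
        $contents =~ s/ +$//;                      # trailing spaces
        push @pairs, [$name, $contents];
    }
    return @pairs;
}

# Made-up sample: <a>x</b> has mismatched names, so it should be skipped.
my $sample = '<title> Hello&lt;world&gt; &#453; </title><a>x</b><note>ok</note>';
for my $pair (extract_tags($sample)) {
    print "$pair->[0]: '$pair->[1]'\n";
}
```

With that sample, only title and note come back; the mismatched <a>x</b> pair is silently ignored, so you would need extra checks if you want to report malformed input. Tags with attributes (e.g. <a href="...">) are not matched either, because the name pattern does not allow spaces.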