I don't think you can do this with ksh (because I don't think it allows you to refer to what previously matched in the same pattern, i.e. \1). But since you were calling sed all the time before anyway, you could just call an awk or perl script that does everything in one go. I don't know much about awk, but I pieced together a perl script from the examples I posted earlier. A lot of unix installations come with perl. Calling it with ./my_script.pl should be enough, once the file is executable (chmod +x my_script.pl).
Anyways, you didn't confirm that you only have 1 tag pair per line, but I assumed this is so based on what you said earlier. If you use it, make sure to test it on your files. The script will harshly replace extra <s and >s with spaces, or remove them altogether if they are the first or last thing in the tag contents.
This script doesn't print to STDOUT, but straight to a file... change that if you need to. It will warn if something is completely wrong (i.e. a line where nothing matched) and die if a file can't be opened, but it will not complain if some data is discarded because there is more than 1 tag pair per line.
Code:
#! /usr/bin/perl
# or try /usr/local/bin/perl if above doesn't work
use warnings;
use strict;
open(INPUT_FILE, '<my_xml_file') or die "Can't open input file: $!";
open(OUTPUT_FILE, '>my_xml_output_file') or die "Can't open output file: $!";
#the above will truncate output file, if it exists
my ($input_xml_sample, $tag_name, $tag_contents);
my $line_num = 1;
while ($input_xml_sample = <INPUT_FILE>) {
    # Reading one line at a time. This assumes strictly one tag pair per line and no more!
    # Anything that is not part of the tag or its contents will be discarded.
    # Example: " <some_tag:abc> Some random ><useless> text > all </some_tag:abc> "
    # Bad: " <just_an_open_tag> on this line "
    # Bad: " some text <just_a:closer /> " (only the tag name will make it to the file)
    if ($input_xml_sample =~ /\<([A-Za-z0-9_\-\:]{1,40})\>(.*?)\<\/\1\>/s) {
        # \1 refers to whatever was matched by the 1st () group.
        # Does not handle tag attributes. If your tags contain anything other than
        # A-Za-z0-9_-: or are longer than 40 characters, then edit the group above.
        $tag_name = $1;     # get contents from 1st () group
        $tag_contents = $2; # get contents from 2nd () group
        $tag_contents =~ tr/\<\>/ /;   # change < and > to spaces
        $tag_contents =~ s/\&lt\;/ /g; # get rid of &lt;
        $tag_contents =~ s/\&gt\;/ /g; # get rid of &gt;
        $tag_contents =~ s/\&\#[0-9]{1,5}\;/ /g;        # matches decimal specials, i.e. &#453; Change as needed!!
        $tag_contents =~ s/\&\#x[0-9a-fA-F]{1,5}\;/ /g; # matches hexadecimal specials, i.e. &#xA4;
        $tag_contents =~ s/^ +//; # get rid of leading spaces
        $tag_contents =~ s/ +$//; # get rid of trailing spaces
        print OUTPUT_FILE "<$tag_name>$tag_contents</$tag_name>\n";
    }
    elsif ($input_xml_sample =~ /\<([A-Za-z0-9_\-\:]{1,40}) {0,10}\/\>/) {
        # empty tag, i.e. "<tag />"
        print OUTPUT_FILE "<$1 />\n";
    }
    else {
        # nothing matched: warn unless the line is simply empty
        if ($input_xml_sample ne "\n") {
            warn("Couldn't use line $line_num at all!");
        }
    }
    $line_num++;
}
close(INPUT_FILE);
close(OUTPUT_FILE);
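
If you want to try the matching logic on a few sample strings before pointing the script at real files, something like the following should do. It's only a sketch: clean_line is a name I made up for illustration (it is not in the script above), the &lt; and &gt; substitutions are folded into one regex, and it prints to STDOUT instead of writing a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# clean_line is a hypothetical helper, not part of the original script.
# Returns the cleaned-up tag pair (or empty tag), or undef if nothing matched.
sub clean_line {
    my ($line) = @_;
    if ($line =~ /<([A-Za-z0-9_\-:]{1,40})>(.*?)<\/\1>/s) {
        my ($tag_name, $tag_contents) = ($1, $2);
        $tag_contents =~ tr/<>/ /;                    # stray < and > become spaces
        $tag_contents =~ s/&(?:lt|gt);/ /g;           # so do &lt; and &gt;
        $tag_contents =~ s/&#[0-9]{1,5};/ /g;         # decimal character references
        $tag_contents =~ s/&#x[0-9a-fA-F]{1,5};/ /g;  # hexadecimal character references
        $tag_contents =~ s/^ +//;                     # leading spaces
        $tag_contents =~ s/ +$//;                     # trailing spaces
        return "<$tag_name>$tag_contents</$tag_name>";
    }
    if ($line =~ /<([A-Za-z0-9_\-:]{1,40}) {0,10}\/>/) {
        return "<$1 />";                              # empty tag, i.e. "<tag />"
    }
    return;                                           # nothing usable on this line
}

# The three sample lines from the comments in the script above
for my $sample (
    ' <some_tag:abc> Some random ><useless> text > all </some_tag:abc> ',
    ' some text <just_a:closer /> ',
    ' <just_an_open_tag> on this line ',
) {
    my $result = clean_line($sample);
    print defined $result ? "$result\n" : "(no match)\n";
}
```

The first sample should come back as a single some_tag:abc pair with the stray brackets turned into spaces, the second as the bare empty tag, and the third as "(no match)", since it has an opening tag with no closer.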