You could try an HTML parser. They're good at broken stuff.
The XML file claims to be iso-8859-1, so you cannot just use it as if it were UTF-8. In iso-8859-1 every character is exactly one byte. In UTF-8, non-ASCII characters take two or more bytes; letters with accent marks / diacritics, like the ones you posted, take two. So in iso-8859-1, if you know the hex value of the character you are looking for, you can always easily match it, as long as you don't play around with the encoding.

Quote:
I think there are basically 4 special characters
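To make the one-byte versus two-byte point above concrete, here is a rough sketch. The sample string and variable names are made up for illustration (they are not from the original file); it only assumes the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "cafe" with e-acute, as raw iso-8859-1 bytes.
# In iso-8859-1 the e-acute is the single byte 0xE9.
my $latin1_bytes = "caf\xE9";

# The same text as UTF-8 bytes: e-acute becomes the two bytes 0xC3 0xA9.
my $utf8_bytes = encode('UTF-8', decode('iso-8859-1', $latin1_bytes));

printf "iso-8859-1: %d bytes\n", length $latin1_bytes;
printf "UTF-8:      %d bytes\n", length $utf8_bytes;

# On the untouched iso-8859-1 bytes, a hex escape matches the character directly.
print "found 0xE9 in the iso-8859-1 data\n" if $latin1_bytes =~ /\xE9/;

# The same pattern finds nothing in the UTF-8 bytes, where the character
# is stored as 0xC3 0xA9 instead.
print "no 0xE9 in the UTF-8 data\n" unless $utf8_bytes =~ /\xE9/;
```

This is why matching by hex value is only reliable as long as the data stays in the encoding you think it is in.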
You can try what Salem posted, but I don't think it will work very well, because as far as I know the dash ( - ) is only supposed to be used for simple sequences (a-g, for example). Here are some more things you can try:

Code:
$tag_contents =~ tr/A-Za-z0-9\.\_ / /c;
# This replaces every character except those listed with a space! Add to the list
# anything you need to keep.

Or, to do things more correctly, you can use decode/encode from the Encode module ( http://perldoc.perl.org/Encode.html ). First decode the data from iso-8859-1. Then you should be able to match all your special characters by their Unicode code points if you like. Also, if you decode the data from iso-8859-1 and then encode it as ASCII, anything non-ASCII should automatically be replaced with "?". Instead of calling decode explicitly, you can try just changing the calls to open so that PerlIO does the decoding for you, for example with an :encoding(iso-8859-1) layer.
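Here is a minimal sketch of the Encode route and of the PerlIO alternative. The sample text, the variable names and the throwaway temp file are all assumptions for illustration; only core modules (Encode, File::Temp) are used, and your real code would read the actual XML file instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical sample: "naive cafe" with i-diaeresis (0xEF) and e-acute (0xE9),
# as raw iso-8859-1 bytes.
my $raw = "na\xEFve caf\xE9";

# Step 1: decode the bytes into a Perl character string.
my $text = decode('iso-8859-1', $raw);

# Step 2: match the special characters by their Unicode code points.
my $count = () = $text =~ /[\x{E9}\x{EF}]/g;
print "special characters found: $count\n";

# Step 3: encode as ASCII. With Encode's default handling, characters that
# ASCII cannot represent are replaced with "?".
my $ascii = encode('ascii', $text);
print "$ascii\n";

# Alternative: let PerlIO decode while reading. The file below is a temp
# file created only for this sketch.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $raw;
close $out;

open my $in, '<:encoding(iso-8859-1)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;

# The layer should hand back the same characters as the explicit decode.
print "PerlIO result matches explicit decode\n" if $line eq $text;
```

Whether you decode explicitly or through the open layer is mostly a matter of taste; the layer just saves you from remembering to decode every string you read.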