I'm working with a UTF-8 XML file and want to extract the characters between the tags:
Code:
<tag>This is a line of content.</tag>
^ Extract: This is a line of content.
I also want to extract characters between other delimiters within such text:
Code:
This is [a line] of content.
^ Extract: a line
I don't intend to use an XML parser, because this seems trivial to implement. However, I'm concerned about how character comparisons work with UTF-8 strings.
From what I understand, a UTF-8 character/symbol can occupy more than one char. So if I were to slice up the string at a particular delimiter, do I run the risk of that delimiter being identical to the second (or a later) byte of an exotic UTF-8 character?
Code:
// the following string would not be a literal, defined inline, but come from an external file.
std::string s = "Here is a ♥, but I only want [this piece] within the text";
std::string::size_type position = s.find("["); // find the opening delimiter
Now the ♥ (U+2665) is itself not ASCII and takes more than one byte in UTF-8. For such a character, or e.g. a Sanskrit letter, is there a possibility that its second byte is equal to [, causing a false positive? Or is UTF-8 designed in a way that excludes this from ever happening?
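To check this myself rather than guess, a minimal sketch along the following lines should work. The helpers contains_byte and all_bytes_non_ascii are names I made up for illustration, and the two test characters are written as explicit UTF-8 byte sequences (E2 99 A5 for ♥, E0 A4 95 for the Devanagari letter क) so the result does not depend on how the source file is saved.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if `byte` occurs anywhere in the raw bytes of `utf8`.
bool contains_byte(const std::string& utf8, char byte) {
    return utf8.find(byte) != std::string::npos;
}

// Hypothetical helper: true if every byte of `utf8` has its high bit set,
// i.e. none of them falls in the ASCII range 0x00-0x7F.
bool all_bytes_non_ascii(const std::string& utf8) {
    for (unsigned char b : utf8) {
        if (b < 0x80) return false;
    }
    return true;
}

// U+2665 (heart) and U+0915 (Devanagari letter KA) as explicit UTF-8 bytes.
const std::string heart = "\xE2\x99\xA5";
const std::string devanagari_ka = "\xE0\xA4\x95";

// Usage sketch: these checks hold if no byte of either character could be
// mistaken for an ASCII delimiter such as '['.
void check_examples() {
    assert(all_bytes_non_ascii(heart));
    assert(all_bytes_non_ascii(devanagari_ka));
    assert(!contains_byte(heart, '['));
}
```

This only tests two sample characters, so it illustrates the concern without proving anything about UTF-8 in general.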
Additionally, I'd like to work with raw chars instead of std::string functions, because I think that would fit my matching algorithm better. Since std::string is basically an array of chars, UTF-8 character comparison should behave the same for both, correct?
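For reference, this is roughly the raw-char version I have in mind. It is only a sketch: extract_between is a hypothetical helper of my own, it assumes a NUL-terminated buffer and single-byte ASCII delimiters, and it compares bytes rather than decoded characters.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: copies the bytes between the first `open` and the next
// `close` in the NUL-terminated buffer `text`. Returns an empty string if
// either delimiter is missing. Assumes both delimiters are non-NUL ASCII chars.
std::string extract_between(const char* text, char open, char close) {
    assert(text != nullptr);
    const char* start = std::strchr(text, open);
    if (start == nullptr) return std::string();
    ++start;  // skip the opening delimiter itself
    const char* end = std::strchr(start, close);
    if (end == nullptr) return std::string();
    return std::string(start, end);  // copies the byte range [start, end)
}
```

I would call it as extract_between(buffer, '[', ']'), or with other delimiter pairs for the other cases.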