Thread: UTF-8 and string/char functions

  1. #1
    Guest
    Guest

    UTF-8 and string/char functions

    I'm working with a UTF-8 XML file and want to extract the characters between the tags:
    Code:
    <tag>This is a line of content.</tag>
    ^ Extract: This is a line of content.

    I also want to extract characters between other delimiters within such text:
    Code:
    This is [a line] of content.
    ^ Extract: a line

    I don't intend to use an XML parser, because this seems trivial to implement. However I'm concerned about how character comparisons work with UTF-8 strings.

    From what I understand, UTF-8 characters/symbols can occupy more than one char. So if I were to slice up the string based on a particular delimiter, do I run the risk of said delimiter looking the same as the second byte of an exotic UTF-8 character?
    Code:
    // the following string would not be a literal, defined inline, but come from an external file.
    std::string s = "Here is a ♥, but I only want [this piece] within the text";
    std::string::size_type position = s.find("["); // find start of delimiter
    Now the ♥ is probably ASCII, but if it were e.g. a Sanskrit letter, is there a possibility that its second byte is equal to [, causing a false positive? Or is UTF-8 designed in a way that excludes this from ever happening?

    Additionally, I'd like to work with raw chars instead of std::string functions because I think it would fit my matching algorithm better. Since std::string is basically an array of chars, UTF-8 character comparison should behave the same way in both, correct?

  2. #2
    Registered User
    Join Date
    Dec 2011
    Posts
    26
    Quote Originally Posted by Guest
    do I run the risk of said delimiter looking the same as the second byte of an exotic UTF-8 character?
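    No. Continuation bytes are distinct from one-byte codes: every byte of a multi-byte UTF-8 sequence has its high bit set (0x80 or above), while [ and every other ASCII character is a single byte below 0x80.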
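    To make the byte ranges concrete, here is a minimal sketch of a byte classifier. The names Utf8Byte and classify_utf8_byte are hypothetical, made up for illustration; the ranges themselves follow the UTF-8 encoding rules. The heart from the first post (U+2665) is not ASCII; it encodes as the three bytes E2 99 A5, none of which can compare equal to '['.

```cpp
#include <string>

// Hypothetical names, for illustration only.
enum class Utf8Byte { Ascii, Lead, Continuation, Invalid };

// Classifies a single byte of a UTF-8 stream by its value range.
Utf8Byte classify_utf8_byte(unsigned char b) {
    if (b < 0x80) return Utf8Byte::Ascii;               // 0xxxxxxx: one-byte code
    if (b < 0xC0) return Utf8Byte::Continuation;        // 10xxxxxx: 2nd, 3rd or 4th byte
    if (b >= 0xC2 && b <= 0xF4) return Utf8Byte::Lead;  // first byte of a 2- to 4-byte sequence
    return Utf8Byte::Invalid;                           // 0xC0, 0xC1, 0xF5..0xFF never occur in valid UTF-8
}

// True if any byte in s that belongs to a multi-byte sequence equals c.
// For an ASCII c this cannot happen, which is the point of the sketch.
bool ascii_hidden_in_multibyte(const std::string& s, char c) {
    for (char ch : s) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (classify_utf8_byte(b) != Utf8Byte::Ascii && ch == c) return true;
    }
    return false;
}
```

    Since '[' is 0x5B, a byte-wise search for it can only ever land on a real bracket.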

  3. #3
    Guest
    Guest
    I see. So, to be sure I don't misunderstand: would the following code snippet I wrote be safe from false positives?
    Code:
    const char* utf_string = u8"<tag>୭ୗୱ୯୯ମଝ[૱ଅଈଆଞ]ଫ୩ਸ਼ਨ</tag>"; // u8 yields char data through C++17; C++20 changed it to char8_t
    const std::size_t string_length = std::char_traits<char>::length(utf_string);
    std::string result; // collects the bytes between brackets (needs <string>)
    bool in_brackets = false; // are we within brackets?
    
    for(std::size_t i = 0; i != string_length; ++i) {
        if(!in_brackets && utf_string[i] == '[') { // we open
            in_brackets = true;
        } else if(in_brackets && utf_string[i] == ']') { // we close
            in_brackets = false;
        } else if(in_brackets) { // we're in brackets, we add
            result += utf_string[i];
        }
    }
    When I run it, result indeed holds the chars within the brackets. But are you sure that there are no 2-, 3- or 4-byte UTF-8 symbols, any byte of which compares equal to '['?

    Thanks.
    Last edited by Guest; 03-07-2014 at 01:46 PM.

  4. #4
    Registered User
    Join Date
    Dec 2011
    Posts
    26
    Quote Originally Posted by Guest
    are you sure?
    Yes, I am.
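    For comparison, the same extraction can be sketched with std::string::find instead of a hand-written loop. The helper name extract_between is hypothetical, and the sketch assumes the delimiters are ASCII characters and that only the first bracketed span is wanted.

```cpp
#include <string>

// Hypothetical helper: returns the bytes between the first `open` and the
// next `close` after it, or an empty string if either delimiter is missing.
// Assumes both delimiters are ASCII, so a byte-wise search cannot land
// inside a multi-byte UTF-8 sequence.
std::string extract_between(const std::string& text, char open, char close) {
    const std::string::size_type start = text.find(open);
    if (start == std::string::npos) return {};
    const std::string::size_type end = text.find(close, start + 1);
    if (end == std::string::npos) return {};
    return text.substr(start + 1, end - start - 1);
}
```

    Because the cut points fall directly before and after ASCII bytes, the returned substring always begins and ends on a character boundary, provided the input was valid UTF-8.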

  5. #5
    Guest
    Guest
    Ok, thanks! This keeps things a lot simpler than I had worried they might be.
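    On the earlier question about raw chars versus std::string: both operate on the same bytes, so a search for an ASCII delimiter gives the same offset either way. A minimal sketch, with the hypothetical helper name find_in_raw and the assumption that the buffer is NUL-terminated:

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: byte offset of the first `c` in a NUL-terminated
// buffer, or std::string::npos if it does not occur.
std::string::size_type find_in_raw(const char* text, char c) {
    const char* hit = std::strchr(text, c);
    return hit ? static_cast<std::string::size_type>(hit - text)
               : std::string::npos;
}
```

    One thing to watch with raw chars: where char is signed, bytes of 0x80 and above hold negative values, so compare them against char literals or cast to unsigned char before doing range checks.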
