Thread: How to extract variables from between html tags?

  1. #1
    Registered User
    Join Date
    Oct 2009
    Posts
    4

    How to extract variables from between html tags?

    Suppose something like this:

    Code:
    <span class="gugugug">1</span>
    
    <span class="krakaka">enchanted</span><span class="gagaga">chocolate bar</span>
    
    <span class="gugugug">2</span>
    
    <span class="krakaka">very remarkable</span><span class="gagaga">flavored cookies</span>
    
    <span class="gugugug">3</span>
    
    <span class="krakaka">fascinating</span><span class="gagaga">strawberries</span>
    
    (...)
    
    <span class="gugugug">254</span>
    
    <span class="krakaka">amazing</span><span class="gagaga">pineapples</span>
    
    (...)
    And it goes on.

    I need to extract the data between the <span> tags and put it in a bidimensional array. But what piece of code do I need in order to extract the words between the tags?

  2. #2
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    look at libxml. it can also be used with html, and I've heard it works quite well.

  3. #3
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    The 'piece of code' is called an HTML parser.

  4. #4
    Registered User
    Join Date
    Oct 2009
    Posts
    4
    Well, I don't really want to do anything other than extract the words between the tags. I want "1", "enchanted" and "chocolate bar" to be on the first vertical row of the bidimensional array, and "2", "very remarkable" and "flavored cookies" on the second vertical row of the bidimensional array, and so on.

    "gugugug", "krakaka" and "gagaga" are not important. So basically, it's not important that it's html.

    I know that if I use this:

    Code:
    char lol[] = { '<', 's', 'p', 'a', 'n', ' ', 'c', 'l', 'a', 's', 's', '=', '"', 'g', 'u', 'g', 'u', 'g', 'u', 'g', '"', '>', '\0' };
    I can treat the html bit as a simple string that I want to ignore.

    But I don't know how to get what is between the <span> and </span>.

    Is there a way to tell C++ to go to the end of <span class="gugugug">, add what is between the > and the </span> to an array, and then go to <span class="krakaka">, get what is between the > and < over there, and then continue in that style?

    I'm thinking it's simple code, so can you help me out?
    Last edited by purpleturple; 10-02-2009 at 02:18 PM.

  5. #5
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    And what if you get:

    Code:
    <span    class  =   'gugugug'         
    
    >
    >> I can treat the html bit as a simple string that I want to ignore.

    Sure, if you want to produce a fragile, broken piece of software.

    >> Is there a way to tell C++ to go to the end of <span class="gugugug">, add what is between the > and the </span> to an array, and then go to <span class="krakaka">, get what is between the > and < over there, and then continue in that style?

    You can 'tell C++' whatever you like, but the bottom line is you're going to either need to parse the data byte-by-byte, or find a library/API that can do it for you.

    Sounds like you need the data in an easier format, though, like CSV or such.

  6. #6
    Registered User
    Join Date
    Oct 2009
    Posts
    4
    Well, it's not a lot of data. I want to be able to get the important bits from a few small html files. Parsing byte-by-byte is fine.

    It's as though, instead of html, I had something like this:

    Code:
    >1<
    
    >enchanted<>chocolate bar<
    
    >2<
    
    >very remarkable<>flavored cookies<
    
    >3<
    
    >fascinating<>strawberries<
    
    (...)
    
    >254<
    
    >amazing<>pineapples<
    
    (...)
    I'm new, but you're really good, so can you help me, please? If you show me how to get what's between the > and <, I can build a loop with that.

  7. #7
    Deprecated Dae's Avatar
    Join Date
    Oct 2004
    Location
    Canada
    Posts
    1,034
    Haha purpleturple, inventing your own format? Just use XML or JSON, and use a library to parse it. I had issues with JSON, even though it's great, so I recommend XML (it's just very simple/structured HTML). I use Boost.Serialization (the xml parser, which probably just wraps libxml).

    You *could* also use Boost.Regex easily if the format is simple (which yours seems to be). You just need to figure out the correct regular expression.
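    As a rough illustration of the regular-expression route, here is a minimal sketch. It uses std::regex from C++11 rather than Boost.Regex (Boost.Regex offers a very similar interface through boost::regex and boost::sregex_iterator). The function name extract_span_text is made up for this example, and the pattern assumes that spans are never nested and that the text inside them contains no '<'.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the text found between each <span ...> and </span> pair, in order.
// Assumption: spans are not nested and their text contains no '<'.
std::vector<std::string> extract_span_text(const std::string &html) {
    // [^>]* skips the attributes, whatever spacing or line breaks they contain;
    // ([^<]*) captures everything up to the closing tag.
    static const std::regex span_re(R"(<span[^>]*>([^<]*)</span>)");

    std::vector<std::string> result;
    for (std::sregex_iterator it(html.begin(), html.end(), span_re), end;
         it != end; ++it) {
        result.push_back((*it)[1].str());   // capture group 1 is the text
    }
    return result;
}
```

    Because the attribute part is matched with [^>]*, the oddly spaced tag quoted below is handled as well. Anything more irregular than that is a job for a real HTML parser.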

    Quote: Originally posted by Sebastiani
    And what if you get:

    Code:
    <span    class  =   'gugugug'         
    
    >
    Strip/format the code before parsing, or have procedures to ignore whitespace at certain stages in the tree. I don't think that's an issue since he seems to be in control of his own format and files. However he should probably use a library just in case.
    Last edited by Dae; 10-02-2009 at 03:21 PM.

  8. #8
    Registered User
    Join Date
    Oct 2009
    Posts
    4
    Actually, I'm completely new to programming. I was just hoping there was a really simple, short solution, something about getting to the end of the <span class="whatever">, and then using getc until it gets to the <, and then just adding what's in-between the brackets to its appropriate position in the array, and then continuing the loop. (Which I don't know how to do.)

    Boost.Regex looks very complicated to me at this point, but I suppose I'll have to find a way to get my head around it if there's no other way. Though it almost looks like using a Rube Goldberg machine.
    Last edited by purpleturple; 10-02-2009 at 03:57 PM.

  9. #9
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    There is such a function as find, if that's what you're looking for. (Or alternatively, you can tell getline where to stop reading.)
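    A minimal sketch of the find approach, assuming the whole input has already been read into a std::string. The function name text_between_tags is invented for this example. It looks for a '>', then for the next '<', and keeps whatever lies in between unless it is empty or only whitespace.

```cpp
#include <string>
#include <vector>

// Collect every non-blank run of text that sits between a '>' and the next '<'.
std::vector<std::string> text_between_tags(const std::string &input) {
    std::vector<std::string> pieces;
    std::string::size_type pos = 0;

    while ((pos = input.find('>', pos)) != std::string::npos) {
        std::string::size_type start = pos + 1;            // first char after '>'
        std::string::size_type stop = input.find('<', start);
        if (stop == std::string::npos) {
            break;                                         // no further tag opens
        }
        std::string piece = input.substr(start, stop - start);
        // Skip the empty or whitespace-only gaps between one tag and the next.
        if (piece.find_first_not_of(" \t\r\n") != std::string::npos) {
            pieces.push_back(piece);
        }
        pos = stop;                                        // resume at the '<'
    }
    return pieces;
}
```

    The same function works on both the original HTML and the stripped-down ">1<" format from post #6, since it only ever looks at the two bracket characters.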

  10. #10
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    [edit] Drat, that "completely new to programming" post wasn't there when I started this. Sorry about that. I expect this is a bit too complicated. [/edit]

    It might be an interesting exercise for purpleturple to try to create a simple parser, nevertheless. I should note right here that creating your own parser is quite a bit of work, and it probably won't work perfectly, and so on; using someone else's code will make your life easier. But if you want to try it, or just wonder how such a library might go about the parsing (note that it's almost certainly more sophisticated than what I present here . . .), read on.

    The first thing to note is that your program needs to tell the difference between being "inside" a tag, and not. A proper parser would need to tell whether it's inside a start tag or an end tag, and remember the names of the tags to ensure that opening tags are matched by proper closing tags, and a lot of other information. But if you're willing to assume that your input is valid, and you are only interested in looking at certain parts of the file, you can get away with less.

    Let's start with figuring out whether you're inside a tag or not. It's really quite simple. Maintain a variable which either holds the value "inside" or "outside". When you see a '<', set it to "inside"; when you see a '>', set it to "outside". I think you can probably figure that part out. Something like this would do the trick:
    Code:
    #include <iostream>
    #include <fstream>
    #include <string>
    
    void parse(std::istream &stream);
    
    int main(int argc, char *argv[]) {
        if(argc < 2) {  // argv[0] is the program name, so a filename means argc >= 2
            std::cerr << "Usage: " << argv[0] << " <file.xml>\n";
            return 1;
        }
        
        std::ifstream stream(argv[1]);
        if(!stream.is_open()) {
            std::cerr << "Can't open \"" << argv[1] << "\"\n";
            return 1;
        }
        
        parse(stream);
        
        return 0;
    }
    
    void parse(std::istream &stream) {
        bool inside_tag = false;
        
        std::string line;
        while(std::getline(stream, line)) {
            for(std::string::size_type x = 0; x < line.length(); x ++) {
                if(line[x] == '<') {
                    inside_tag = true;
                }
                else if(line[x] == '>') {
                    inside_tag = false;
                }
                else {
                    // it's not a special character
                    if(!inside_tag) {
                        std::cout << line[x];
                    }
                }
            }
            
            std::cout << std::endl;
        }
    }
    That just prints any data which lies outside any XML (or HTML) tags. I'm using as my input
    Code:
    <?xml version="1.0"?>
    <root>
        <something>Greetings.</something>
        <something>Hello.</something>
    </root>
    Now you need to do some more processing when you're inside a tag, to find its name (if you care about it), attributes, and so on. It helps when you're doing this to have a get_token() function or equivalent, which skips whitespace and reads the next word in your input. But you have to be careful that as you're doing so, you don't miss any '<'s or '>'s which would change your state.

    Then once you get that working, you have to consider the first token inside a tag to be its name. Subsequent tokens should be followed by the token "=", and then some sort of double-quote token (or a single quote, if you allow those). And so on. Note that these are different types of tokens here; while looking for attribute names etc you want a token of alphanumeric characters; while looking for '=' or '"' you want a single, non-whitespace character token.
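    The two kinds of token described above could be sketched like this. This is only an illustration, not how any particular library does it: the function name tokenize_tag is hypothetical, and it assumes the caller has already isolated the text between '<' and '>'. A run of alphanumeric characters becomes one token; any other non-whitespace character becomes a token by itself.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split the inside of a tag (the text between '<' and '>') into tokens.
std::vector<std::string> tokenize_tag(const std::string &tag_body) {
    std::vector<std::string> tokens;
    std::string::size_type i = 0;

    while (i < tag_body.size()) {
        unsigned char c = static_cast<unsigned char>(tag_body[i]);
        if (std::isspace(c)) {
            ++i;                                  // whitespace only separates tokens
        } else if (std::isalnum(c)) {
            // Word token: consume the whole alphanumeric run.
            std::string::size_type start = i;
            while (i < tag_body.size() &&
                   std::isalnum(static_cast<unsigned char>(tag_body[i]))) {
                ++i;
            }
            tokens.push_back(tag_body.substr(start, i - start));
        } else {
            // Punctuation token such as '=', '"' or '\'': one character each.
            tokens.push_back(std::string(1, tag_body[i]));
            ++i;
        }
    }
    return tokens;
}
```

    For the tag body span class = 'gugugug' this yields the six tokens span, class, =, ', gugugug and '. An attribute value containing spaces or punctuation would be split into several tokens, so a fuller parser would switch to reading everything up to the matching quote once it sees an opening quote.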

    It gets complicated, as you see. Maybe you might want to look into some XML parsing library now . . . ?

  11. #11
    Registered User
    Join Date
    Dec 2007
    Posts
    2,675
    Yeah, this is an "oh, programming must be easy" post. Not quite, cupcake. It's why some of us get paid the big bucks

  12. #12
    Registered User
    Join Date
    Apr 2004
    Location
    Ohio
    Posts
    147
    Stop using char pointers and use strings and stringstreams. They have all the functions you need to do whatever you want. Look up std::string, std::stringstream and their respective built-in functions. Believe me, it's far easier to use those than to try to do it manually.

    Dae's suggestion about using Boost is good if you're able to get the libraries installed in your compiler paths, but if you're going for something simple, stick with the standard library's strings and containers.
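    To make the standard-library suggestion concrete, here is one possible sketch that uses only streams, std::string and std::vector. It relies on the three-argument form of std::getline, which stops at a delimiter of your choice, and groups the extracted text into rows to give the "bidimensional array" asked for in the first post. The names Row and read_rows are invented for this example, and the input is assumed to be as regular as the sample shown in the thread.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;

// Read the text between tags and group it into rows of fields_per_row entries.
// A trailing row with fewer entries than that is dropped.
std::vector<Row> read_rows(std::istream &in, std::size_t fields_per_row) {
    std::vector<Row> table;
    Row current;
    std::string skipped, text;

    // The first getline discards everything up to and including the next '>';
    // the second collects everything up to the following '<'.
    while (std::getline(in, skipped, '>') && std::getline(in, text, '<')) {
        if (text.find_first_not_of(" \t\r\n") == std::string::npos) {
            continue;                       // empty or blank gap between two tags
        }
        current.push_back(text);
        if (current.size() == fields_per_row) {
            table.push_back(current);       // row is complete, start the next one
            current.clear();
        }
    }
    return table;
}
```

    Calling read_rows with a std::ifstream opened on the HTML file and a row width of 3 should give a table in which table[0] holds "1", "enchanted" and "chocolate bar", table[1] holds the second group, and so on. A std::istringstream works the same way for testing.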
