Thread: Help w/ HTML Tags

  1. #1
    Registered User
    Join Date
    Feb 2005
    Posts
    9

    Help w/ HTML Tags

    Hey everyone,

    I am trying to write a function that will remove all tags from an HTML file so that only the actual text remains. I've got my code written, but it just isn't working right: it's keeping brackets and other crap when it shouldn't.

    Has anyone done this before who could point me in the right direction with some pseudocode? Not even that, just a description of how you would do it.


    Thanks

    The Landroid

  2. #2
    Registered User
    Join Date
    Mar 2004
    Posts
    494
    Just curious, why do you want to remove the tags from an HTML file? If you want to get the text of an HTML file, you can just view the source and copy the text portion.
    When no one helps you out. Call google();

  3. #3
    Registered User
    Join Date
    Feb 2005
    Posts
    9
    Well, I am building a search engine.

  4. #4
    Registered User
    Join Date
    Mar 2004
    Posts
    494
    Well, I don't know much about building a search engine, but instead of removing all the HTML tags, wouldn't ignoring them give you the same result?

  5. #5
    Registered User
    Join Date
    Feb 2005
    Posts
    9
    Well, that is in a sense what I am doing. I'm just reading in the HTML line by line and then saving everything that is not a tag to one big char*.

    The problem is that sometimes I skip an entire line because it is all tags, so it saves some of the spaces from that line. My stack is also not working for the brackets that are in the text, so I am trying to figure out a simple algorithm for this. My first try didn't do so well.
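
    For the stray spaces left behind by lines that were all tags, one option is to normalize whitespace after the tags are gone. The sketch below is only an illustration: collapseWhitespace is a hypothetical helper name, and it assumes the extracted text is already in a std::string rather than a char*.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): squeezes every run of
// whitespace into a single space and drops leading/trailing whitespace.
std::string collapseWhitespace(const std::string& text)
{
    std::string out;
    bool pendingSpace = false;  // true when a gap is waiting to be written

    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (std::isspace(static_cast<unsigned char>(text[i])))
        {
            // Remember the gap, but never start the output with one.
            pendingSpace = !out.empty();
        }
        else
        {
            if (pendingSpace)
                out += ' ';
            pendingSpace = false;
            out += text[i];
        }
    }
    return out;  // a trailing gap is never written
}
```

    Calling collapseWhitespace on the text saved from a page turns the leftover newlines and indentation into single spaces, which is usually enough for indexing words in a search engine.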


    The Landroid

  6. #6
    Registered User major_small's Avatar
    Join Date
    May 2003
    Posts
    2,787
    I've built something like that. I set it up so that it read a stream (file), and when it encountered a "<X", where 'X' is any alphanumeric character or slash, it would ignore everything until the next "Y>" (Y being another alphanumeric character or quote, not necessarily the same) was encountered. This has its pitfalls though. Consider this page, where the "<6" in the body text looks like the start of a tag:
    Code:
    <HTML>
    <HEAD><TITLE>Parse ME!</TITLE></HEAD>
    <BODY>
    5<6
    </BODY>
    </HTML>
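
    A minimal sketch of that kind of scanner, for illustration only. It is a variation on the rule described above, not the original code: here a '<' only opens a tag when the next character is a letter, '/' or '!', so the "5<6" in the page above survives. stripTags is a hypothetical name, and the whole page is assumed to be in a std::string already.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): returns the input with
// anything that looks like a tag removed.
std::string stripTags(const std::string& html)
{
    std::string text;
    bool inTag = false;  // true while skipping the inside of a tag

    for (std::string::size_type i = 0; i < html.size(); ++i)
    {
        const char ch = html[i];
        if (inTag)
        {
            if (ch == '>')  // end of the tag; resume copying text
                inTag = false;
        }
        else if (ch == '<' && i + 1 < html.size() &&
                 (std::isalpha(static_cast<unsigned char>(html[i + 1])) ||
                  html[i + 1] == '/' || html[i + 1] == '!'))
        {
            inTag = true;   // looks like the start of a tag; skip it
        }
        else
        {
            text += ch;     // ordinary text, including a lone '<'
        }
    }
    return text;
}
```

    This still has pitfalls of its own: text such as "a<b" is taken for a tag, and a '>' inside a quoted attribute value ends the tag early.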


    vv edit vv


    Here's some code I wrote that you may be able to use. It goes through an HTML page and converts all the HTML tags to uppercase. For example, it will turn this:
    Code:
    <html>
    <head><title>Parse ME!</title></head>
    <body>
    <div name="DivOne">
    test
    </div>
    </body>
    </html>
    into this:
    Code:
    <HTML>
    <HEAD><TITLE>Parse ME!</TITLE></HEAD>
    <BODY>
    <DIV NAME="DivOne">
    test
    </DIV>
    </BODY>
    </HTML>
    here's the source (it's very messy and bug-ridden for now):
    Code:
    /*
        HTML
        
        DESCRIPTION:
            This program is meant to go through HTML and make everything inside the
            tags uppercase, with the exception of things in quotes (literals).
        
        KNOWN BUGS:
            -> If you have a script or something in your page that uses the
                comparison operator '<' then you may have a problem because this
                program will read it as an opening tag and it won't stop until it
                reaches the next '>' comparison operator or the closing end of a
                legal HTML tag.
            -> Alternatively, if you have any HTML-style tags in your text, or a '<'
                character, the same thing will happen.  text will be turned into
                uppercase characters until a '>' is reached.
            -> HTML comments starting with <!-- will also be turned uppercase until
                the program reaches the first '>' in the comment or at the end.  My
                suggestion is for you to use javascript to comment out what you
                don't want, because that style of commenting won't be hurt by this
                program
            -> The program outputs the last character twice, but only into the file
                and not on the screen...
            -> Literals must be in double quotes, because single quotes are also
                used as apostrophes, and if there's an apostrophe without a closing
                quote (single or double), the program would run off the end of the
                document and keep on chugging along.
        
        John Shao [[email protected]]
    */
    
    #include<iostream>  //for standard input and output
    #include<fstream>   //for file input and output
    #include<string>    //for the string class
    #include<cctype>    //for toupper
    #include<cstdlib>   //for exit and system
    
    void copyFile(std::ifstream&,std::ofstream&,std::string); //copy the temp file to the perm file
    void deleteFile(std::string);                        //delete the temp file
    void putTag(std::ofstream&,std::string&,const bool); //to output the current tag
    
    int main()
    {
        const bool DEBUG_LOOPS=false;       //true to show loops
        const bool SHOW_VERBOSE=true;       //true to show the file being processed
        const bool DEBUG_FILENAME=false;    //true to show filename
        
        std::ifstream infile;   //for input file stream
        std::ofstream ofile;    //for output file stream
        std::string filename;   //for the filename
        std::string tag;        //for the tag name
        char ch;                //to hold each character
        
        for(;;) //main program loop
        {
            filename.clear();   //clear the filename
            std::cout<<"Enter the Filename: ";      //prompt for the filename
            std::getline(std::cin,filename,'\n');   //take in the filename
            
            if(DEBUG_FILENAME)
            {
                std::cout<<filename.c_str()<<std::endl;
                std::cin.get();
            }
    
            if(filename=="~")   //if the user wanted to exit
                return 0;       //get out of the program
    
            try{    //use exception handling
                infile.open(filename.c_str(),std::ios::in); //initialize the input stream
                if(!infile) //if the stream has a fatal flag set
                    throw("Input File Not Opened"); //throw an exception
    
                filename="~"+filename;  //add a tilde to the beginning of the filename
                ofile.open(filename.c_str(),std::ios::trunc);   //initialize the output stream
                if(!ofile)  //if the stream has a fatal flag set
                    throw("Output File Not Opened");    //throw an exception
    
                while(infile.get(ch))    //loop through the entire file
                {
                    if(DEBUG_LOOPS)
                        std::cerr<<"DEBUG_LOOPS: while(infile.get(ch))\n";
                        
                    if(ch!='<') //if it's not the beginning of a tag
                    {
                        ofile<<ch;
                        if(SHOW_VERBOSE)
                            std::cout<<ch;
                    }
                    else        //if it's the beginning of a tag
                    {
                        while(infile && ch!='>')  //while it's not at the end of the tag (or of the file)
                        {
                            if(DEBUG_LOOPS)
                                std::cerr<<"DEBUG_LOOPS: while(ch!='>')\n";
    
                            if(ch=='\"')    //if it's a literal
                            {
                                putTag(ofile,tag,SHOW_VERBOSE);
                                //output what you have of the tag
                                
                                do{ //go through that literal 
                                    if(DEBUG_LOOPS)
                                        std::cerr<<"DEBUG_LOOPS: do{...}while(ch!='\"');\n";
                                        
                                    ofile<<ch;  //put it back as it was
                                    if(SHOW_VERBOSE)
                                        std::cout<<ch;
                                }while(infile.get(ch) && ch!='\"');  //take the next character as is; stop at the closing quote or end of file
                                
                                if(!infile) //the file ended inside the literal
                                    break;  //nothing more to read
                            }
                            tag+=ch;    //add the character to the tag
                            infile.get(ch);  //take in a character
                        }
                        
                        putTag(ofile,tag,SHOW_VERBOSE);
                        //output the rest of the tag
    
                        if(infile)  //only if the tag was actually closed
                        {
                            ofile<<ch;  //put the '>' back
                            if(SHOW_VERBOSE)
                                std::cout<<ch;
                        }
                    }
                }   //loop through the file
            } catch(const char*e) {   //catch any exceptions (string literals are thrown as const char*)
                std::cerr<<e;   //output any exceptions
                std::cin.get();
                exit(1);
            }
    
            ofile.flush();
            ofile.close();
            ofile.clear();
            
            infile.close();
            infile.clear();
            
            copyFile(infile,ofile,filename);
            deleteFile(filename);
            
            std::cout<<std::endl;   //flush the buffer and put the first line on a new line
        }   //main program loop
    }
    
    void putTag(std::ofstream &ofile,std::string &tag,const bool SHOW_VERBOSE)
    {
        for(std::string::size_type index=0;index<tag.length();index++) //go through the string
            tag[index]=static_cast<char>(toupper(static_cast<unsigned char>(tag[index]))); //turn everything into uppercase
    
        ofile<<tag.c_str();  //output it to the file
        if(SHOW_VERBOSE)
            std::cout<<tag.c_str();
            
        tag.clear();    //clear the contents of the tag
    }
    void copyFile(std::ifstream &infile,std::ofstream &ofile,std::string filename)
    {
        char ch;    //to hold each character
    
        filename=filename.substr(1);    //get the original filename back
        ofile.open(filename.c_str(),std::ios::trunc);   //open the original file
        filename="~"+filename;          //turn it back into the temp filename
        infile.open(filename.c_str(),std::ios::in);     //open the temp file
        
        while(infile.get(ch))   //loop through the temp file, stopping as soon as a read fails
        {
            ofile<<ch;          //output every character that was actually read
        }                       //(testing eof() before the read is what wrote the last character twice)
        
        ofile.flush();  //flush the output file stream
        ofile.close();  //close the output file stream
        ofile.clear();  //clear the output file stream flags
        
        infile.close(); //close the input file stream
        infile.clear(); //clear the input file stream flags
    }
    void deleteFile(std::string filename)
    {
        filename="del \"" + filename + "\"";    //the windows command to delete a file (quoted in case the name has spaces)
        system(filename.c_str());   //execute the command
    }
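
    For the text-extraction case from the first post, the comment problem listed under KNOWN BUGS can be sidestepped by dropping <!-- ... --> blocks before any tag handling runs. The following is a sketch under that assumption, not part of the program above: removeComments is a hypothetical name, the page is assumed to be in a std::string, and an unterminated comment is assumed to swallow the rest of the input.

```cpp
#include <string>

// Hypothetical helper (not part of the program above): removes every
// <!-- ... --> block from the input.
std::string removeComments(const std::string& html)
{
    const std::string open = "<!--";
    const std::string close = "-->";
    std::string out;
    std::string::size_type pos = 0;

    while (pos < html.size())
    {
        const std::string::size_type start = html.find(open, pos);
        if (start == std::string::npos)
        {
            out.append(html, pos, std::string::npos);  // no more comments
            break;
        }
        out.append(html, pos, start - pos);            // text before the comment

        const std::string::size_type end = html.find(close, start + open.size());
        if (end == std::string::npos)
            break;                                     // unterminated: drop the rest
        pos = end + close.size();                      // continue after "-->"
    }
    return out;
}
```

    Because the search looks for the full "-->" terminator, a '>' inside the comment no longer ends it early.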
    Last edited by major_small; 03-08-2005 at 08:28 PM.
