-
Help w/ HTML Tags
Hey everyone,
I am trying to write a function that will remove all tags from an HTML file so that only the actual text remains. I've got my code written, but it just isn't working right: it keeps brackets and other crap when it shouldn't.
Has anyone done this before who could point me in the right direction with some pseudocode? Not even that, just a description of how you would do it.
Thanks
The Landroid
-
Just curious, why do you want to remove the tags from an HTML file? If you want to get the text of an HTML file, you can just view the source and copy the text portion.
-
Well, I am building a search engine.
-
Well, I don't know much about building a search engine, but instead of removing all the HTML tags, wouldn't ignoring them give you the same result?
-
Well, that is in a sense what I am doing. I'm just reading in the HTML line by line and then saving everything that is not a tag to one big char*.
The problem is that sometimes I skip an entire line because it is all tags, so it saves some of the spaces from that line. And my stack is not working for the brackets that are in the text, so I am trying to figure out a simple algorithm for this. My first try didn't do so well.
The Landroid
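For the stray spaces left behind when a whole line is tags, one option is to stop worrying about them while extracting and squeeze the whitespace afterwards. A minimal sketch, assuming the extracted text already sits in a std::string (collapseWhitespace is a made-up name, not something from the code in this thread):

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: squeeze every run of whitespace into one space
// and drop leading and trailing whitespace.
std::string collapseWhitespace(const std::string& text)
{
    std::string out;
    bool pendingSpace = false;              // saw whitespace since the last word
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(text[i]);
        if (std::isspace(ch)) {
            pendingSpace = true;            // remember it, but emit nothing yet
        } else {
            if (pendingSpace && !out.empty())
                out += ' ';                 // one space between words
            pendingSpace = false;
            out += static_cast<char>(ch);
        }
    }
    return out;
}
```

For example, collapseWhitespace("  Parse ME!\n\n   5<6 \n") gives "Parse ME! 5<6".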
-
I've built something like that... I set it up so that it read a stream (file), and when it encountered a "<X", where 'X' is any alphanumeric character or slash, it would ignore everything until the next "Y>" (Y being another alphanumeric character or quote, not necessarily the same) was encountered. This has its pitfalls though. Consider this page, where the '<' in 5<6 is followed by a digit and so looks like the start of a tag:
Code:
<HTML>
<HEAD><TITLE>Parse ME!</TITLE></HEAD>
<BODY>
5<6
</BODY>
</HTML>
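One way to soften that pitfall, as a rough sketch only: treat '<' as the start of a tag only when the next character is a letter, '/' or '!'. HTML tag names start with a letter, so "<6" stays text. This is narrower than the alphanumeric rule described above. stripTags is a made-up name, and the '!' case (comments, doctype), the quote handling, and replacing each tag with a space are assumptions added for the example.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the text of `html` with tags removed.
// A '<' opens a tag only when followed by a letter, '/' or '!';
// any other '<' is copied through as ordinary text.
std::string stripTags(const std::string& html)
{
    std::string text;
    bool inTag = false;     // currently between '<' and '>'
    char quote = 0;         // active quote character inside a tag, or 0

    for (std::size_t i = 0; i < html.size(); ++i) {
        const char ch = html[i];
        if (inTag) {
            if (quote) {
                if (ch == quote) quote = 0;     // quoted literal ends
            } else if (ch == '"' || ch == '\'') {
                quote = ch;                     // quoted literal begins
            } else if (ch == '>') {
                inTag = false;                  // tag ends
                text += ' ';                    // keep neighbouring words apart
            }
        } else if (ch == '<' && i + 1 < html.size() &&
                   (std::isalpha(static_cast<unsigned char>(html[i + 1])) ||
                    html[i + 1] == '/' || html[i + 1] == '!')) {
            inTag = true;                       // looks like a real tag
        } else {
            text += ch;                         // ordinary text, including a lone '<'
        }
    }
    return text;
}
```

With this, the 5<6 line in the page above comes through untouched. Text like a<b would still be swallowed as a tag, so it is a heuristic and not a real HTML parser.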
vv edit vv
Here's some code I wrote that you may be able to use. It goes through an HTML page and converts all the HTML tags to uppercase. For example, it will turn this:
Code:
<html>
<head><title>Parse ME!</title></head>
<body>
<div name="DivOne">
test
</div>
</body>
</html>
into this:
Code:
<HTML>
<HEAD><TITLE>Parse ME!</TITLE></HEAD>
<BODY>
<DIV NAME="DivOne">
test
</DIV>
</BODY>
</HTML>
Here's the source (it's very messy and bug-ridden for now):
Code:
/*
HTML
DESCRIPTION:
This program is meant to go through HTML and make everything inside the
tags uppercase, with the exception of things in quotes (literals).
KNOWN BUGS:
-> If you have a script or something in your page that uses the
   comparison operator '<' then you may have a problem because this
   program will read it as an opening tag and it won't stop until it
   reaches the next '>' comparison operator or the closing end of a
   legal HTML tag.
-> Alternatively, if you have any HTML-style tags in your text, or a '<'
   character, the same thing will happen. Text will be turned into
   uppercase characters until a '>' is reached.
-> HTML comments starting with <!-- will also be turned uppercase until
   the program reaches the first '>' in the comment or at the end. My
   suggestion is for you to use javascript to comment out what you
   don't want, because that style of commenting won't be hurt by this
   program.
-> copyFile can output the last character twice (only into the file
   and not on the screen) if it loops on !infile.eof(); looping on
   infile.get(ch) avoids this.
-> Literals must be in double quotes, because single quotes are also
   used as apostrophes, and if there's an apostrophe without a closing
   quote (single or double), the program would run off the end of the
   document and keep on chugging along.
John Shao [[email protected]]
*/
#include<iostream> //for standard input and output
#include<fstream> //for file input and output
#include<string> //for the string class
#include<cctype> //for toupper
#include<cstdlib> //for exit and system
void copyFile(std::ifstream&,std::ofstream&,std::string); //copy the temp file to the perm file
void deleteFile(std::string); //delete the temp file
void putTag(std::ofstream&,std::string&,const bool); //to output the current tag
int main()
{
    const bool DEBUG_LOOPS=false; //true to show loops
    const bool SHOW_VERBOSE=true; //true to show the file being processed
    const bool DEBUG_FILENAME=false; //true to show filename
    std::ifstream infile; //for input file stream
    std::ofstream ofile; //for output file stream
    std::string filename; //for the filename
    std::string tag; //for the tag name
    char ch; //to hold each character
    for(;;) //main program loop
    {
        filename.clear(); //clear the filename
        std::cout<<"Enter the Filename: "; //prompt for the filename
        std::getline(std::cin,filename,'\n'); //take in the filename
        if(DEBUG_FILENAME)
        {
            std::cout<<filename.c_str()<<std::endl;
            std::cin.get();
        }
        if(filename=="~") //if the user wanted to exit
            return 0; //get out of the program
        try{ //use exception handling
            infile.open(filename.c_str(),std::ios::in); //initialize the input stream
            if(!infile) //if the stream has a fatal flag set
                throw("Input File Not Opened"); //throw an exception
            filename="~"+filename; //add a tilde to the beginning of the filename
            ofile.open(filename.c_str(),std::ios::trunc); //initialize the output stream
            if(!ofile) //if the stream has a fatal flag set
                throw("Output File Not Opened"); //throw an exception
            while(infile.get(ch)) //loop through the entire file
            {
                if(DEBUG_LOOPS)
                    std::cerr<<"DEBUG_LOOPS: while(infile.get(ch))\n";
                if(ch!='<') //if it's not the beginning of a tag
                {
                    ofile<<ch;
                    if(SHOW_VERBOSE)
                        std::cout<<ch;
                }
                else //if it's the beginning of a tag
                {
                    while(ch!='>') //while it's not at the end of the tag
                    {
                        if(DEBUG_LOOPS)
                            std::cerr<<"DEBUG_LOOPS: while(ch!='>')\n";
                        if(ch=='\"') //if it's a literal
                        {
                            putTag(ofile,tag,SHOW_VERBOSE);
                            //output what you have of the tag
                            do{ //go through that literal
                                if(DEBUG_LOOPS)
                                    std::cerr<<"DEBUG_LOOPS: do{...}while(ch!='\"');\n";
                                ofile<<ch; //put it back as it was
                                if(SHOW_VERBOSE)
                                    std::cout<<ch;
                            }while(infile.get(ch) && ch!='\"'); //take it as is until the closing quote or end of file
                            if(!infile) //unterminated literal
                                break; //stop instead of looping forever
                        }
                        tag+=ch; //add the character to the tag
                        if(!infile.get(ch)) //take in a character
                            break; //unterminated tag: stop instead of looping forever
                    }
                    putTag(ofile,tag,SHOW_VERBOSE);
                    //output the rest of the tag
                    if(infile) //only if the tag was actually closed
                    {
                        ofile<<ch; //put the '>' back
                        if(SHOW_VERBOSE)
                            std::cout<<ch;
                    }
                }
            } //loop through the file
        } catch(const char*e) { //catch any exceptions (string literals are const char*)
            std::cerr<<e; //output any exceptions
            std::cin.get();
            return 1;
        }
        ofile.close();
        ofile.flush();
        ofile.clear();
        infile.close();
        infile.clear();
        copyFile(infile,ofile,filename);
        deleteFile(filename);
        std::cout<<std::endl; //flush the buffer and put the first line on a new line
    } //main program loop
}
void putTag(std::ofstream &ofile,std::string &tag,const bool SHOW_VERBOSE)
{
    for(std::size_t index=0;index<tag.length();index++) //go through the string
        tag[index]=static_cast<char>(toupper(static_cast<unsigned char>(tag[index]))); //turn everything into uppercase
    ofile<<tag.c_str(); //output it to the file
    if(SHOW_VERBOSE)
        std::cout<<tag.c_str();
    tag.clear(); //clear the contents of the tag
}
void copyFile(std::ifstream &infile,std::ofstream &ofile,std::string filename)
{
    char ch; //to hold each character
    filename=filename.substr(1); //get the original filename back
    ofile.open(filename.c_str(),std::ios::trunc); //open the original file
    filename="~"+filename; //turn it back into the temp filename
    infile.open(filename.c_str(),std::ios::in); //open the temp file
    while(infile.get(ch)) //loop through the temp file, stopping when a read fails
    {
        ofile<<ch; //output every character that was actually read
    }
    ofile.close(); //close the output file stream
    ofile.flush(); //flush the output file stream
    ofile.clear(); //clear the output file stream flags
    infile.close(); //close the input file stream
    infile.clear(); //clear the input file stream flags
}
void deleteFile(std::string filename)
{
    filename="del \"" + filename + "\""; //the windows command to delete a file (quoted so names with spaces work)
    system(filename.c_str()); //execute the command
}
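The comment bug listed under KNOWN BUGS could probably be worked around by checking for "<!--" before treating '<' as a tag, and copying everything up to the matching "-->" untouched. Here is a rough in-memory sketch of that idea. It is not the program above: upperCaseTags is a made-up name, and it assumes the whole page has already been read into a std::string.

```cpp
#include <cctype>
#include <string>

// Hypothetical in-memory variant: uppercases everything inside tags
// except double-quoted literals, and copies <!-- ... --> comments as is.
std::string upperCaseTags(const std::string& html)
{
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html.compare(i, 4, "<!--") == 0) {
            // Copy the whole comment verbatim, up to and including "-->"
            // (or to the end of the input if the comment is never closed).
            std::size_t end = html.find("-->", i + 4);
            end = (end == std::string::npos) ? html.size() : end + 3;
            out.append(html, i, end - i);
            i = end;
        } else if (html[i] == '<') {
            bool inLiteral = false;
            // Walk to the closing '>' or the end of the input, whichever comes first.
            while (i < html.size()) {
                const char ch = html[i++];
                if (ch == '"') inLiteral = !inLiteral;
                if (inLiteral) {
                    out += ch;              // literals keep their case
                } else {
                    out += static_cast<char>(
                        std::toupper(static_cast<unsigned char>(ch)));
                    if (ch == '>') break;   // end of this tag
                }
            }
        } else {
            out += html[i++];               // ordinary text
        }
    }
    return out;
}
```

A stray '<' in the text (the 5<6 case) is still read as a tag here, so that bug stays.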