How would one begin?
How would one even begin attempting this problem? Are there any places I can look that would help me get started?
Given a number of web pages, you are to develop a tool to create a cross-reference for the links in the pages. Specifically, for each web page, you will generate two lists:
1. The list of files that it references.
2. The list of files that reference it.
One can use a cross-reference like this to
- identify those files that are isolated; that is, no other file refers to them. (Note that "index.html" is often an isolated file, even though it is a very useful file!)
- for each file on the list, determine which files/URLs it references. This is useful to make sure that all these files still exist.
- for each file on the list, determine all files/URLs that refer to it. This is useful when the web-master decides to either eliminate a page or merge it into another page.
Well, the first part would probably be pretty easy, except maybe parsing an HTML file (I guess you'd just look for "<a href="). Once you figure out how to find all the pages that it links to, write those to a file.
You'd probably do that for each page first, then go back and, for each page, look through all the files you just created to see if they have a link to that page.
Yeah, we're studying recursion, so I guess that's what I have to do.
I know how to open an individual file. But the teacher puts like 10 HTML files into one folder. So how do I open multiple files without even knowing the names of these HTML files? Is that even possible?
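Instead of writing intermediate files and re-reading them, the second pass can also be done in memory. Here is a minimal sketch of that variation, assuming the "refers to" lists have already been collected into a map; the names LinkMap and invertLinks are made up for illustration and are not part of the assignment.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical alias: page name -> set of page names.
using LinkMap = std::map<std::string, std::set<std::string>>;

// Given "page -> pages it refers to", build "page -> pages that refer to it".
LinkMap invertLinks(const LinkMap& refersTo) {
    LinkMap referredBy;
    for (const auto& entry : refersTo) {
        // Give every scanned page an entry, so isolated pages
        // show up with an empty set instead of being missing.
        referredBy[entry.first];
        for (const auto& target : entry.second) {
            referredBy[target].insert(entry.first);
        }
    }
    return referredBy;
}
```

Any page whose set comes back empty is an isolated file in the sense described above.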
The input to this program is the name of a file that contains all files that your program should use to generate the cross-reference. For example, suppose the input file contains the following names.
Your program will visit each of these files and, for each file, find all "anchor" tags and enumerate all file names that they reference. For example, suppose "faculty.html" contains:
<a href="ledin.html">George Ledin</a>
<a href="stauffer.html">Lynn Stauffer</a>
Then, your program should provide the following lists:
faculty.html refers to
faculty.html is referred to by
Note: I have assumed that "index.html" refers to "faculty.html" through an anchor tag.
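Pulling the targets out of anchor tags like the ones in the faculty.html example does not need a full HTML parser. The sketch below uses plain string searching and makes several simplifying assumptions: the tag is written exactly as <a href="...">, in lower case, with double quotes, and it does not span lines. The name extractHrefs is made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return every target found in <a href="..."> on one line of text.
// Assumption: the attribute is written exactly as <a href="...".
std::vector<std::string> extractHrefs(const std::string& line) {
    const std::string marker = "<a href=\"";
    std::vector<std::string> targets;
    std::size_t pos = 0;
    while ((pos = line.find(marker, pos)) != std::string::npos) {
        std::size_t start = pos + marker.size();
        std::size_t end = line.find('"', start);
        if (end == std::string::npos) {
            break;  // no closing quote on this line, so stop looking
        }
        targets.push_back(line.substr(start, end - start));
        pos = end + 1;  // continue searching after this link
    }
    return targets;
}
```

Called on each line read with getline, this would return "ledin.html" and "stauffer.html" for the two lines shown from faculty.html. Real-world HTML (upper-case tags, single quotes, extra attributes before href) would need more care.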
Someone correct me if I'm wrong, but I don't believe there is a standard way to find all the files in a folder (meaning you would have to write OS-specific code). You can look here for a solution for Windows.
Actually, this is going to be written on Linux. Here is the project description: http://www.cs.sonoma.edu/~kooshesh/c...5project1.html
If it were just one file, I would be able to open it and use getline to read a line at a time to find <a href=. However, we never learned about parsing. Here is also a sample class he showed us that we can use