-
How would one begin?
How would one even begin attempting this problem? Are there any places I can look that would help me with it?
Given a number of web-pages, you are to develop a tool to create a cross-reference for the links in the pages. Specifically, for each web-page, you will generate two lists:
1. The list of files that it references.
2. The list of files that reference it.
One can use a cross-reference like this to
-identify those files that are isolated; that is, no other file refers to them. (Note that many times ``index.html'' is an isolated file, even though it is a very useful file!)
-for each file on the list, determine which files/URLs it references. This is useful to make sure that all these files still exist.
-for each file on the list, determine all files/URLs that refer to a given file. This is useful when the web-master decides to either eliminate a page or merge it into another page.
-
Well, the first part would probably be pretty easy, except maybe parsing an HTML file (I guess you'd just look for "<a href="). Once you figure out how to find all the pages that it links to, write those to a file.
You'd probably do that for each page first, then go back and, for each page, look through all the files you just created to see if they have a link to that page.
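A rough sketch of that second pass, in case it helps. It is not the assignment's required design: instead of writing intermediate files it keeps the "refers to" lists in memory, and the names `LinkMap` and `invertLinks` are made up for illustration.

```cpp
#include <map>
#include <set>
#include <string>

// Maps a page name to a set of other page names (hypothetical type name).
using LinkMap = std::map<std::string, std::set<std::string>>;

// Given "page -> pages it references", build "page -> pages that reference it".
LinkMap invertLinks(const LinkMap& refersTo) {
    LinkMap referredBy;
    for (const auto& entry : refersTo) {
        const std::string& source = entry.first;
        referredBy[source];  // make sure every scanned page appears, even if isolated
        for (const std::string& target : entry.second) {
            referredBy[target].insert(source);
        }
    }
    return referredBy;
}
```

Any page whose set comes back empty is one that no other scanned page links to.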
-
Yeah, we're studying recursion, so I guess that's what I have to do.
-
I know how to open an individual file. But the teacher puts like 10 HTML files into one folder. So how do I open multiple files without even knowing the names of these HTML files? Is that even possible?
----------------------------------------------------------------------------------
The input to this program is the name of a file that contains all files that your program should use to generate the cross-reference. For example, suppose the input file contains the following names.
index.html
faculty.html
Your program will visit each of these files and for each file, it will find all ``anchor'' tags and enumerate all file names that they reference. For example, suppose ``faculty.html'' contains:
<html><head><title>Departmental Faculty</title></head>
<body>
<H3>Professor</H3>
<a href="ledin.html">George Ledin</a>
<a href="stauffer.html">Lynn Stauffer</a>
...
</body>
</html>
Then, your program should provide the following lists:
faculty.html refers to
ledin.html
stauffer.html
faculty.html is referred to by
index.html
Note: I have assumed that "index.html" refers to "faculty.html" through an anchor tag.
-
Someone correct me if I'm wrong, but I don't believe there is a standard way to find all the files in a folder (meaning you would have to write OS-specific code). You can look here for a solution for Windows.
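That said, if the assignment text quoted above is read literally, the program is handed a file that already lists the page names, so scanning the folder may not be needed at all. A minimal sketch under that assumption (one name per line; the function names are made up for illustration):

```cpp
#include <fstream>
#include <istream>
#include <string>
#include <vector>

// Read one file name per line from an already opened stream, skipping blank lines.
std::vector<std::string> readFileNames(std::istream& in) {
    std::vector<std::string> names;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) names.push_back(line);
    }
    return names;
}

// Open the list file at listPath and return the names in it.
// If the file cannot be opened, the result is simply empty.
std::vector<std::string> readFileNamesFrom(const std::string& listPath) {
    std::ifstream list(listPath);
    return readFileNames(list);
}
```

Each returned name can then be opened with its own std::ifstream in a loop, the same way a single file would be.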
-
Actually, this is going to be written in Linux. Here is the project description: http://www.cs.sonoma.edu/~kooshesh/c...5project1.html
If it was just one file, I would be able to open it and use getline to read a line at a time to find <a href=. However, we never learned about parsing. Here is also a sample class he showed us that we can use:
http://www.cs.sonoma.edu/~kooshesh/c...pleClasses.tar
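For what it's worth, the getline idea can get fairly far without any real parsing. Below is a rough sketch that pulls the link targets out of one line using std::string::find. It assumes the tag is spelled exactly `<a href="` in lowercase with a double-quoted value, as in the faculty.html sample, and `extractLinks` is a made-up name, not something from the sample classes.

```cpp
#include <string>
#include <vector>

// Collect the targets of every <a href="..."> found in one line of text.
// Assumes the exact lowercase spelling `<a href="` with a double-quoted value.
std::vector<std::string> extractLinks(const std::string& line) {
    const std::string marker = "<a href=\"";
    std::vector<std::string> links;
    std::string::size_type pos = 0;
    while ((pos = line.find(marker, pos)) != std::string::npos) {
        std::string::size_type start = pos + marker.size();
        std::string::size_type end = line.find('"', start);
        if (end == std::string::npos) break;  // unterminated quote, give up on this line
        links.push_back(line.substr(start, end - start));
        pos = end + 1;  // keep scanning after the closing quote
    }
    return links;
}
```

Calling this on each line read by std::getline would collect the "refers to" list for one page. Tags split across lines, uppercase tags, or single quotes would be missed by this sketch.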