How would one begin?

09-11-2005
Mr_Acclude

How would one even begin to attempt this problem? Are there any places I can look for help with it?

Given a number of web pages, you are to develop a tool to create a cross-reference for the links in the pages. Specifically, for each web page, you will generate two lists:
1. The list of files that it references.
2. The list of files that reference it.

One can use a cross-reference like this to:
- identify those files that are isolated; that is, no other file refers to them. (Note that ``index.html'' is often an isolated file, even though it is a very useful file!)

- for each file on the list, determine which files/URLs it references. This is useful to make sure that all these files still exist.

- for each file on the list, determine all files/URLs that refer to it. This is useful when the webmaster decides to either eliminate a page or merge it into another page.
09-11-2005
JaWiB

Well, the first part would probably be pretty easy, except maybe parsing an HTML file (I guess you'd just look for "<a href="). Once you figure out how to find all the pages that it links to, write those to a file.

You'd probably do that for each page first, then go back and, for each page, look through all the files you just created to see if they have a link to that page.
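
To make the "look for <a href=" idea concrete, here is a rough C++ sketch. The function name extractLinks is made up for illustration, and it assumes lowercase tags with double-quoted attribute values, exactly as in the sample page. Real HTML can be messier (uppercase tags, single quotes, extra attributes before href), so treat this as a starting point rather than a proper parser.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collect the quoted target of every <a href="...">
// found in one chunk of HTML text (a single line or a whole file).
// Assumes the exact spelling <a href=" with a double quote.
std::vector<std::string> extractLinks(const std::string& html) {
    const std::string marker = "<a href=\"";
    std::vector<std::string> links;
    std::string::size_type pos = 0;
    while ((pos = html.find(marker, pos)) != std::string::npos) {
        std::string::size_type start = pos + marker.size();
        std::string::size_type end = html.find('"', start);
        if (end == std::string::npos) {
            break;  // no closing quote; stop scanning this chunk
        }
        links.push_back(html.substr(start, end - start));
        pos = end + 1;  // continue after the closing quote
    }
    return links;
}
```

Calling it on each line read from a page gives you the "refers to" list for that page.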
09-11-2005
Mr. Acclude

Yeah, we're studying recursion, so I guess that's what I have to do.
09-13-2005
Mr. Acclude

I know how to open an individual file. But the teacher puts like 10 HTML files into one folder. So how do I open multiple files without even knowing the names of these HTML files? Is that even possible?

----------------------------------------------------------------------------------
The input to this program is the name of a file that lists all the files your program should use to generate the cross-reference. For example, suppose the input file contains the following names:

index.html
faculty.html

Your program will visit each of these files and, for each file, it will find all ``anchor'' tags and enumerate all file names that they reference. For example, suppose ``faculty.html'' contains:

<html><head><title>Departmental Faculty</title></head>
<body>
<H3>Professor</H3>
<a href="ledin.html">George Ledin</a>
<a href="stauffer.html">Lynn Stauffer</a>

...

</body>
</html>

Then, your program should provide the following lists:

faculty.html refers to
ledin.html
stauffer.html

faculty.html is referred to by
index.html

Note: I have assumed that "index.html" refers to "faculty.html" through an anchor tag.
09-13-2005
JaWiB

Someone correct me if I'm wrong, but I don't believe there is a standard way to find all the files in a folder (meaning you would have to write OS-specific code). You can look here for a solution for Windows.
09-13-2005
Mr. Acclude

Actually, this is going to be written on Linux. Here is the project description: http://www.cs.sonoma.edu/~kooshesh/c...5project1.html

If it were just one file, I would be able to open it and use getline to read a line at a time to find <a href=. However, we never learned about parsing. Here is also a sample class he showed us that we can use:

http://www.cs.sonoma.edu/~kooshesh/c...pleClasses.tar