-
Finding URLs
I have a program that can now find text within quotes in an HTML file (special thanks to Hammer). Here is what I ended up with:
Code:
#include <stdio.h>
#include <stdlib.h>

#define SIZE 500000
#define MAX_URLS 150
#define MAX_LEN 512
#define INURL 1
#define NOT_INURL 0

/* static, so the large buffers do not live on the stack */
static char html[SIZE + 1];
static char url[MAX_URLS][MAX_LEN];

int main(void)
{
    int i, b = 0, d = 0;
    int matches = 0;
    int State = NOT_INURL;
    size_t n;
    char c;
    FILE *fp = fopen("c:\\blah.html", "r");

    if (fp == NULL)
    {
        perror("c:\\blah.html");
        return EXIT_FAILURE;
    }
    n = fread(html, sizeof(char), SIZE, fp);
    html[n] = '\0'; /* fread does not terminate the buffer */
    fclose(fp);

    for (i = 0; (c = html[i]) != '\0' && b < MAX_URLS; i++)
    {
        if (c == '\"')
        {
            if (State == INURL)
            {
                /* closing quote: finish the current string */
                url[b][d] = '\0';
                printf("%s\n", url[b]);
                b++;
                d = 0; /* start the next string at index 0 */
                matches++;
            }
            State = !State;
            continue;
        }
        if (State == INURL && d < MAX_LEN - 1)
        {
            url[b][d++] = c;
        }
    }
    printf("There are %d matches\n\n\n\n\n", matches);
    return 0;
}
However, this method doesn't single out URLs; it picks up every string within quotes. Do any of you have ideas on how I can single out the quoted URLs? I was thinking of first searching for a quote and then, if the next four characters were "http", continuing to stuff characters into the array. Or I could search through the array after it finishes finding all the quoted text and look for "http" in each entry, but I don't know the code for throwing out elements of an array and having everything reordered, unless I rewrote the array and deleted the old one, which is inefficient. I was wondering if anybody has any ideas about this dilemma, and I would really appreciate it if someone could throw some code ideas at me.
-
Well, you toggle the state on the known start and end patterns of URLs rather than on quotes.
> if (c == '\"')
Basically, change this.
-
But I still need to use the quotes. I want it to read in http and beyond, and then stop when it hits the second quote. Or I might just not understand what you are saying.
-
I mean your tests for 'in' and 'out' of a URL need to be more sophisticated than comparing a single character, but the method is essentially the same.
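Since the whole file is already sitting in a buffer, one way to make the 'in' test look at more than a single character is to compare the next few characters with strncmp. This is only a rough sketch: find_quoted_urls and URL_LEN are made-up names, and it assumes a URL is any double-quoted string that starts with "http".

```c
#include <string.h>

#define URL_LEN 512   /* assumed maximum length, same as the original url array */

/* Copies every quoted string that starts with "http" from html into urls[].
   Returns the number of URLs stored. Hypothetical helper, not library code. */
int find_quoted_urls(const char *html, char urls[][URL_LEN], int max_urls)
{
    int found = 0;
    const char *p = html;

    while (found < max_urls && (p = strchr(p, '"')) != NULL)
    {
        const char *start = p + 1;             /* first char after the opening quote */
        const char *end = strchr(start, '"');  /* matching closing quote */
        size_t len;

        if (end == NULL)
            break;                             /* unterminated string: stop scanning */

        len = (size_t)(end - start);
        if (strncmp(start, "http", 4) == 0 && len < URL_LEN)
        {
            memcpy(urls[found], start, len);   /* keep only strings that look like URLs */
            urls[found][len] = '\0';
            found++;
        }
        p = end + 1;                           /* resume after the closing quote */
    }
    return found;
}
```

Nothing is written to the array until the test has passed, so there is never anything to throw away or reorder afterwards. The buffer must be null-terminated before it is passed in.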
-
That's the problem: I do not know any other way to read in the file except character by character. I am wondering if there is any code for something like this:
If you hit quote, you are in the URL.
If the first character is 'h', keep going.
If the second character is 't', keep going (etc. up until "http").
The problem with that is that when it hits a quote, it starts writing to the array. But if, for example, the second character is not 't', how would I reset the array back to the first index so it can be rewritten?
-
That is the whole concept of a state machine, though a grammar would be equally effective.
You look at the character, and the state you're in and determine what your next state should be.
Eg.
Code:
enum states {
    S_NONE,
    S_TAG,
    S_QUOTED_STRING, /* to hide say "this is how a url begins - http://" */
    S_ESCAPE_CHAR,
    S_COMMENT,
};
Then you basically have a switch statement with a case for each state:
Code:
while ( (ch=fgetc(fp)) != EOF ) {
    switch ( state ) {
    case S_COMMENT:
        if ( ch == '\n' ) state = S_NONE; // comment ends at a newline
        break;
    // etc
    }
}
Though you'd probably want to make each state a separate function if life is going to get complicated.
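To tie this back to the "http" question, here is a minimal sketch of a character-at-a-time scanner. The state names, struct scanner, scanner_init and scanner_feed are all invented for illustration, and it assumes a URL is a double-quoted string beginning with "http". Note that "resetting the array" is nothing more than setting the index back to 0.

```c
#include <string.h>

#define URL_MAX 512  /* assumed buffer size, matching the original url array */

enum scan_state {
    S_OUTSIDE,  /* not inside a quoted string */
    S_PREFIX,   /* inside quotes, still matching "http" */
    S_URL,      /* prefix matched, collecting the rest */
    S_SKIP      /* inside quotes, but not a URL */
};

struct scanner {
    enum scan_state state;
    char buf[URL_MAX];
    size_t len;
};

void scanner_init(struct scanner *s)
{
    s->state = S_OUTSIDE;
    s->len = 0;
}

/* Feeds one character. Returns 1 when s->buf holds a complete URL, else 0. */
int scanner_feed(struct scanner *s, int ch)
{
    static const char prefix[] = "http";

    switch (s->state) {
    case S_OUTSIDE:
        if (ch == '"') {
            s->len = 0;               /* the "reset": start again at index 0 */
            s->state = S_PREFIX;
        }
        break;
    case S_PREFIX:
        if (ch == '"') {
            s->state = S_OUTSIDE;     /* string ended before "http" was complete */
        } else if (ch == prefix[s->len]) {
            s->buf[s->len++] = (char)ch;
            if (s->len == strlen(prefix))
                s->state = S_URL;
        } else {
            s->state = S_SKIP;        /* mismatch: ignore the rest of this string */
        }
        break;
    case S_URL:
        if (ch == '"') {
            s->buf[s->len] = '\0';
            s->state = S_OUTSIDE;
            return 1;
        }
        if (s->len < URL_MAX - 1)
            s->buf[s->len++] = (char)ch;  /* overly long URLs are truncated */
        break;
    case S_SKIP:
        if (ch == '"')
            s->state = S_OUTSIDE;
        break;
    }
    return 0;
}
```

It would be driven by the same kind of loop as above: call scanner_init once, then for each ch returned by fgetc call scanner_feed(&s, ch) and print or copy s.buf whenever it returns 1. Only one working buffer is needed, and a string that turns out not to be a URL is simply overwritten by the next one.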