Getting links from html page

**g0_sh** · 09-30-2007

Hey all, I am new here.
I have a problem with the following code:

Code:

// buf holds the raw html line
char buf[1000];
// this stores 1 link
char link[1000];
char *start = strstr(buf, "<a href");
start = strchr(start+1, '>')+1;
char *end = strstr(buf, "</a>");
memset(link, 0, 1000);
memcpy(start, link, end-start)

I want to copy the HTML links from the buffer buf into link[].
I am doing something wrong; can someone help me?

**zacs7** · 09-30-2007

FYI, it doesn't have to be '<a href=' in that order; it could be '<a name... href='. So your method of extracting links is flawed.

Consider searching for '<a', then finding the first 'href=' between that '<a' and the closing '>'. It doesn't matter to you whether they closed the 'a' tag or not.

Remember there is a lot of invalid HTML: some people use quotes, some people don't. You should be able to handle single quotes, double quotes, and no quotes.

ie:

Code:

<a href=http://google.com>
<a href='http://google.com'>
<a href="http://google.com">

Consider re-thinking your approach, what exactly are you having problems with?

Also try not to hard code the array size into memset, use sizeof(link) in place of that 1000.
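
For illustration, the search described above (find '<a', then the first 'href=' before the closing '>', accepting either quote style or none) might look like the sketch below. This is an editorial example, not code from the thread: extract_href is a hypothetical helper, it only matches lowercase "<a" and "href=", and it assumes the whole tag already sits in one NUL-terminated buffer.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies the href value of the first <a ...> tag in html into out.
 * Returns 1 on success, 0 if no href was found or it did not fit.
 * extract_href is a hypothetical helper name, not from the thread.
 */
int extract_href(const char *html, char *out, size_t out_size)
{
    const char *tag = html;

    while ((tag = strstr(tag, "<a")) != NULL) {
        const char *tag_end = strchr(tag, '>');
        if (tag_end == NULL)
            return 0;                       /* tag is cut off in this buffer */

        /* Require whitespace after "<a" so "<abbr" or "<address" do not match. */
        const char *href =
            isspace((unsigned char)tag[2]) ? strstr(tag, "href=") : NULL;
        if (href == NULL || href > tag_end) {
            tag = tag_end;                  /* no href in this tag, try the next */
            continue;
        }

        const char *val = href + 5;         /* skip "href=" */
        size_t len;
        if (*val == '"' || *val == '\'') {
            char quote = *val++;
            const char *close = strchr(val, quote);
            if (close == NULL)
                return 0;                   /* unterminated quote */
            len = (size_t)(close - val);
        } else {
            /* Unquoted value: runs until whitespace or the closing '>'. */
            len = strcspn(val, " \t\r\n>");
        }

        if (len + 1 > out_size)
            return 0;                       /* would not fit, refuse to truncate */
        memcpy(out, val, len);
        out[len] = '\0';
        return 1;
    }
    return 0;
}
```

Called on each of the three example tags above, it should yield http://google.com in every case.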

**g0_sh** · 09-30-2007

The problem is that even with my implementation I think something should end up in the link array, but the array is completely empty. If I could save even part of the link, I could then adjust the HTML tags I search for to get the exact part I want.
I hope that makes it clearer what I want to do.

**zacs7** · 09-30-2007

Maybe post a larger section of code? Like when you actually fill 'buf' and such... If it's too large use pastebin or something.

**g0_sh** · 09-30-2007

I am doing a GET on Google search pages, and the code is something like this:

Code:

//There's code  before	
while ((read(sockfd,buf , sizeof(buf)-1))) {
	
	clean_buf(buf);
}
void clean_buf(char *buf){
// buf holds the raw html line
char buf[1000];

// this stores 1 link
char link[1000];


char *start = strstr(buf, "<a href");
start = strchr(start+1, '>')+1;
char *end = strstr(buf, "</a>");

memset(link, 0, 1000);
memcpy(start, link, end-start);
printf("%s\n",link); //Here is where i want to print the links 
}

**zacs7** · 09-30-2007

If you're not forced to use C then I'd suggest you get on the Perl bandwagon for something like this.

Otherwise, you've got a bit of a logic problem: if you get an '<a' in, say, the 999th spot of that array, the rest of the tag arrives in the next read and you'll have a problem.

* You're not null-terminating the string properly... ie if the link is 1000 characters you'll have a problem
* Links longer than 1000 characters will break your program, not to mention a potential segfault if </a> is more than 1000 characters away from <a href=
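
One thing worth adding to the list above: memcpy takes the destination first, so memcpy(start, link, end-start) copies the zeroed link over buf rather than the other way round, which would explain why link stays empty. A minimal corrected sketch of the posted extraction follows, assuming buf is NUL-terminated. copy_anchor_text is a hypothetical name, and like the posted code it copies the text between "<a href...>" and "</a>" rather than the URL itself.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copies the text between the first "<a href...>" and the following "</a>"
 * into link. Returns 1 on success, 0 if a marker is missing or the text
 * does not fit. copy_anchor_text is a hypothetical name, not from the thread.
 */
int copy_anchor_text(const char *buf, char *link, size_t link_size)
{
    const char *start = strstr(buf, "<a href");
    if (start == NULL)
        return 0;                           /* strstr() can fail: always check */

    start = strchr(start, '>');
    if (start == NULL)
        return 0;
    start++;                                /* first character after '>' */

    /* Search from start, not from buf, so an earlier "</a>" cannot match. */
    const char *end = strstr(start, "</a>");
    if (end == NULL)
        return 0;

    size_t len = (size_t)(end - start);
    if (len >= link_size)
        return 0;                           /* leave room for the terminator */

    memcpy(link, start, len);               /* destination first, then source */
    link[len] = '\0';
    return 1;
}
```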

Consider:
* Allocating memory for the entire page, ie resize a larger buffer each time you recv() and add 'buf' to it
* Introduce a state-machine and get characters 1-by-1 from the network stream (introduces inefficiency problems)

There are probably a few other ways to go around it.
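
As a rough sketch of the state-machine option above: the scanner below is fed one character at a time, so a tag split across two reads is not a problem. It is an editorial illustration, not code from the thread. The names link_scanner and scanner_feed are hypothetical, only lowercase "href=" is recognised, and URLs longer than the fixed url array are truncated.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum state { S_TEXT, S_LT, S_A, S_ATTRS, S_VALUE_START, S_VALUE, S_SKIP };

/* link_scanner and scanner_feed are hypothetical names, not from the thread. */
struct link_scanner {
    enum state state;   /* zero-initialise the struct to start in S_TEXT */
    size_t matched;     /* how many characters of "href=" matched so far */
    char quote;         /* active quote character, or 0 for an unquoted value */
    size_t len;         /* number of characters stored in url */
    char url[1000];     /* longer URLs are truncated to fit */
};

/* Feeds one character; returns 1 when s->url holds a complete href value. */
int scanner_feed(struct link_scanner *s, char c)
{
    static const char key[] = "href=";

    switch (s->state) {
    case S_TEXT:
        if (c == '<') s->state = S_LT;
        break;
    case S_LT:
        s->state = (c == 'a' || c == 'A') ? S_A : (c == '<' ? S_LT : S_TEXT);
        break;
    case S_A:
        /* Only "<a" followed by whitespace is an anchor tag ("<abbr" is not). */
        s->matched = 0;
        s->state = isspace((unsigned char)c) ? S_ATTRS : S_TEXT;
        break;
    case S_ATTRS:
        if (c == '>') { s->state = S_TEXT; break; }
        if (c == key[s->matched]) s->matched++;
        else s->matched = (c == key[0]) ? 1 : 0;
        if (s->matched == sizeof key - 1) s->state = S_VALUE_START;
        break;
    case S_VALUE_START:
        s->len = 0;
        s->quote = 0;
        s->state = S_VALUE;
        if (c == '"' || c == '\'') { s->quote = c; break; }
        /* fall through: c is the first character of an unquoted value */
    case S_VALUE:
        if ((s->quote && c == s->quote) ||
            (!s->quote && (c == '>' || isspace((unsigned char)c)))) {
            s->url[s->len] = '\0';
            s->state = (c == '>') ? S_TEXT : S_SKIP;
            return 1;
        }
        if (s->len < sizeof s->url - 1) s->url[s->len++] = c;
        break;
    case S_SKIP:
        if (c == '>') s->state = S_TEXT;    /* ignore the rest of the tag */
        break;
    }
    return 0;
}
```

To use it, start from struct link_scanner s = {0}; and call scanner_feed(&s, c) for every byte received, printing s.url whenever it returns 1.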

Note: There are several advantages to allocating memory for the whole page. For example, you could just replace each </a> with a NUL terminator and keep an array of pointers to the start of each link (ie href=) or something.

**g0_sh** · 09-30-2007

First of all, thanks for your interest.
Would it be better to use a struct to hold the buf and the links, and use DMA (dynamic memory allocation) for the arrays?
PS: I have to do it in C; I know that it would be easier and faster with Perl.

**zacs7** · 09-30-2007

Not really, I'd just have a large DMA'd buffer which I resized when I got data from recv(). I'd then append the 'buf' from recv() to the larger DMA'd buffer.
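
A minimal sketch of such a resizable buffer, assuming realloc() is used for the resizing. The names page_buf and page_append are hypothetical and not from the thread; the 4096-byte starting size is an arbitrary choice.

```c
#include <stdlib.h>
#include <string.h>

/* page_buf and page_append are hypothetical names, not from the thread. */
struct page_buf {
    char *data;     /* NUL-terminated once something has been appended */
    size_t len;     /* bytes stored, not counting the terminator */
    size_t cap;     /* bytes allocated */
};

/* Appends n bytes from chunk; returns 0 on success, -1 if out of memory. */
int page_append(struct page_buf *p, const char *chunk, size_t n)
{
    if (p->len + n + 1 > p->cap) {
        size_t new_cap = p->cap ? p->cap : 4096;
        while (new_cap < p->len + n + 1)
            new_cap *= 2;                   /* doubling keeps reallocations rare */

        char *tmp = realloc(p->data, new_cap);
        if (tmp == NULL)
            return -1;                      /* p->data is still valid, caller frees */
        p->data = tmp;
        p->cap = new_cap;
    }
    memcpy(p->data + p->len, chunk, n);
    p->len += n;
    p->data[p->len] = '\0';                 /* so strstr() and friends are safe */
    return 0;
}
```

In the receive loop this would be used roughly as while ((n = recv(sockfd, buf, sizeof buf, 0)) > 0) page_append(&page, buf, (size_t)n); with struct page_buf page = {0}; declared beforehand and free(page.data) called once the links have been printed. Keeping the byte count returned by recv() matters, because the data it delivers is not NUL-terminated.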

After I had finished recv()'ing the whole page, I'd then scan it for links creating an array of pointers (or a struct with pointers) which point to the links, links which I nul-terminated by replacing '</a>' with '\0'. Then I'd print the links and free the DMA'd buffer.

ie:
DMA buffer before parsing:

Code:

<html>
<head>
<title>Google</title>
</head>
<body>
<a href="http://google.com.au/">Google Link</a>
<a href="http://yahoo.com/">Yahoo link</a>
</body>
</html>

And after:

Code:

<html>
<head>
<title>Google</title>
</head>
<body>
<a href="Phttp://google.com.au/0>PGoogle Link0/a>
<a href="Phttp://yahoo.com/0>PYahoo link0/a>
</body>
</html>

Where 0 = NUL terminator and P is a pointer (ie keep the address in a table, struct, array, whatever).

Then you only have to keep pointers, no more sizing memory or link length limit:

eg:

Code:

typedef struct link_t {
    char * title;
    char * url;
} link;

You could then keep an array of 'links' and DMA / resize it at will.
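
Putting the pieces above together, an in-place scan that fills a growing array of these structs might look like the sketch below. It is an editorial illustration under a few assumptions: find_links is a hypothetical name, only double-quoted href values are handled, and the typedef is called html_link here because a file-scope name link would clash with the POSIX link() function once <unistd.h> is included for read().

```c
#include <stdlib.h>
#include <string.h>

typedef struct link_t {
    char *title;
    char *url;
} html_link;

/*
 * Scans page in place, overwriting the closing quote of each href and the
 * '<' of each "</a>" with '\0'. Returns a malloc'd array of entries that
 * point into page (caller frees the array), or NULL if nothing was found or
 * memory ran out. find_links is a hypothetical name, not from the thread.
 */
html_link *find_links(char *page, size_t *count)
{
    html_link *links = NULL;
    size_t n = 0, cap = 0;
    char *p = page;

    *count = 0;
    while ((p = strstr(p, "<a ")) != NULL) {
        char *tag_end = strchr(p, '>');
        char *url = strstr(p, "href=\"");
        if (tag_end == NULL || url == NULL)
            break;                          /* no complete link left */
        if (url > tag_end) {                /* this tag has no href, skip it */
            p = tag_end + 1;
            continue;
        }
        url += 6;                           /* skip href=" */

        /* Locate every boundary before modifying the buffer. */
        char *url_end = strchr(url, '"');
        char *title = url_end ? strchr(url_end, '>') : NULL;
        char *title_end = title ? strstr(title, "</a>") : NULL;
        if (title_end == NULL)
            break;
        title++;                            /* text starts after '>' */

        if (n == cap) {                     /* grow the pointer array */
            size_t new_cap = cap ? cap * 2 : 8;
            html_link *tmp = realloc(links, new_cap * sizeof *tmp);
            if (tmp == NULL) {
                free(links);
                return NULL;
            }
            links = tmp;
            cap = new_cap;
        }

        *url_end = '\0';                    /* terminate the URL in place */
        *title_end = '\0';                  /* terminate the title in place */
        links[n].url = url;
        links[n].title = title;
        n++;
        p = title_end + 1;                  /* resume after the terminator */
    }
    *count = n;
    return links;
}
```

Since the entries only point into the page buffer, the buffer has to stay allocated until the links have been printed.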

**g0_sh** · 09-30-2007

Thanks a lot for your advice, man.
I'll try it right now :P

**xsouldeath** · 09-30-2007

Too bad you have to do it in C... well,
hmm, javascript:alert(document.links[0])... if you could get access to this links array from C, I think it would be good.
It would be amusingly easier in JavaScript.

Well, it works with client-side files, such as if you save the Google source to your hard drive. You would get local file:// type links
if they use relative pathnames; if not, this should work. Of course, like the person said before, if it's bad code that's a different story.

Code:

<script type="text/javascript">
for (var x=0;x<document.links.length;x++)
document.write(document.links[x]+ "<br />");
</script>
