Thread: Getting links from html page

  1. #1
    Registered User
    Join Date
    Sep 2007
    Posts
    5

    Getting links from html page

    Hey all, I am new here.
    I have a problem with the following code:
    Code:
    // buf holds the raw html line
    char buf[1000];
    // this stores 1 link
    char link[1000];
    char *start = strstr(buf, "<a href");
    start = strchr(start+1, '>')+1;
    char *end = strstr(buf, "</a>");
    memset(link, 0, 1000);
    memcpy(start, link, end-start);
    I want to copy the HTML links from buf into link[].
    I am doing something wrong. Can someone help me?

  2. #2
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    FYI, it doesn't have to be '<a href=' in that order; it could be '<a name... href='. So your method of extracting links is flawed.

    Consider searching for '<a', then finding the first 'href=' between that '<a' and the closing '>'. It doesn't matter to you whether they closed the 'a' tag or not.

    Remember there is a lot of invalid HTML: some people use quotes, some people don't. You should be able to handle ', " and no quotes.

    ie:
    Code:
    <a href=http://google.com>
    <a href='http://google.com'>
    <a href="http://google.com">
    Consider rethinking your approach. What exactly are you having problems with?

    Also, try not to hard-code the array size into memset; use sizeof(link) in place of that 1000.
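    A minimal sketch of the search described above, assuming the page is already in a NUL-terminated string. The function name first_href and its return convention are made up for illustration. It only looks at the first '<a' tag, matches a lowercase 'href=' only, and gives up if that tag has no href at all, so treat it as a starting point rather than a robust parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the href value of the first <a ...> tag
 * in html into out. Returns 1 on success, 0 if no link was found or
 * the value does not fit in out. */
int first_href(const char *html, char *out, size_t outsize)
{
    assert(html != NULL && out != NULL && outsize > 0);

    /* Find "<a" followed by whitespace, skipping <abbr>, <address>, ... */
    const char *tag = strstr(html, "<a");
    while (tag != NULL && !isspace((unsigned char)tag[2]))
        tag = strstr(tag + 1, "<a");
    if (tag == NULL)
        return 0;

    /* href= must appear before the '>' that closes this tag. */
    const char *tag_end = strchr(tag, '>');
    const char *href = strstr(tag, "href=");
    if (tag_end == NULL || href == NULL || href > tag_end)
        return 0;

    const char *val = href + 5;          /* first character after "href=" */
    size_t len;
    if (*val == '"' || *val == '\'') {
        char quote = *val++;             /* quoted: read up to the same quote */
        const char *close = strchr(val, quote);
        if (close == NULL)
            return 0;
        len = (size_t)(close - val);
    } else {
        len = strcspn(val, " \t\r\n>");  /* unquoted: stop at space or '>' */
    }

    if (len >= outsize)                  /* leave room for the terminator */
        return 0;
    memcpy(out, val, len);
    out[len] = '\0';
    return 1;
}
```

    All three quoting styles listed above go through the same function; only the way the end of the value is found differs.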

  3. #3
    Registered User
    Join Date
    Sep 2007
    Posts
    5
    The problem is that even with my implementation I would expect something to end up in the link array, but it is completely empty. If I could capture even part of the link, I could then adjust the HTML tags I search for to get the exact part I want.
    I hope that makes it clearer what I want to do.

  4. #4
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    Maybe post a larger section of code? Like when you actually fill 'buf' and such... If it's too large use pastebin or something.

  5. #5
    Registered User
    Join Date
    Sep 2007
    Posts
    5
    I am doing a GET on Google search pages, and the code is something like this:
    Code:
    //There's code before
    while ((read(sockfd, buf, sizeof(buf)-1))) {
        clean_buf(buf);
    }
    void clean_buf(char *buf){
        // buf holds the raw html line
        char buf[1000];

        // this stores 1 link
        char link[1000];

        char *start = strstr(buf, "<a href");
        start = strchr(start+1, '>')+1;
        char *end = strstr(buf, "</a>");

        memset(link, 0, 1000);
        memcpy(start, link, end-start);
        printf("%s\n",link); //Here is where i want to print the links
    }

  6. #6
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    If you're not forced to use C then I'd suggest you get on the Perl bandwagon for something like this.

    Otherwise, you've got a bit of a logic problem: if you get an '<a' in, say, the 999th spot of that array, then you'll have a problem.

    * You're not null-terminating the string properly, i.e. if the link is 1000 characters you'll have a problem
    * Links longer than 1000 characters will break your program, not to mention a potential segfault if </a> is more than 1000 characters away from <a href=
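    For what it's worth, here is a sketch of the copy step from the posted code with those two points guarded against. The name copy_anchor_text is made up for illustration, and it assumes buf is already NUL-terminated (read() does not do that for you). Two things differ from the posted code beyond the bounds check: memcpy() takes the destination first and the source second, so the posted call memcpy(start, link, ...) copies the zeroed link array over the HTML rather than the other way round, and the search for "</a>" starts at start instead of at the beginning of buf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the text between the first <a href...> tag
 * and the following </a> into link. Returns 1 on success, 0 if a marker
 * is missing or the text does not fit. */
int copy_anchor_text(const char *buf, char *link, size_t linksize)
{
    assert(buf != NULL && link != NULL && linksize > 0);
    link[0] = '\0';

    const char *start = strstr(buf, "<a href");
    if (start == NULL)
        return 0;                       /* strstr can fail: check before use */
    start = strchr(start, '>');
    if (start == NULL)
        return 0;
    start++;                            /* first character after the tag */

    const char *end = strstr(start, "</a>");   /* search from start, not buf */
    if (end == NULL)
        return 0;

    size_t len = (size_t)(end - start);
    if (len >= linksize)                /* keep one byte for the terminator */
        return 0;

    memcpy(link, start, len);           /* destination first, then source */
    link[len] = '\0';
    return 1;
}
```

    Called as copy_anchor_text(buf, link, sizeof(link)), this also avoids hard-coding 1000 in more than one place.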

    Consider:
    * Allocating memory for the entire page, i.e. grow a larger buffer each time you recv() and append 'buf' to it
    * Introducing a state machine and getting characters one by one from the network stream (introduces inefficiency problems)

    There are probably a few other ways to go about it.

    Note: There are several advantages to allocating memory for the whole page. For example, you could just replace </a> with a NUL terminator and keep an array of pointers, each pointing to the start of a link (i.e. href=) or something.
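    A rough sketch of the first option, growing one buffer as chunks arrive. The struct page type and the page_append name are assumptions made up for this example, and the doubling growth policy is just one common choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the whole page. Start it zeroed:
 *   struct page pg = {0};                                       */
struct page {
    char  *data;   /* NUL-terminated contents, or NULL when empty */
    size_t len;    /* bytes stored, not counting the terminator   */
    size_t cap;    /* bytes allocated                             */
};

/* Appends n bytes from chunk, growing the buffer when needed.
 * Returns 0 on success, -1 if realloc fails (old data stays valid). */
int page_append(struct page *p, const char *chunk, size_t n)
{
    assert(p != NULL && chunk != NULL);

    if (p->len + n + 1 > p->cap) {
        size_t newcap = p->cap ? p->cap : 1024;
        while (newcap < p->len + n + 1)
            newcap *= 2;                     /* double until it fits */
        char *tmp = realloc(p->data, newcap);
        if (tmp == NULL)
            return -1;
        p->data = tmp;
        p->cap = newcap;
    }

    memcpy(p->data + p->len, chunk, n);
    p->len += n;
    p->data[p->len] = '\0';                  /* always keep it a C string */
    return 0;
}

/* Possible use with the read loop from the earlier post:
 *   ssize_t n;
 *   while ((n = read(sockfd, buf, sizeof buf)) > 0)
 *       page_append(&pg, buf, (size_t)n);
 *   ... scan pg.data for links, then free(pg.data);
 */
```

    Because the whole page ends up in one string, a tag that straddles two reads is no longer a problem.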
    Last edited by zacs7; 09-30-2007 at 06:16 AM.

  7. #7
    Registered User
    Join Date
    Sep 2007
    Posts
    5
    First of all thanks for your interest.
    Would it be better to use a struct to hold buf and the links, and use dynamic memory allocation (DMA) for the arrays?
    PS: I have to do it in C; I know that it would be easier and faster with Perl.
    Last edited by g0_sh; 09-30-2007 at 06:17 AM.

  8. #8
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    Not really, I'd just have a large DMA'd buffer which I resized when I got data from recv(). I'd then append the 'buf' from recv() to the larger DMA'd buffer.

    After I had finished recv()'ing the whole page, I'd then scan it for links creating an array of pointers (or a struct with pointers) which point to the links, links which I nul-terminated by replacing '</a>' with '\0'. Then I'd print the links and free the DMA'd buffer.

    ie:
    DMA buffer before parsing:
    Code:
    <html>
    <head>
    <title>Google</title>
    </head>
    <body>
    <a href="http://google.com.au/">Google Link</a>
    <a href="http://yahoo.com/">Yahoo link</a>
    </body>
    </html>
    And after:
    Code:
    <html>
    <head>
    <title>Google</title>
    </head>
    <body>
    <a href="Phttp://google.com.au/0>PGoogle Link0/a>
    <a href="Phttp://yahoo.com/0>PYahoo link0/a>
    </body>
    </html>
    Where 0 = NUL terminator and P is a pointer (i.e. keep the address in a table, struct, array, whatever).

    Then you only have to keep pointers, no more sizing memory or link length limit:

    eg:
    Code:
    typedef struct link_t {
        char * title;
        char * url;
    } link;
    You could then keep an array of 'links' and DMA / resize it at will.
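    The in-place scan described above might look roughly like this. The struct is the one from the post; the find_links name and the fixed-size output array are assumptions for the sake of a short example, and the sketch only handles double-quoted href values that come directly after '<a '.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct link_t {
    char * title;
    char * url;
} link;

/* Hypothetical helper: scans page in place and fills out[] with up to
 * max entries. The page is modified: the closing quote of each URL and
 * the '<' of each </a> are overwritten with '\0', so the stored pointers
 * are ordinary C strings living inside the page buffer. */
size_t find_links(char *page, link *out, size_t max)
{
    assert(page != NULL && out != NULL);

    size_t count = 0;
    char *p = page;
    while (count < max && (p = strstr(p, "<a href=\"")) != NULL) {
        char *url = p + 9;                    /* skip the 9 chars of <a href=" */
        char *url_end = strchr(url, '"');
        if (url_end == NULL)
            break;
        char *title = strchr(url_end, '>');
        if (title == NULL)
            break;
        title++;                              /* first char of the link text */
        char *title_end = strstr(title, "</a>");
        if (title_end == NULL)
            break;

        *url_end = '\0';                      /* terminate the URL */
        *title_end = '\0';                    /* terminate the title */
        out[count].url = url;
        out[count].title = title;
        count++;

        p = title_end + 1;                    /* continue after this link */
    }
    return count;
}
```

    Nothing is copied, so there is no per-link length limit. The pointers are only valid until the page buffer is freed.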
    Last edited by zacs7; 09-30-2007 at 06:29 AM.

  9. #9
    Registered User
    Join Date
    Sep 2007
    Posts
    5
    Thanks a lot, man, for your advice.
    I'll try it right now :P

  10. #10
    Registered User
    Join Date
    Sep 2007
    Posts
    11
    Too bad you have to do it in C... well.
    Hmm, javascript:alert(document.links[0]) ... if you could get access to this links array from C, I think it would be good.
    It would be amusingly easier in JavaScript. It works with client-side files, e.g. if you save the Google source to your hard drive, so you would get local file:// type links
    if they use relative pathnames; if not, this should work. Of course, like the person said before, if it's bad code that's a different story.
    Code:
    <script type="text/javascript">
    for (var x=0;x<document.links.length;x++)
    document.write(document.links[x]+ "<br />");
    </script>
