Thread: How to extract url from strings

  1. #1
    Registered User
    Join Date
    Aug 2007
    Posts
    32

    How to extract url from strings

    Hi,
    With the function below I can extract URLs from strings, but it only works with URLs that begin with "http://". How can I change it to also extract URLs that begin with "www"?

    Thanks in advance.

    Code owner: James K. Lawless
    Code:
    void print_urls(char *s) {
       char *p,*mark;
       _state=0;
       for(p=s;*p;p++) {
          switch(_state) {
             case 0:
                if(*p=='h') {
                   _state=S_h;
                   mark=p;
                } 
                break;
             case S_h:
                if(*p=='t') 
                   _state=S_t1;
                else
                   _state=0;
                break;
             case S_t1:
                if(*p=='t') 
                   _state=S_t2;
                else
                   _state=0;
                break;
             case S_t2:
                if(*p=='p') 
                   _state=S_p;
                else
                   _state=0;
                break;
             case S_p:
                if(*p==':') 
                   _state=S_col;
                else
                if(*p=='s')
                   _state=S_s;
                else
                   _state=0;
                break;
             case S_s:
                if(*p==':')
                   _state=S_col;
                else
                   _state=0;
                break;
             case S_col:
                if(strchr(legal_chars,tolower(*p))==NULL) {
                   while(mark<p) {
                      fputc(*mark,stdout);
                      mark++;
                   }
                   fputc('\n',stdout);
                   _state=0;
                   p--; // backtrack
                } 
          }
       }
       if(_state) {
          while(mark<p) {
             fputc(*mark,stdout);
             mark++;
          }
       }
    }
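One possible way to accept "www." as well, sketched here as a prefix scan rather than as extra states in the state machine above. The helper name next_url, the prefixes table, and the contents of legal_chars are assumptions for illustration; the original legal_chars definition is not shown in the post.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Assumed set of characters allowed inside a URL (lowercase only;
   input is lowered before the lookup). Not the original definition. */
static const char legal_chars[] =
    "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%";

/* Prefixes that mark the start of a URL. Matching is case-sensitive. */
static const char *const prefixes[] = { "http://", "https://", "www." };

/* Hypothetical helper: find the next URL at or after s.
   Returns a pointer to its first character and stores its length in *len,
   or returns NULL when no prefix is found. */
static const char *next_url(const char *s, size_t *len)
{
    for (; *s; s++) {
        for (size_t i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
            size_t n = strlen(prefixes[i]);
            if (strncmp(s, prefixes[i], n) == 0) {
                const char *end = s + n;
                /* Extend over every character that is legal in a URL. */
                while (*end && strchr(legal_chars, tolower((unsigned char)*end)))
                    end++;
                *len = (size_t)(end - s);
                return s;
            }
        }
    }
    return NULL;
}

/* Print every URL found in s, one per line. */
void print_urls(const char *s)
{
    size_t len;
    while ((s = next_url(s, &len)) != NULL) {
        printf("%.*s\n", (int)len, s);
        s += len;   /* continue scanning after this URL */
    }
}
```

Because the scan skips past each match, the "www." inside "http://www.google.com" is not reported a second time. Note that this sketch also matches "www." in the middle of a word and keeps trailing punctuation such as a final period, so it may need tightening for real input.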

  2. #2
    Registered User
    Join Date
    Aug 2007
    Posts
    32
    Sorry for double posting, but since I wrote something new, I wanted to show you.


    Code:
    #include "string.h"
    #include "stdio.h"
    #define BAD(x) (!(x) || (*(x) == '\0'))
    
    static char *extract_link(char string[])
    {
            static char url[256];
            int length;
            char *st;
            char *rl;
            url[0] = '\0';
            rl = strstr(string, "http://");
            if (BAD(rl))
                    goto www;
            if (!BAD(rl) && strchr(rl, ' '))
            {
                    st = strstr(rl, " ");
                    length = strlen(rl) - strlen(st);
                    strncpy(url, rl, length);
                    return url;
            }
            else
                    return rl;
            www:
                    rl = strstr(string, "www.");
                    if (BAD(rl))
                            return NULL;
                    if (!BAD(rl) && strchr(rl, ' '))
                    {
                            st = strstr(rl, " ");
                            length = strlen(rl) - strlen(st);
                            strncpy(url, rl, length);
                            return url;
                    }
                    else
                            return rl;
    }
    int main()
    {
            char string1[] = "We have a http://www.google.com here";
            char string2[] = "We have a www.google.com/blabla here whatsoever";
            char *x = extract_link(string2);
            char *y = extract_link(string2);
            printf("%s\n%s\n", x, y);
            return 0;
    }
    It works with just one string. But if I make another call to the extract_link function with a new string (like above), the URLs get mixed up and are shown as the same.
    How can I fix this problem?

  3. #3
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Yes. Because url[256] is a static buffer inside the function, every call returns a pointer to that same buffer, so the second call overwrites the result of the first.
    You should reserve space for the returned strings:
    declare char x[256], y[256] before you call the function, then call it as
    extract_link(string1, x), with the second parameter used inside the function as the destination for strcpy.
    Oh, and you called extract_link with string2 twice.
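The caller-supplied buffer pattern described above might look roughly like this. The names copy_first_word and demo are hypothetical and stand in for the real extraction logic; the point is only that each result lives in storage owned by the caller.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the text of src up to the first space into out
   (capacity outsize bytes), truncating if needed and always terminating. */
static void copy_first_word(const char *src, char *out, size_t outsize)
{
    size_t len = strcspn(src, " ");   /* length up to the first space or the end */

    if (outsize == 0)
        return;                       /* nowhere to write, not even a terminator */
    if (len >= outsize)
        len = outsize - 1;            /* leave room for the terminator */
    memcpy(out, src, len);
    out[len] = '\0';
}

/* Each result goes into its own buffer, so the second call
   cannot disturb the first result. */
void demo(void)
{
    char x[256], y[256];

    copy_first_word("http://www.google.com here", x, sizeof x);
    copy_first_word("www.google.com/blabla here", y, sizeof y);
    printf("%s\n%s\n", x, y);
}
```

Passing the buffer size along with the buffer is a design choice of this sketch, not something the post asks for; it lets the function truncate instead of writing past the end.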

  4. #4
    Registered User
    Join Date
    Aug 2007
    Posts
    32
    Thank you. (using the string 'string2' twice was my mistake.)

    I changed the code as follows:

    Code:
    #include "string.h"
    #include "stdio.h"
    #define BAD(x) (!(x) || (*(x) == '\0'))
    
    static char *extract_link(char *string, char url[512])
    {
            int length;
            char *st;
            char *rl;
            url[0] = '\0';
            rl = strstr(string, "http://");
            if (BAD(rl))
                    goto www;
            if (!BAD(rl) && strchr(rl, ' '))
            {
                    st = strstr(rl, " ");
                    length = strlen(rl) - strlen(st);
                    strncpy(url, rl, length);
                    return url;
            }
            else
                    return rl;
            www:
                    rl = strstr(string, "www.");
                    if (BAD(rl))
                            return NULL;
                    if (!BAD(rl) && strchr(rl, ' '))
                    {
                            st = strstr(rl, " ");
                            length = strlen(rl) - strlen(st);
                            strncpy(url, rl, length);
                            return url;
                    }
                    else
                            return rl;
    }
    int main()
    {
            char x[256], y[256];
            char *string1 = "We have a http://www.google.com here and there";
            char *string2 = "We have a www.google.com/blabla here whatsoever";
            extract_link(string1, x);
            extract_link(string2, y);
            printf("%s\n%s\n", x, y);
            return 0;
    }
    Still to no avail.

    Output:

  5. #5
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Proper strings have a trailing null.
    After each of the two strncpy() calls, add url[length] = '\0';
    The function no longer needs to return anything so get rid of the returns. Also make its header
    static void extract_link(char *string, char *url)
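Putting those suggestions together, a minimal sketch of the function could look like the following. The header is the one given above; URL_SIZE and the truncation check are assumptions added here, based on the 256-byte buffers declared in main() in the previous post.

```c
#include <string.h>

#define URL_SIZE 256   /* assumed size of the caller's buffer */

/* Copies the first "http://" link found in string into url, or the first
   "www." link if there is none. url is left empty when no link is found. */
static void extract_link(char *string, char *url)
{
    char *rl;
    size_t length;

    url[0] = '\0';
    rl = strstr(string, "http://");
    if (rl == NULL)
        rl = strstr(string, "www.");
    if (rl == NULL)
        return;

    length = strcspn(rl, " ");      /* link ends at the first space or at the end */
    if (length > URL_SIZE - 1)
        length = URL_SIZE - 1;      /* truncate rather than overflow the buffer */
    strncpy(url, rl, length);
    url[length] = '\0';             /* strncpy adds no terminator when it stops at length */
}
```

With the two test strings from the previous post, x should end up holding "http://www.google.com" and y "www.google.com/blabla". Using strcspn also covers the case where the link is the last thing in the string, which the earlier versions handled by returning rl directly.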
    Last edited by nonoob; 04-07-2011 at 01:14 PM.

  6. #6
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Because you are not allocating memory for your strings.

    What happens to rl and url when your function returns?

    Code:
    char link[512];
    
    // function
    void ExtractUrl(char *in, char *out)
      { 
         // figure out the url from the in variable
    
         strncpy(out,url,511);   // copy and return
         out[511] = '\0';        // strncpy does not terminate a truncated copy
    }

  7. #7
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Woah, late to the party, CommonTater.
