Thread: Any regex people here?

  1. #1
    Registered User
    Join Date
    Jul 2008
    Posts
    17

    Any regex people here?

    I want to make a list of all links from an index.html page. All I want is the URL, and I'll live with the assumption that there are double quotes around it, for now. Here's what I have so far:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <regex.h>
    
    #define MAX_STRING_SIZE 1024
    
    int main() {
      int rc;
      regex_t * myregex = calloc(1, sizeof(regex_t));
      regmatch_t matches[3];
      FILE *fp;
      char line[MAX_STRING_SIZE];
      
      if(myregex == NULL)
         return 1;
    
      fp = fopen("/var/tmp/index.html", "r");
      rc = regcomp(myregex, "href\\s*=\\s*(\")*(.*?\")([^\"]+)", REG_EXTENDED);
      while(fgets(line, MAX_STRING_SIZE, fp) != NULL) {
         if(regexec(myregex, line, 3, matches, 0) == 0) {
            printf("String: %s\n", line + matches[2].rm_so);
         }
      }
      free(myregex);
      return 0;
    }
    My output shows me the URL, but it also shows me the rest of the line of text as well. I keep trying to tell it (I think) to stop after the first double quote after the URL, but no matter what I keep getting the rest of the line.

    Thank you in advance for helping out.

  2. #2
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    What's the last group for?
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  3. #3
    Registered User
    Join Date
    Jul 2008
    Posts
    17
    It's there for lack of anything else to try; I'd tried everything else I could possibly think of.

    Really, I've stumbled so many times with regex. I've read and reread books and examples, and every time I come across a situation where regexes seem like the best strategy, I create them very carefully and with a lot of research, and then they just never work.

    Any suggestions are so welcome.

  4. #4
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    "Sometimes when people have a problem, they think, 'I know, I'll use a regular expression.' Now they have two problems."

    Can't remember where I got that quote from.

    Anyway, this Perl syntax regex should pick up link HREFs. I'll leave the C string escaping to you.
    Code:
    href\s*=\s*['"]([^"']*)
    I don't know about the POSIX regex API, since I've never used it.

  5. #5
    Registered User
    Join Date
    Jul 2008
    Posts
    17
    I tested that with Perl and yes it does work, however the POSIX API must be different because it doesn't work in C.

    And I like your quote.

    Is there any other method you'd use to scan through an html file looking for all links?

    Thank you for your help.

  6. #6
    Registered User
    Join Date
    Dec 2007
    Posts
    2,675
    Try:
    Code:
    href[[:space:]]*=[[:space:]]*['"]([^"']*)
    Also, you probably shouldn't calloc your regex_t struct. Use a stack variable, pass a pointer to it to the functions, and use regfree when you're done:
    Code:
    regex_t re;
    regcomp(&re, ...
    regexec(&re, ...
    regfree(&re);
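
    A minimal sketch of the compile step with a stack-allocated regex_t, assuming you also want to see why a pattern was rejected. The helper name compile_or_report is hypothetical; regcomp, regerror and regfree are the standard <regex.h> calls.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compiles `pattern` as a POSIX extended regex into *re.
 * Returns 0 on success; on failure prints the reason to stderr and returns
 * the nonzero regcomp error code. */
static int compile_or_report(regex_t *re, const char *pattern)
{
   char msg[128];
   int rc;

   assert(re != NULL && pattern != NULL);

   rc = regcomp(re, pattern, REG_EXTENDED);
   if (rc != 0) {
      /* regerror turns the error code into a readable message. */
      regerror(rc, re, msg, sizeof msg);
      fprintf(stderr, "regcomp failed: %s\n", msg);
   }
   return rc;
}
```

    The caller declares `regex_t re;` on the stack, passes `&re` in, and calls `regfree(&re)` only after a successful compile.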

  7. #7
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    I'd whack up a simple HTML parser, or use one of the hundreds out there...

    Or simply search for <a ...></a>

  8. #8
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    "Is there any other method you'd use to scan through an html file looking for all links?"
    Search for "href" as text (strstr, for example), then look for the equals sign and the quotes, and gather up what's between the quotes. You should probably remember which quote character it started with and match it at the end, but I don't think quotes are valid within URLs, so I guess it won't make much difference.
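
    A rough sketch of that strstr approach, under a few assumptions: the helper name find_href is made up for illustration, the match is case-sensitive (so "HREF" is skipped), and only spaces and tabs are allowed around the equals sign.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the first quoted href value at or after `s`.
 * On success returns a pointer to the first character of the URL and stores
 * its length in *len; returns NULL when no quoted href is found. */
static const char *find_href(const char *s, size_t *len)
{
   assert(s != NULL && len != NULL);

   while ((s = strstr(s, "href")) != NULL) {
      const char *p = s + 4;              /* skip past "href" */
      p += strspn(p, " \t");              /* optional whitespace */
      if (*p == '=') {
         p++;
         p += strspn(p, " \t");
         if (*p == '"' || *p == '\'') {
            char quote = *p++;            /* remember the opening quote */
            const char *end = strchr(p, quote);
            if (end != NULL) {
               *len = (size_t)(end - p);
               return p;
            }
         }
      }
      s += 4;                             /* not a usable href; keep scanning */
   }
   return NULL;
}
```

    The result is a pointer plus a length rather than a copy, so it can be printed with `printf("%.*s\n", (int)len, url)`. To pick up further links on the same line, call it again starting just past the end of the previous URL.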

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  9. #9
    Registered User
    Join Date
    Jul 2008
    Posts
    17
    rags_to_riches - I tried your regex and it still pulls everything to the end of the line. Should I use some non-greedy qualifier, like "?"?

    I originally thought about using strstr, but thought I should use regex and once and for all learn the darn thing.

    I might go back to strstr with strtok as my method.

    Thank you all for your ideas, comments and suggestions.

    What did we all do before there were forums like this? I know I struggled for weeks before finding answers. This almost seems like cheating. Almost.

    Thank you all.

  10. #10
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Use Boost.Regex (C++) or PCRE (C) to use Perl-syntax regular expressions. It's a lot saner than POSIX syntax, IMO.

  11. #11
    Registered User
    Join Date
    Jul 2008
    Posts
    17
    Thanks CornedBee.

    I'll search for how to use PCRE (C) and see if it's easier to work with than what I'm using now. It's really been frustrating.

  12. #12
    Registered User
    Join Date
    Dec 2007
    Posts
    2,675
    It's because this line:
    Code:
    printf("String: %s\n", line + matches[2].rm_so);
    is wrong. You are not taking into account the end offset of the match (rm_eo). This works:
    Code:
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <regex.h>
    
    #define MAX_STRING_SIZE 1024
    
    int main(void) {
       regex_t myregex;
       regmatch_t matches[2];
       FILE *fp;
       char line[MAX_STRING_SIZE];
       char match[MAX_STRING_SIZE];
    
       /* [[:space:]] is the POSIX character class; group 1 captures the URL. */
       if (regcomp(&myregex,
                   "href[[:space:]]*=[[:space:]]*['\"]([^\"']*)",
                   REG_EXTENDED) != 0)
          return EXIT_FAILURE;
    
       if (NULL != (fp = fopen("./index.html", "r"))) {
          while(fgets(line, MAX_STRING_SIZE, fp) != NULL) {
             if(regexec(&myregex, line, 2, matches, 0) == 0) {
                if (matches[1].rm_so != -1) {
                   /* Copy only the captured span, from rm_so up to rm_eo. */
                   size_t match_len = matches[1].rm_eo - matches[1].rm_so;
                   strncpy(match,
                           &line[matches[1].rm_so],
                           match_len);
                   match[match_len] = '\0';
                   printf("String: %s\n", match);
                }
             }
          }
          fclose(fp);
       }
       /* Safe here: the regex is known to have compiled successfully. */
       regfree(&myregex);
       return 0;
    }
    Last edited by rags_to_riches; 08-20-2008 at 07:26 AM. Reason: File pointer validation and closure

  13. #13
    Registered User
    Join Date
    Jul 2008
    Posts
    17
    rags_to_riches - you are fantastic!

    Thank you for your help. (help nothing. you did it for me)

    I had tried it with the ending offset, but I did my math in reverse, so I got some weird characters all over my screen and knew I had wandered into some deep memory area.

    Hopefully this thread gets indexed on all the search engines because your code will definitely be useful to many, many others.

    Thank you again.

  14. #14
    Registered User
    Join Date
    Jul 2008
    Posts
    17
    Is there a way to grab all hrefs on one line?

    I'm coming across one line of html that has multiple hrefs on it and this is only collecting the first one.

    Any ideas?

    It's worth $100 for the help. (not to me, to you)

  15. #15
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    It's simple. When you find a match on a line, take the substring starting at the end offset and try again, until you no longer find matches.
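
    A sketch of that loop, assuming a pattern compiled with REG_EXTENDED that has one capture group around the URL, as in the code earlier in the thread. The helper name print_hrefs is hypothetical. The point to watch is that the offsets regexec reports are relative to the string passed in, not to the start of the line.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: prints every captured href value found in `line`
 * and returns how many were found. `re` must already be compiled with
 * REG_EXTENDED and have one capture group around the URL. */
static int print_hrefs(const regex_t *re, const char *line)
{
   regmatch_t m[2];
   const char *cursor = line;
   int count = 0;

   assert(re != NULL && line != NULL);

   while (regexec(re, cursor, 2, m, 0) == 0) {
      if (m[1].rm_so != -1) {
         /* Offsets are relative to cursor, not to the start of line. */
         int len = (int)(m[1].rm_eo - m[1].rm_so);
         printf("String: %.*s\n", len, cursor + m[1].rm_so);
         count++;
      }
      if (m[0].rm_eo == 0)
         break;               /* guard against an empty match looping forever */
      cursor += m[0].rm_eo;   /* resume right after the whole match */
   }
   return count;
}
```

    Printing with `%.*s` avoids the copy into a second buffer. Inside the fgets loop, the call would be `print_hrefs(&myregex, line);`.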
