C Board  

Old 08-20-2008, 01:37 AM   #1
Registered User
 
Join Date: Jul 2008
Posts: 17
Any regex people here?

I want to make a list of all links from an index.html page. All I want is the URL, and I'll live with the assumption that there are double quotes around it, for now. Here's what I have so far:

Code:
#include <stdlib.h>
#include <string.h>
#include <regex.h>

#define MAX_STRING_SIZE 1024

int main() {
  int rc;
  regex_t * myregex = calloc(1, sizeof(regex_t));
  regmatch_t matches[3];
  FILE *fp;
  char line[MAX_STRING_SIZE];
  
  if(myregex == NULL)
     return 1;

  fp = fopen("/var/tmp/index.html", "r");
  rc = regcomp(myregex, "href\\s*=\\s*(\")*(.*?\")([^\"]+)", REG_EXTENDED);
  while(fgets(line, MAX_STRING_SIZE, fp) != NULL) {
     if(regexec(myregex, line, 3, matches, 0) == 0) {
        printf("String: %s\n", line + matches[2].rm_so);
     }
  }
  free(myregex);
  return 0;
}
My output shows me the URL, but it also shows me the rest of the line of text as well. I keep trying to tell it (I think) to stop after the first double quote after the URL, but no matter what I keep getting the rest of the line.

Thank you in advance for helping out.
Posted by traef06
Old 08-20-2008, 05:00 AM   #2
Cat without Hat
 
Join Date: Apr 2003
Posts: 8,492
What's the last group for?
__________________
All the buzzt!
CornedBee

"There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
- Flon's Law
Posted by CornedBee
Old 08-20-2008, 05:31 AM   #3
Registered User
 
Join Date: Jul 2008
Posts: 17
It's left over from trying everything else I could possibly think of.

Really, I've stumbled so many times with regex. I've read and reread books and examples, and every time I come across a situation where regexes seem like the best strategy, I create them very carefully and with a lot of research, and then they just never work.

Any suggestions are so welcome.
Posted by traef06
Old 08-20-2008, 06:13 AM   #4
Cat without Hat
 
Join Date: Apr 2003
Posts: 8,492
"Sometimes when people have a problem, they think, 'I know, I'll use a regular expression.' Now they have two problems."

Can't remember where I got that quote from.

Anyway, this Perl syntax regex should pick up link HREFs. I'll leave the C string escaping to you.
Code:
href\s*=\s*['"]([^"']*)
I don't know about the POSIX regex API, since I've never used it.
Posted by CornedBee
Old 08-20-2008, 06:29 AM   #5
Registered User
 
Join Date: Jul 2008
Posts: 17
I tested that with Perl and yes, it does work. However, the POSIX API must be different, because it doesn't work in C.

And I like your quote.

Is there any other method you'd use to scan through an html file looking for all links?

Thank you for your help.
Posted by traef06
Old 08-20-2008, 06:36 AM   #6
and the Hat of Ass
 
Join Date: Dec 2007
Posts: 811
Try:
Code:
href[[:space:]]*=[[:space:]]*['"]([^"']*)
Also, you probably shouldn't calloc your regex_t struct. Use a stack variable, pass a pointer to it to the functions, and use regfree when you're done:
Code:
regex_t re;
regcomp(&re, ...
regexec(&re, ...
regfree(&re);
Posted by rags_to_riches
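
A note on the character class used in this reply: in POSIX syntax, a class name such as [:space:] is only recognized inside a bracket expression, so the whitespace class is written [[:space:]]. A lone [:space:] is an ordinary set of the literal characters ':', 's', 'p', 'a', 'c' and 'e'. The sketch below is one way to check this; ere_matches is a hypothetical helper name, not something from the thread, and a pattern that fails to compile is simply treated as "no match".

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): returns 1 if `pattern`,
   read as a POSIX extended regex, matches anywhere in `text`.
   Returns 0 on no match or if the pattern does not compile. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int found;

    /* REG_NOSUB: we only need a yes/no answer, not the match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

Calling it on a line such as `<a href = "x.html">` with the double-bracket pattern should report a match, while the single-bracket pattern should not, because a space is not one of the characters in the set. Both forms match when there is no whitespace around the equals sign, which is why the mistake is easy to miss.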
Old 08-20-2008, 06:37 AM   #7
Woof, woof!
 
Join Date: Mar 2007
Location: Australia
Posts: 3,295
I'd whack up a simple HTML parser, or use one of the hundreds out there...

Or simply search for <a ...></a>
__________________
"I.T. gets the chicky-babes" - M. Kelly
bakefile | vim
Posted by zacs7
Old 08-20-2008, 06:38 AM   #8
Kernel hacker
 
Join Date: Jul 2007
Location: Farncombe, Surrey, England
Posts: 15,686
Quote:
Is there any other method you'd use to scan through an html file looking for all links?
Search for "href" as text (strstr, for example), then look for the equals sign and the quotes, and gather up what's between the quotes [you should probably remember which quote character it started with and match it at the end, but I don't think quotes are valid within URLs, so I guess it won't make much difference].

--
Mats
__________________
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
Posted by matsp
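
The strstr approach described in the reply above could be sketched roughly as follows. This is only an illustration under stated assumptions: next_href is a hypothetical helper name, the search is case-sensitive (so "HREF" is not found), unquoted values are skipped, and the output buffer must hold at least one byte.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): copies the first quoted
   href value found in `line` into `out`, truncated to outsz-1 chars.
   Returns a pointer just past the closing quote so the caller can keep
   scanning the same line, or NULL when no further value is found. */
static const char *next_href(const char *line, char *out, size_t outsz)
{
    const char *p = line;

    if (outsz == 0)
        return NULL;
    while ((p = strstr(p, "href")) != NULL) {
        const char *q = p + 4;
        p = q;                          /* resume here if this hit is rejected */
        while (isspace((unsigned char)*q)) q++;
        if (*q != '=') continue;
        q++;
        while (isspace((unsigned char)*q)) q++;
        if (*q != '"' && *q != '\'') continue;
        char quote = *q++;              /* remember which quote opened the value */
        const char *end = strchr(q, quote);
        if (end == NULL)
            return NULL;                /* unterminated value */
        size_t len = (size_t)(end - q);
        if (len >= outsz)
            len = outsz - 1;
        memcpy(out, q, len);
        out[len] = '\0';
        return end + 1;
    }
    return NULL;
}
```

Because the function returns the position after the closing quote, calling it again with that pointer picks up the next link on the same line; the loop ends when it returns NULL.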
Old 08-20-2008, 06:49 AM   #9
Registered User
 
Join Date: Jul 2008
Posts: 17
rags_to_riches - I tried your regex and it still pulls everything to the end of the line. Should I use some non-greedy quantifier, like "?"?

I originally thought about using strstr, but thought I should use regex and once and for all learn the darn thing.

I might go back to strstr with strtok as my method.

Thank you all for your ideas, comments and suggestions.

What did we all do before there were forums like this? I know I struggled for weeks before finding answers. This almost seems like cheating. Almost.

Thank you all.
Posted by traef06
Old 08-20-2008, 06:59 AM   #10
Cat without Hat
 
Join Date: Apr 2003
Posts: 8,492
Use Boost.Regex (C++) or PCRE (C) to use Perl-syntax regular expressions. It's a lot saner than POSIX syntax, IMO.
Posted by CornedBee
Old 08-20-2008, 07:07 AM   #11
Registered User
 
Join Date: Jul 2008
Posts: 17
Thanks CornedBee.

I'll search for how to use PCRE (C) and see if it's easier to work with than what I'm using now. It's really been frustrating.
Posted by traef06
Old 08-20-2008, 07:14 AM   #12
and the Hat of Ass
 
Join Date: Dec 2007
Posts: 811
It's because this line:
Code:
printf("String: %s\n", line + matches[2].rm_so);
is wrong. You are not taking into account the end offset of the match (rm_eo). This works:
Code:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <regex.h>

#define MAX_STRING_SIZE 1024

int main() {
   int rc;
   regex_t myregex;
   regmatch_t matches[2];
   FILE *fp;
   char line[MAX_STRING_SIZE];
   char match[MAX_STRING_SIZE];

   /* POSIX character classes only work inside a bracket expression,
      hence [[:space:]]. Compile first so regfree() is always valid. */
   rc = regcomp(&myregex,
                "href[[:space:]]*=[[:space:]]*['\"]([^\"']*)",
                REG_EXTENDED);
   if (rc != 0)
      return 1;

   if (NULL != (fp = fopen("./index.html", "r"))) {
      while(fgets(line, MAX_STRING_SIZE, fp) != NULL) {
         if(regexec(&myregex, line, 2, matches, 0) == 0) {
            if (matches[1].rm_so != -1) {
               /* Copy only group 1: from rm_so up to, not including, rm_eo */
               size_t match_len = matches[1].rm_eo - matches[1].rm_so;
               strncpy(match,
                       &line[matches[1].rm_so],
                       match_len);
               match[match_len] = '\0';
               printf("String: %s\n", match);
            }
         }
      }
      fclose(fp);
   }
   regfree(&myregex);
   return 0;
}

Last edited by rags_to_riches; 08-20-2008 at 07:26 AM. Reason: File pointer validation and closure
Posted by rags_to_riches
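
As a side note to the rm_so / rm_eo explanation above, the copy into a second buffer can be avoided by giving printf a precision with %.*s, which limits how many characters of the string are printed. A minimal sketch, assuming a hypothetical helper named print_group that is not part of the thread:

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not from the thread): prints one capture group
   of `line` in place. Returns printf's result, or -1 if the group did
   not take part in the match. */
static int print_group(const char *line, const regmatch_t *m)
{
    if (m->rm_so == -1)
        return -1;
    /* %.*s prints at most `len` characters starting at line + rm_so,
       so the output stops at rm_eo instead of running to end of line. */
    int len = (int)(m->rm_eo - m->rm_so);
    return printf("String: %.*s\n", len, line + m->rm_so);
}
```

With the program above, this would be called as print_group(line, &matches[1]) after a successful regexec.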
Old 08-20-2008, 07:38 AM   #13
Registered User
 
Join Date: Jul 2008
Posts: 17
rags_to_riches - you are fantastic!

Thank you for your help. (help nothing. you did it for me)

I had tried it with the ending offset, but I did my math in reverse, so I got weird characters all over my screen and knew I had wandered into some deep memory area.

Hopefully this thread gets indexed on all the search engines because your code will definitely be useful to many, many others.

Thank you again.
Posted by traef06
Old 08-21-2008, 11:21 AM   #14
Registered User
 
Join Date: Jul 2008
Posts: 17
Is there a way to grab all hrefs on one line?

I'm coming across one line of HTML that has multiple hrefs on it, and this is only collecting the first one.

Any ideas?

It's worth $100 for the help. (not to me, to you)
Posted by traef06
Old 08-21-2008, 11:31 AM   #15
Cat without Hat
 
Join Date: Apr 2003
Posts: 8,492
It's simple. When you find a match on a line, take the substring starting at the end offset and try again, until you no longer find matches.
Posted by CornedBee
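
The "start again from the end offset" idea in the last reply might look something like this with the POSIX API. It is a sketch under assumptions: print_all_hrefs is a hypothetical helper name, and the regex passed in is expected to be compiled with REG_EXTENDED and to capture the URL in group 1, as in the program earlier in the thread.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not from the thread): prints every href value
   on `line` and returns how many were found. */
static int print_all_hrefs(const regex_t *re, const char *line)
{
    regmatch_t m[2];
    const char *p = line;
    int count = 0;

    while (regexec(re, p, 2, m, 0) == 0) {
        if (m[1].rm_so != -1) {
            /* Offsets are relative to p, not to the start of the line. */
            printf("String: %.*s\n",
                   (int)(m[1].rm_eo - m[1].rm_so), p + m[1].rm_so);
            count++;
        }
        if (m[0].rm_eo == 0)
            break;          /* guard: an empty match at p would never advance */
        p += m[0].rm_eo;    /* continue just past the whole match */
    }
    return count;
}
```

Inside the fgets loop, the single regexec call would be replaced by print_all_hrefs(&myregex, line). If the pattern ever used a ^ anchor, the repeated searches would probably want the REG_NOTBOL flag, since p is no longer the true start of the line.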