| | #1 |
| Registered User Join Date: Jul 2008
Posts: 17
| Any regex people here? Code: #include <stdlib.h>
#include <string.h>
#include <regex.h>
#define MAX_STRING_SIZE 1024
int main() {
int rc;
regex_t * myregex = calloc(1, sizeof(regex_t));
regmatch_t matches[3];
FILE *fp;
char line[MAX_STRING_SIZE];
if(myregex == NULL)
return 1;
fp = fopen("/var/tmp/index.html", "r");
rc = regcomp(myregex, "href\\s*=\\s*(\")*(.*?\")([^\"]+)", REG_EXTENDED);
while(fgets(line, MAX_STRING_SIZE, fp) != NULL) {
if(regexec(myregex, line, 3, matches, 0) == 0) {
printf("String: %s\n", line + matches[2].rm_so);
}
}
free(myregex);
return 0;
}
Thank you in advance for helping out. |
| traef06 is offline | |
| | #2 |
| Cat without Hat Join Date: Apr 2003
Posts: 8,492
| What's the last group for?
__________________ All the buzzt! CornedBee "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code." - Flon's Law |
| CornedBee is offline | |
| | #3 |
| Registered User Join Date: Jul 2008
Posts: 17
| It's there because I had tried everything else I could possibly think of. Really, I've stumbled so many times with regex. I've read and reread books and examples, and every time I come across a situation where regexes seem like the best strategy, I create them very carefully and with a lot of research, and then they just never work. Any suggestions are so welcome. |
| traef06 is offline | |
| | #4 |
| Cat without Hat Join Date: Apr 2003
Posts: 8,492
| "Sometimes when people have a problem, they think, 'I know, I'll use a regular expression.' Now they have two problems." Can't remember where I got that quote from. Anyway, this Perl syntax regex should pick up link HREFs. I'll leave the C string escaping to you. Code: href\s*=\s*['"]([^"']*) |
| CornedBee is offline | |
| | #5 |
| Registered User Join Date: Jul 2008
Posts: 17
| I tested that with Perl and yes it does work, however the POSIX API must be different because it doesn't work in C. And I like your quote. Is there any other method you'd use to scan through an html file looking for all links? Thank you for your help. |
| traef06 is offline | |
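On the POSIX difference raised in #5: `\s` and the lazy `.*?` are Perl syntax that POSIX extended regular expressions do not define. Some C libraries accept `\s` as an extension, but that is not something to rely on. Below is a minimal sketch of the pattern from #4 written as a C string literal with the portable `[[:space:]]` class. The names `HREF_PATTERN` and `compile_href_regex` are illustrative assumptions, not anything from the thread.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Portable ERE form of  href\s*=\s*['"]([^"']*)  as a C string literal.
 * Only the double quotes need a backslash inside the C string. */
#define HREF_PATTERN "href[[:space:]]*=[[:space:]]*[\"']([^\"']*)"

/* Hypothetical helper: compiles HREF_PATTERN into *re.
 * Returns 0 on success; on failure prints the regerror() text to stderr
 * and returns the code reported by regcomp(). */
static int compile_href_regex(regex_t *re)
{
    assert(re != NULL);

    int rc = regcomp(re, HREF_PATTERN, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, re, msg, sizeof msg);
        fprintf(stderr, "regcomp failed: %s\n", msg);
    }
    return rc;
}
```

After a successful call, capture group 1 (`matches[1]`) holds the offsets of the URL, and the caller is expected to call `regfree()` when done. Checking the return value matters here because a pattern that fails to compile otherwise just looks like "no matches".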
| | #6 |
| and the Hat of Ass Join Date: Dec 2007
Posts: 811
| Try: Code: href[[:space:]]*=[[:space:]]*['"]([^"']*) Code: regex_t re; regcomp(&re, ... regexec(&re, ... regfree(&re); |
| rags_to_riches is offline | |
| | #8 | |
| Kernel hacker Join Date: Jul 2007 Location: Farncombe, Surrey, England
Posts: 15,686
| Quote:
-- Mats
__________________ Compilers can produce warnings - make the compiler programmers happy: Use them! Please don't PM me for help - and no, I don't do help over instant messengers. | |
| matsp is offline | |
| | #9 |
| Registered User Join Date: Jul 2008
Posts: 17
| rags_to_riches - I tried your regex and it still pulls everything to the end of the line. Should I use a non-greedy qualifier, like "?"? I originally thought about using strstr, but thought I should use regex and once and for all learn the darn thing. I might go back to strstr with strtok as my method. Thank you all for your ideas, comments and suggestions. What did we all do before there were forums like this? I know I struggled for weeks before finding answers. This almost seems like cheating. Almost. Thank you all. |
| traef06 is offline | |
| | #10 |
| Cat without Hat Join Date: Apr 2003
Posts: 8,492
| Use Boost.Regex (C++) or PCRE (C) to use Perl-syntax regular expressions. It's a lot saner than POSIX syntax, IMO. |
| CornedBee is offline | |
| | #11 |
| Registered User Join Date: Jul 2008
Posts: 17
| Thanks CornedBee. I'll search for how to use PCRE(C) and see if it works easier than what I'm using now. It's really been frustrating. |
| traef06 is offline | |
| | #12 |
| and the Hat of Ass Join Date: Dec 2007
Posts: 811
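| It's because this line: Code: printf("String: %s\n", line + matches[2].rm_so);
prints from the start of the match all the way to the end of the line: %s keeps going until the terminating NUL, and the end offset (rm_eo) is never used. Copy only the matched span instead: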
Code: #include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <regex.h>
#define MAX_STRING_SIZE 1024
int main(void) {
    int rc;
    regex_t myregex;
    regmatch_t matches[2];
    FILE *fp;
    char line[MAX_STRING_SIZE];
    char match[MAX_STRING_SIZE];
    /* [[:space:]] is the POSIX whitespace class; the URL is capture group 1 */
    rc = regcomp(&myregex,
                 "href[[:space:]]*=[[:space:]]*['\"]([^\"']*)",
                 REG_EXTENDED);
    if (rc != 0)
        return 1;
    if (NULL != (fp = fopen("./index.html", "r"))) {
        while (fgets(line, MAX_STRING_SIZE, fp) != NULL) {
            if (regexec(&myregex, line, 2, matches, 0) == 0) {
                if (matches[1].rm_so != -1) {
                    /* length of the match is end offset minus start offset */
                    size_t match_len = matches[1].rm_eo - matches[1].rm_so;
                    strncpy(match,
                            &line[matches[1].rm_so],
                            match_len);
                    match[match_len] = '\0';
                    printf("String: %s\n", match);
                }
            }
        }
        fclose(fp);
    }
    /* safe here because regcomp succeeded above */
    regfree(&myregex);
    return 0;
}
Last edited by rags_to_riches; 08-20-2008 at 07:26 AM. Reason: File pointer validation and closure |
| rags_to_riches is offline | |
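The offset arithmetic in #12 is the part that is easy to get backwards, so here is a small bounds-checked sketch of the same copy step. `copy_match` is a hypothetical helper name, and the truncation behaviour is an assumption of this sketch rather than something the thread specifies.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies the text of one match into dst,
 * NUL-terminated and truncated to fit dstsize bytes.
 * Returns the number of bytes copied, or -1 if the group did not
 * take part in the match (rm_so == -1). */
static long copy_match(char *dst, size_t dstsize,
                       const char *subject, const regmatch_t *m)
{
    assert(dst != NULL && dstsize > 0 && subject != NULL && m != NULL);

    if (m->rm_so == -1)
        return -1;

    /* rm_so is the offset of the first byte of the match and rm_eo is
     * one past the last byte, so the length is end minus start. */
    size_t len = (size_t)(m->rm_eo - m->rm_so);
    if (len >= dstsize)
        len = dstsize - 1;      /* leave room for the terminator */

    memcpy(dst, subject + m->rm_so, len);
    dst[len] = '\0';
    return (long)len;
}
```

With the loop from #12 this would be called as `copy_match(match, sizeof match, line, &matches[1])`. Subtracting the offsets the other way round gives a negative value that turns into a huge `size_t`, which would explain garbage on screen like that described in #13.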
| | #13 |
| Registered User Join Date: Jul 2008
Posts: 17
| rags_to_riches - you are fantastic! Thank you for your help. (help nothing. you did it for me) I had tried it with the ending offset but I did my math in reverse so I got some weird characters all over my screen so I knew I went into some deep memory area. Hopefully this thread gets indexed on all the search engines because your code will definitely be useful to many, many others. Thank you again. |
| traef06 is offline | |
| | #14 |
| Registered User Join Date: Jul 2008
Posts: 17
| Is there a way to grab all hrefs on one line? I'm coming across one line of HTML that has multiple hrefs on it and this is only collecting the first one. Any ideas? It's worth $100 for the help. (not to me, to you) |
| traef06 is offline | |
| | #15 |
| Cat without Hat Join Date: Apr 2003
Posts: 8,492
| It's simple. When you find a match on a line, take the substring starting at the end offset and try again, until you no longer find matches. |
| CornedBee is offline | |
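The approach described in #15 can be sketched roughly as follows. This assumes a `regex_t` already compiled with `REG_EXTENDED` whose capture group 1 is the URL; `print_all_hrefs` is a hypothetical helper name, not code from the thread.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: prints every href value found in `line` and
 * returns how many were found. `re` must already be compiled and must
 * put the URL in capture group 1. */
static int print_all_hrefs(const regex_t *re, const char *line)
{
    assert(re != NULL && line != NULL);

    regmatch_t m[2];
    const char *cursor = line;
    int count = 0;

    /* Offsets in m[] are relative to `cursor`, not to `line`. */
    while (regexec(re, cursor, 2, m, 0) == 0) {
        if (m[1].rm_so != -1) {
            int len = (int)(m[1].rm_eo - m[1].rm_so);
            /* %.*s prints exactly len bytes, so no copy is needed */
            printf("String: %.*s\n", len, cursor + m[1].rm_so);
            count++;
        }
        if (m[0].rm_eo == 0)
            break;              /* guard: an empty match would loop forever */
        cursor += m[0].rm_eo;   /* resume right after the whole match */
    }
    return count;
}
```

Inside the `fgets` loop this would replace the single `regexec` call, for example `print_all_hrefs(&myregex, line)`. The search resumes from the end of the whole match (`m[0]`), not the end of the capture group, so nothing already consumed is scanned again.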