Thread: Need regexp help

  1. #1
    Registered User
    Join Date
    Sep 2008
    Posts
    10

    Need regexp help

    I'm trying to write a small C program that takes a file containing this:

    <mytag> "HELLO" </mytag> asdfasdfadsfadsf <mytag> "WORLD" </mytag>

    ... and fetches HELLO and WORLD using regular expressions.

    I have everything set up in place, this is the regexp I'm using:

    <mytag>.*(HELLO).*</mytag>

    The problem is that this matches the ENTIRE string. So I got the tip to use lazy quantifiers, like so:

    <mytag>.*?(HELLO).*?</mytag>

    which *would* work if the regexp engine were regexp-directed, not text-directed. But sadly, it is not.

    So my question, which I'd be very happy to have answered, is: how do I write a regexp that matches just the first part of the above string, with HELLO as the captured substring?

    Does the question make sense?

    TIA!

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Well, who knows what regexp engine you're using (hint: not us). And does the engine not have documentation?

  3. #3
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300

    Since "HELLO" is really an exact match (unlike "H.?LLO"), you might as well just iterate through your line one character at a time. But I presume your wanted substring is actually something else, or you wouldn't be asking this question -- unless you just want to check if HELLO is present.

    This is more or less what I've been using:
    Code:
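    #include <regex.h>
    #include <stdlib.h>
    #include <string.h>
    
    struct matchspec {
    	int bgn;
    	int end;
    } rgxp;
    
    /* Returns a malloc'd copy of the first match of patrn in string,
       or NULL if there is no match or on error. The caller frees it. */
    char *regexp (const char *string, const char *patrn) {
    	size_t len;
    	char *word = NULL;
    	regex_t rgT;
    	regmatch_t match;
    	if (regcomp(&rgT,patrn,REG_EXTENDED) != 0) return NULL; /* bad pattern */
    	if (regexec(&rgT,string,1,&match,0) == 0) {
    		rgxp.bgn = (int)match.rm_so;
    		rgxp.end = (int)match.rm_eo; /* offset just past the match */
    		len = (size_t)(match.rm_eo - match.rm_so);
    		word = malloc(len+1);
    		if (word != NULL) {
    			memcpy(word, string+rgxp.bgn, len);
    			word[len] = 0; //make sure this string is terminated
    		}
    	}
    	regfree(&rgT);
    	return word;
    }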
    regexp(yourline,"H.*O") (for example) puts the character offsets of the match into the global struct rgxp as rgxp.bgn and rgxp.end. Offsets start at zero, and the "end" value (rm_eo) is the offset of the first character after the match, i.e. the one after the "O". If there is no match, you get a NULL pointer. If there is a match, you get a malloc'd copy of the matched text back (i.e. "HEL+O" could return "HELLO"), which you should free when you're done with it.

    Good luck. nb. this only returns the first match, but qv. the method in my next post for the next match
    Last edited by MK27; 09-04-2008 at 11:02 AM. Reason: correction..
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    What about the input of
    <mytag>
    "HELLO"
    </mytag>

    If it's well-formed XML, there's no reason to assume that everything will be neatly on one line for you to process.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    & the hat of GPL slaying Thantos's Avatar
    Join Date
    Sep 2001
    Posts
    5,681
    First, *? is not useful according to any regexp rules I've ever seen
    *? means that * shouldn't be greedy.

  6. #6
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300

    (re: Salem) okay! parse an .xml into one line, presuming unix linefeeds:

    Code:
    char *linefile (char *file) {
    	size_t len, mem=0;
    	char *cumul, *line = NULL;
    	static char err[]="ERROR";
    	FILE *FST_mine = fopen(file, "r");
    	if (FST_mine == NULL) return line;
    	while ((line = linein(FST_mine)) != NULL) {
    		len = strlen(line);
    		if (mem == 0) { mem = len+1;
    			cumul = malloc(mem);
    			strcpy(cumul,line);
    		}
    		else {	mem += len;
    			cumul = (char *)realloc(cumul,mem);
    			strcat(cumul,line);
    		}
    	}
    	if (fclose(FST_mine) != 0) { puts("fclose fail linefile()");
    		return err;}
    	return cumul;
    }
    char *myline=linefile("myfile.xml");
    Feed that through the above "regexp" and, each time it returns, convert all characters in "myline" up to the last rgxp.end into spaces (remember to collect bgn and end somewhere):
    Code:
    for (i=0; i <rgxp.end; i++) myline[i]=32;
    Then maybe free(myline) and you'll end up with an array (or whatever you did to collect the bgns and ends) consisting of the character positions for HELLO in the entire file.
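
    A possible alternative to blanking out the characters already searched is to restart regexec() just past the previous match and add the offset back in. This is only a sketch: find_all and its parameters are made-up names, not anything from regex.h, and the fixed-size output arrays are an assumption to keep it short.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: stores up to max (start, end) offset pairs for
   each match of pattern in text. Returns the number of matches found,
   or -1 if the pattern does not compile. */
int find_all(const char *text, const char *pattern,
             int starts[], int ends[], int max) {
	regex_t re;
	regmatch_t m;
	int count = 0;
	size_t offset = 0;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;
	/* REG_NOTBOL stops '^' from matching at every restart point */
	while (count < max &&
	       regexec(&re, text + offset, 1, &m, offset ? REG_NOTBOL : 0) == 0) {
		/* rm_so and rm_eo are relative to text + offset */
		starts[count] = (int)(offset + m.rm_so);
		ends[count] = (int)(offset + m.rm_eo);
		count++;
		if (m.rm_eo == m.rm_so) {
			/* empty match: step one character so the loop advances */
			if (text[offset + m.rm_eo] == '\0') break;
			offset += m.rm_eo + 1;
		} else {
			offset += m.rm_eo;
		}
	}
	regfree(&re);
	return count;
}
```

    For example, calling find_all("say HELLO then HELLO again", "HEL+O", starts, ends, 8) should return 2 and fill in the pairs 4-9 and 15-20. The input string is never modified, so it can be const.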

    (re: Thantos) oh!

    For an explanation of "linein" see one of my posts on the next page...sorry...
    Last edited by MK27; 09-05-2008 at 08:44 AM. Reason: linein

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Do you want to know how many bugs that contains?

  8. #8
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300

    "Do you want to know how many bugs that contains?"

    Please please please! I'm not making ANY claims to expertise, but I have been using (more or less*) this method and had no end of success with it (in a text file search tool that gets used all the time without problems), so if you see something that is not right, fire away.

    *eg, normally i error check malloc()

  9. #9
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Well, for one thing, your file parameter should probably be constant. ' ' is also a lot more readable than 32.

    For another, what does linein return? Is it an malloc()'ed value? (If so, why aren't you freeing it?)

    What happens if you read a string that is "ERROR"? Why not return NULL on error? [edit] Like you do when the file couldn't be opened? [/edit]

    What happens if the file was opened, but empty -- i.e., nothing is read at all? (You'll return a random value.)

    I'm sure this isn't an exhaustive list, what with Salem's post.
    Last edited by dwks; 09-04-2008 at 01:57 PM.
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  10. #10
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > cumul = (char *)realloc(cumul,mem);
    Never assign the result of realloc back to the same pointer. If it returns NULL, you've leaked memory.

    p = realloc(cumul,mem);
    if ( p ) cumul = p;
    else ....

    This gives you the opportunity to do something sensible with the memory you still own.

    Oh, and there's no need to cast the result of realloc in C either.

    > static char err[]="ERROR";
    In addition to what dwks mentioned, how would you free this?
    Most of the time, you return a pointer to allocated memory, but this really messes things up.
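
    One way to package the realloc advice so it can't be forgotten is a small wrapper. This is a sketch only: resize_buffer is a made-up name, and it assumes newsize is greater than zero.

```c
#include <stdlib.h>

/* Hypothetical helper: resize *buf to newsize bytes (newsize > 0).
   Returns 0 on success. On failure returns -1 and leaves *buf alone,
   so the caller still owns the old block and can free it. */
int resize_buffer(char **buf, size_t newsize) {
	char *p = realloc(*buf, newsize);   /* no cast needed in C */
	if (p == NULL) return -1;           /* old block is still valid */
	*buf = p;
	return 0;
}
```

    The caller then decides what to do on failure, typically free the old buffer and return NULL, which also keeps every return value something that is safe to pass to free().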

  11. #11
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Also, a few more suggestions:
    Code:
    		else {	mem += len;
    			cumul = (char *)realloc(cumul,mem);
    			strcat(cumul,line);
    		}
    Why make strcat() traverse the line when you already know where the NULL terminator is?

    Code:
    puts("fclose fail linefile()");
    That's what perror() is for. Oh, and errors are often reported in the format "linefile(): fclose failed".

    Also, I don't know what "unix linefeeds" has to do with this. Since you open the file in "r" mode, a.k.a. "rt" (text) mode, on any platform you should get newline sequences of plain "\n", assuming the file was in the correct format for that platform in the first place.
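
    Putting the suggestions so far together, linefile() might end up looking something like the sketch below. It is not MK27's code: since linein() isn't shown, it reads with fgets() into a fixed-size chunk instead, and it assumes the newlines should simply be dropped. It returns NULL on every error, returns an empty string for an empty file, and appends with memcpy() at the known end of the buffer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: read a whole text file into one malloc'd string with the
   newlines dropped. Returns NULL on any error; an empty file gives "".
   The caller frees the result. */
char *linefile(const char *file) {
	char chunk[256];
	char *cumul, *p;
	size_t used = 0, len;
	FILE *fp = fopen(file, "r");

	if (fp == NULL) { perror("linefile(): fopen failed"); return NULL; }
	cumul = malloc(1);
	if (cumul == NULL) { fclose(fp); return NULL; }
	cumul[0] = '\0';
	while (fgets(chunk, sizeof chunk, fp) != NULL) {
		len = strcspn(chunk, "\n");        /* length without the newline */
		p = realloc(cumul, used + len + 1);
		if (p == NULL) { free(cumul); fclose(fp); return NULL; }
		cumul = p;
		memcpy(cumul + used, chunk, len);  /* append at the known end */
		used += len;
		cumul[used] = '\0';
	}
	if (ferror(fp)) {
		perror("linefile(): read failed");
		free(cumul); fclose(fp);
		return NULL;
	}
	if (fclose(fp) != 0) {
		perror("linefile(): fclose failed");
		free(cumul);
		return NULL;
	}
	return cumul;
}
```

    Lines longer than the chunk are just read in several pieces, which works out the same because the pieces are concatenated anyway.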

  12. #12
    Registered User
    Join Date
    Oct 2001
    Posts
    2,129
    Originally posted by dwks:
    > "r" mode, a.k.a. "rt" (text) mode
    There is no "rt" mode in standard C.

  13. #13
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Yes, I know, we had a discussion about that already. I guess I shouldn't have mentioned it. I just meant it in contrast to "rb", that's all.

  14. #14
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300

    pick pick pick!

    Anyway, I appreciate it BUT/AND:

    > What happens if you read a string that is "ERROR"? Why not return NULL on error? [edit] Like you do when the file couldn't be opened?
    Well, failing to open a file and failing to close it aren't the same, are they?

    > What happens if the file was opened, but empty -- i.e., nothing is read at all? (You'll return a random value.)
    Huh...I hadn't tried that...I imagine declaring cumul as NULL initially would prevent that.

    > what does linein return? Is it an malloc()'ed value? (If so, why aren't you freeing it?)
    There is something I don't quite understand here. "linein" obviously returns a char pointer, so to me it would seem the malloc'n'freeing would/could go on OUTSIDE the function. In practice I just use the pointer, which is local to another function and freed with it I'm told...

    > Never assign the result of realloc back to the same pointer. If it returns NULL, you've leaked memory.
    Huh again. Is that w/r/t realloc in particular, or all functions in general?

    > Most of the time, you return a pointer to allocated memory, but this really messes things up.
    Now we are solidly in "I'm not sure what you mean here" land.

    > Why make strcat() traverse the line when you already know where the NULL terminator is?
    By NULL terminator you mean "\0"? What is my other option (working character by character from mem to mem+len)? In that case, why does strcat even exist?

    > static char err[]="ERROR";
    > In addition to what dwks mentioned, how would you free this?

    I wouldn't...

    > That's what perror() is for. Oh, and errors are often reported in the format "linefile(): fclose failed".
    I'm going straight home to look up perror now. b/t/w you should have seen what it said before I made it "web friendly"...

    > I don't know what "unix linefeeds" has to do with this.
    Sorry, I use linux and am new to C so I hadn't noticed that CR/LF (or whatever) is automatically "\n"

    Thanks again.
    ps. garcon (if you ever return) despite the issues, it still works like a charm
    pps. I like "32" because it's also the number you add to 'A' or 'B' or 'C' to make 'a' and 'b' and 'c'...maybe i should turn this off...
    Last edited by MK27; 09-04-2008 at 05:14 PM. Reason: pps.

  15. #15
    Registered User
    Join Date
    Sep 2008
    Posts
    10
    Originally posted by tabstop:
    > Well, who knows what regexp engine you're using (hint: not us). And does the engine not have documentation?
    Well, I don't really know what it is called. I'm writing a C program on a Linux box, and all I did was to include regex.h and ldd tells me that the only shared lib that the binary loads is libc. Is this what you call POSIX regexp engine, perhaps?
