Thread: <regex.h> regex syntax in C

  1. #1
    Registered User
    Join Date
    Mar 2004
    Posts
    4

    Hi there,

    I have a regular expression (regex) and would like to understand exactly what it means in the context of the libc regex library, regex.h, in C. I run software that generates regular expressions more or less at random, but according to a grammar, and it generated one in particular that I'm very interested in. I have worked through an extensive regex tutorial and read the POSIX 1003.2 specification for regexes, but I am still unsure what it means.

    The regex is
    Code:
    [A[ACGTN][ACGTN]G]*
    It compiles and executes without error. I thought that nested character classes were not allowed, and yet this was executed.

    My interpretation so far is as follows:
    one single character which could be A, [, A, C, G, T, or N,
    followed by one single character which could be A, C, G, T, or N,
    followed by the character 'G',
    followed by zero or more repetitions of the character ']'.

    Does anyone know whether I've got it right or wrong? I'd appreciate any help anyone can give on this. (I know it isn't the most elegant regex, but I still need to know what it means.)

    many thanks,
    simon

  2. #2
    Me want cookie! Monster's Avatar
    Join Date
    Dec 2001
    Posts
    680
    I think it's for reading DNA strings:

    A
    followed by one single character A, C, G, T or N
    followed by one single character A, C, G, T or N
    followed by one single character G

    The above is repeated zero or more times.

    I think the * must be a + (one or more times).

    A valid example is: AAAGACAGATNGANAG

  3. #3
    Registered User
    Join Date
    Mar 2004
    Posts
    4
    Thank you very much for the quick reply!

    Yes, it is for reading DNA strings. Also, I don't want to change it; I just want to understand it as it is.
    One thing: I don't see how your solution accounts for the nesting of the character classes (maybe I'm a bit slow!). Are you saying that nested character classes are allowed and do work in the C library regex.h? Could you explain exactly how you break down the regex? For example, this is how I initially saw it:

    Code:
    [A[ACGTN]
    matches A, [, A, C, G, T, or N
    Code:
    [ACGTN]
    matches A, C, G, T, or N
    Code:
    G]*
    matches G followed by zero or more right square brackets (they don't match any left square brackets)

    So, valid matches for this would be: AAG, AAG]]]]], [CG]], NTG. Could you please clarify how you are breaking down the regex?

    sorry i'm not too clear on this,
    many thanks again,
    simon
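The breakdown in this post can be checked directly. Below is a minimal sketch, assuming the pattern is compiled as a POSIX basic regex (flags without REG_EXTENDED; the thread does not say which flags the generator uses). The helper name matches_whole is made up for illustration; it anchors the pattern with ^ and $ so that only whole-string matches count.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not part of regex.h): returns 1 if the whole of
   text matches pattern, 0 if it does not, -1 on error.
   Assumption: pattern is a basic regex without top-level alternation,
   so wrapping it in ^...$ anchors the entire pattern. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Build "^pattern$" so a match must cover the whole string. */
    n = snprintf(anchored, sizeof anchored, "^%s$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, anchored, REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Called with "[A[ACGTN][ACGTN]G]*", this should accept AAG, AAG]]]]], [CG]] and NTG as whole strings, and reject AAAGACAGATNGANAG, which is what the bracket-by-bracket reading above predicts.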

  4. #4
    Me want cookie! Monster's Avatar
    Join Date
    Dec 2001
    Posts
    680
    Maybe this site will help you understand: http://old.scriptmeridian.org/projec...s/classes.html

    especially this line: If you want to include a "]" in a character list, either include it as the first character (eg "[]]"), or escape it using a backslash (eg "[\\]]").

    All [ and ] characters in your example are special characters (bracket expression).
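A quick way to see which characters a single bracket expression accepts is to test it one character at a time. This is only a sketch: class_accepts is a hypothetical helper, and it assumes POSIX basic-regex compilation via regcomp. As far as I know, POSIX bracket expressions treat a backslash as an ordinary character, so the backslash-escape form in the quoted line may belong to a different regex flavor; the "] as first character" form is the one used here.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not part of regex.h): returns 1 if the single
   character c is accepted by bracket_expr (for example "[]]"),
   0 if it is not, -1 on error. */
int class_accepts(const char *bracket_expr, char c)
{
    char anchored[128];
    char text[2] = { c, '\0' };   /* one-character test string */
    regex_t re;
    int n, rc;

    /* "^[...]$" matches only a string of exactly one accepted character. */
    n = snprintf(anchored, sizeof anchored, "^%s$", bracket_expr);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, class_accepts("[]]", ']') tests the "] as first character" rule, and class_accepts("[A[ACGTN]", '[') tests whether a [ inside a bracket expression is taken as an ordinary member of the list.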

  5. #5
    Registered User
    Join Date
    Mar 2004
    Posts
    4
    thanks a million, I appreciate your time. That url is helpful.

    May I clear up one thing?
    "All [ and ] characters in your example are special characters (bracket expression)."
    This means that there _are_ nested character classes, and they can be compiled and executed without error. I know that a Java implementation of regular expressions allows the nesting of character classes and, in that case, it means something like the union of the two classes.
    So, are you saying that you know that the regex.h library in C allows nested character classes? If so, what does it _mean_ when you have a nested character class? It seems like you are treating the outer set of square brackets as a grouping set of brackets (round brackets) for a sequence, but I thought that the whole thing would only match one single character, which could be any of the characters within (including those within the inner character classes)?

    thanks again,
    simon

  6. #6
    Me want cookie! Monster's Avatar
    Join Date
    Dec 2001
    Posts
    680
    battersausage, you are right

    I made a mistake with the square brackets. I thought it was: (A[ACGTN][ACGTN]G) *

    I don't see the reason to use nested character classes. I'm not even sure if it is valid.
    Did you try to run the app and see what it's matching?

  7. #7
    Registered User
    Join Date
    Mar 2004
    Posts
    4

    I just ran it on some data and here are some of the results: TTG, NNG, GTG, AAG, AGG, TCG, NAG. Every one of them was a three-letter match with a G in the last position and the first two letters drawn from A, C, G, T, or N.
    Interestingly enough, when I tried the following regex I got: CG, NG, AG, TG, GG.
    Code:
    [A[ACGTN]G]*
    Again, every one of them had a G in the last position and a first letter from A, C, G, T, or N.
    In the original regex, it seems to include the first A and [ACGTN] in the first-letter match, followed by a match from the second [ACGTN], always followed by a G.
    Also interesting: with the original regex it always returns a 3-character match and never anything longer, even though the * means zero or more repetitions. The test data does contain sequences longer than 3 characters, so longer matches were available, but it restricts itself to 3-character sequences. So, on this data, the original regex matches exactly 3 characters.

    Having run several more experiments I now understand what it's doing!
    In summary, here's the original regex followed by its explanation:
    Code:
    [A[ACGTN][ACGTN]G]*
    The first right bracket matches the first left bracket, thereby creating a character class. That character class consists of the following characters: A, [, A, C, G, T, and N. (note that the second left bracket in the regex is included as a normal character in the character class). This first character class is followed _in sequence_ by a second character class containing A, C, G, T, and N. This is followed _in sequence_ by the character G. This is then followed _in sequence_ by zero or more right brackets. So if I had right brackets in my test data I might come up with matches like: TTG]]]], NNG, GTG]], AAG], AGG]]]]], TCG], NAG]]]]]]]]]]]]]]]]]]]]]]]], etc.

    thanks for your help monster!!
    simon
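The summary above can be reproduced with regexec and a regmatch_t, which reports where the first match starts and ends. This is a rough sketch under the same assumption as before (basic-regex flags, since the thread does not state them); find_first is a made-up helper name, not something from regex.h.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of regex.h): finds the first match of
   pattern in text. On success returns 1 and stores the match offset and
   length; returns 0 if nothing matches, -1 if the pattern fails to compile. */
int find_first(const char *pattern, const char *text, int *start, int *length)
{
    regex_t re;
    regmatch_t m[1];   /* m[0] describes the overall match */
    int rc;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    /* rm_so and rm_eo are byte offsets: start and one-past-the-end. */
    *start = (int)m[0].rm_so;
    *length = (int)(m[0].rm_eo - m[0].rm_so);
    return 1;
}
```

On a string such as xxTTG]]]yy the reported match should be TTG]]] (offset 2, length 6), while on DNA data with no ] characters, such as AAAGACAG, the match is the 3-character AAG. That fits the observation that the original regex never returned more than 3 characters on the test data.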

  8. #8
    Me want cookie! Monster's Avatar
    Join Date
    Dec 2001
    Posts
    680
    Originally posted by battersausage
    thanks for your help monster!!
    simon
    I think I have to thank you Simon. You're the one who solved it and gave me a lesson in regex.
