Thread: Split string with regular expression

  1. #1
    Registered User catacombs's Avatar
    Join Date
    May 2019
    Location
    /home/
    Posts
    66

    Split string with regular expression

    I'm trying to split this string:

    Code:
    "SPEAKER: Hello there, my friend. SPEAKER2: How are you, friend?"
    into an array.

    I know to use the strtok function from the string.h header file, but I'm running into an issue with the output.

    Here is some test code:

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main(int argc, char** argv)
    {
    
        char* re = "[A-Z]+:(.)";
        char text[400] = "SPEAKER: Hello there, my friend. SPEAKER2: How are you, friend?";
        char* token = strtok(text, re);
    
        while (token != NULL) {
            printf("%s\n", token);
            token = strtok(NULL, re);
        }
        return 0;
    }
    This is the result:

    Code:
    SPE
    KER
     Hello there, my friend
     SPE
    KER2
     How are you, friend?
    The code seems to split on 'A', and I can't figure out whether my regex is incorrect or whether something like this isn't possible in ANSI C. I've run the regex through multiple debuggers, and they say it's valid.

    The ideal result would be:

    Code:
    SPEAKER
    Hello there, my friend
    SPEAKER2
    How are you, friend?

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    26,896
    strtok does not work with regular expressions. You may have to use a third party library.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Feb 2019
    Posts
    408
    GNU Libc, page 247. If you are using GCC and glibc...

  4. #4
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,836
    Why not just use strtok the way it is supposed to be used?

    It can split up the string using the delimiters and then you can do what you want with it.

  5. #5
    Registered User catacombs's Avatar
    Join Date
    May 2019
    Location
    /home/
    Posts
    66
    Quote Originally Posted by laserlight
    strtok does not work with regular expressions. You may have to use a third party library.
    Yeah, after doing more research since posting, the only thing I've found is the pcre library, but that doesn't do the intended split. If you have a third-party library I don't know about, I'm all ears.

    Quote Originally Posted by flp1969
    GNU Libc, page 247. If you are using GCC and glibc...
    Thanks. That helped a little. It seems I might have to split everything based on the space then do a regex check for the all-caps words I'm looking for.

    My end goal is an array of structs:

    Code:
    {
        {"SPEAKER:", "HELLO THERE FRIEND"},
        {"SPEAKER2:", "GOOD BYE"},
    }

    Quote Originally Posted by Click_here
    Why not just use strtok the way it is supposed to be used?

    It can split up the string using the delimiters and then you can do what you want with it.
    I needed something a little more specific and wanted to avoid just splitting on a space and keep track of when an all-caps word appeared. Unfortunately, that's looking like the only path forward.

  6. #6
    Registered User john.c's Avatar
    Join Date
    Dec 2017
    Posts
    606
    In what sense does it not do the "intended split"? That doesn't make sense. Maybe your regex is wrong or you are using the library incorrectly. You need to show some code.

    An alternative: Regular Expressions (The GNU C Library)

    Example:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <regex.h>
     
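    int main(void) {
        regex_t re;
     
        /* regcomp returns an error code and does not set errno,
           so report failures with regerror rather than perror. */
        int rc = regcomp(&re, "^ *[A-Z]+ *: *(.*)", REG_EXTENDED);
        if (rc != 0) {
            char msg[128];
            regerror(rc, &re, msg, sizeof msg);
            fprintf(stderr, "regcomp: %s\n", msg);
            exit(EXIT_FAILURE);
        }
     
        char str[] = "  HELLO  :  there world";
     
        /* m[0] is the whole match; m[1] is the first capture group. */
        regmatch_t m[2];
        if (regexec(&re, str, 2, m, 0) == 0) {
            int start = (int)m[1].rm_so, end = (int)m[1].rm_eo;
            printf("match: %d %d\n", start, end);
            /* Temporarily terminate the string at the end of the capture. */
            char ch = str[end];
            str[end] = '\0';
            printf("substring: |%s|\n", str + start);
            str[end] = ch;
        }
        else
            printf("no match\n");
     
        regfree(&re);
        return 0;
    }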
    Output:
    Code:
    match: 12 23
    substring: |there world|
    Last edited by john.c; 06-18-2019 at 01:34 PM.
    Let him who is not come to logic be plagued with continuous and everlasting filth.
    - John of Salisbury, 1160

  7. #7
    Registered User catacombs's Avatar
    Join Date
    May 2019
    Location
    /home/
    Posts
    66
    Quote Originally Posted by john.c
    In what sense does it not do the "intended split"? That doesn't make sense. Maybe your regex is wrong or you are using the library incorrectly.
    My regex was not wrong. As laserlight pointed out, strtok doesn't take regular expressions, and that was why I didn't get the expected output.

    You need to show some code.
    I already provided code. Run my example.

  8. #8
    Registered User john.c's Avatar
    Join Date
    Dec 2017
    Posts
    606
    Obviously it won't work with strtok. I'm not an idiot!
    You said that you tried pcre and it didn't work.
    Your regex was almost certainly wrong.

  9. #9
    Registered User
    Join Date
    Feb 2019
    Posts
    408
    Quote Originally Posted by catacombs
    My regex was not wrong.
    Yes, it was... For your example, "SPEAKER: Hello there, my friend. SPEAKER2: How are you, friend?", the capture group in the regex:
    Code:
    [A-Z]+:(.)
    will match only a single space.
    Last edited by flp1969; 06-18-2019 at 03:28 PM.

  10. #10
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    26,896
    If you're literally testing with "SPEAKER2", that [A-Z]+ won't match it either.

  11. #11
    Registered User
    Join Date
    Feb 2019
    Posts
    408
    Here is a simple example of how you can test your regex:
    Code:
    $ echo "\"$(echo "SPEAKER: Hello there, my friend. SPEAKER2: How are you, friend?" | grep -oE '[A-Z]+:(.)')\""
    "SPEAKER: "

  12. #12
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,836
    I needed something a little more specific and wanted to avoid just splitting on a space and keep track of when an all-caps word appeared
    I still don't understand why you can't use strtok the way that it is supposed to be used, and then just write a few extra functions to create the functionality that you need...

    Code:
    struct StringInput
    {
      enum StringType Type;
      char Value[MAXLENGTH];
    };
    
    ...
    
    struct StringInput banana;
    
    ...
    
    //  Get next using strtok...
    
    ...
    
    if( StringIsUpper( inputString) )
    {
      banana.Type = ANNOUNCER;
    }
    else
    {
      banana.Type = DATA;
    }

    Don't be one of those people who spend two days finding a header for a task that can be done in 20 minutes.


    At my work we have a joke about a mythological person who spent two weeks trying to find an "IsEven" function. Don't be that guy.
    Fact - Beethoven wrote his first symphony in C

  13. #13
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    26,896
    Quote Originally Posted by Click_here
    I still don't understand why you can't use strtok the way that it is supposed to be used, and then just write a few extra functions to create the functionality that you need...
    strtok has its place, but depending on catacombs' requirements, it may be too much of a blunt instrument, e.g., maybe whitespace must be preserved. Note that your example is flawed: identifying the speaker token requires checking for a terminating ':', not merely checking for a fully uppercase string.

  14. #14
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,836
    Quote Originally Posted by laserlight
    strtok has its place, but depending on catacombs' requirements, it may be too much of a blunt instrument, e.g., maybe whitespace must be preserved. Note that your example is flawed: identifying the speaker token requires checking for a terminating ':', not merely checking for a fully uppercase string.
    I'm sorry laserlight, but you are making assumptions about a function that was never written... And missing the point completely...

    The point that I was trying to make is: sure, strtok didn't work as you expected, but you can still use it.

    It would be extremely easy to check whether a word ends with a ':' and whether it is uppercase (or better yet, not lowercase, so you can have numbers and symbols, e.g. "SPEAKER1_SUBCATEGORY2:").

    Once you have worked out whether it is a tag or data, you can either append it to an existing string (adding a space), or finish the last string and start the next (and do whatever you want with it).

  15. #15
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    26,896
    Quote Originally Posted by Click_here
    I'm sorry laserlight, but you are making assumptions about a function that was never written..
    A perfectly valid assumption for a function named StringIsUpper. That's why you should have named it StringIsAnnouncer.

    Quote Originally Posted by Click_here
    And missing the point completely...
    I'm sorry Click_here but you are making assumptions about a criticism that was never made..

    My point is that whether strtok is a good choice depends on catacombs' exact requirements, not that your idea of using strtok with some token identification necessarily cannot work. If whitespace must be preserved, the original idea of regex could prove to be a better fit.

    I understand your point very well, but your elaboration is good too even though I don't need it.
