Thread: Please critique and suggest improvements for parser

  1. #1
    Registered User
    Join Date
    Sep 2006
    Posts
    35

    Please critique and suggest improvements for parser

    Overview: parse_field acts like fgets, except while fgets returns lines, parse_field returns whitespace delimited fields. However, unlike fgets, which simply returns the next line, parse_field also replaces all instances of '&' with char *user.

    Before I extend the attached code any further, I would like more experienced programmers to critique any flaws, particularly those regarding design and harmful programming practices. Seeing as this code serves as the foundation for my soon-to-be larger program, I want to make sure that it's stable (and hopefully bulletproof).

    However, char *field has been posing quite a bit of a problem (highlighted in red). If you notice, field is defined in parse_field, but since I return field, I have no chance to free it. Therefore, I made field global (which, if I'm not mistaken, is not the best programming practice) and free it in parse_line.

    Note: Although state is not being used yet, it will be later.

    Thanks in advance for any helpful criticism.
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    char *field;
    const char *save, *user = "username";
    int state = 0;
    
    char *parse_line(const char *line);
    char *parse_field(const char *line);
    
    int main(void) {
        char *line = "The user is & \t and  his home is /home/&";
        parse_line(line);
        return 0;
    }
    
    char *parse_line(const char *line) {
        while (parse_field(line)) {
            printf("Field: %s\n", field);
            free(field);
        }
    }
    
    char *parse_field(const char *line) {
        const char *ampersand, *end, *start, *traverse;
        int ampersand_count = 0;
        size_t len, userlen = strlen(user);
    
        if (save && *save == '\0') {
            save = NULL;
            return NULL;
        }
        if (!save) save = line;
        while(isspace(*save)) ++save;
        end = save;
        start = save;
        while (*end != '\0' && !isspace(*end)) ++end;
        save = end;
        len = end - start;
        field = malloc(len + 1);
        strncpy(field, start, len);
        field[len] = 0;
    
        traverse = start;
        for (; traverse != end; ++traverse) {
            if (*traverse == '&') ++ampersand_count;
        }
        if (ampersand_count > 0) {
            free(field);
            field = malloc(len + ampersand_count * (userlen - 1) + 1);
            if (field == NULL) {
                return NULL;
            }
            strncpy(field, start, len);
            field[len] = 0;
            while ((ampersand = strchr(field, '&'))) {
                memmove(ampersand + userlen, ampersand + 1, strlen(ampersand + 1));
                memcpy(ampersand, user, userlen);
            }
        }
    
        ++state;
        return field;
    }

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Well if you assigned the return result rather than just testing it, you wouldn't need an icky global.

    Code:
    void parse_line(const char *line) {
        char *temp;
        while ( (temp=parse_field(line)) != NULL ) {
            printf("Field: %s\n", temp);
            free(temp);
        }
    }
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Sep 2006
    Posts
    35
    Quote Originally Posted by Salem
    Well if you assigned the return result rather than just testing it, you wouldn't need an icky global.

    Code:
    void parse_line(const char *line) {
        char *temp;
        while ( (temp=parse_field(line)) != NULL ) {
            printf("Field: %s\n", temp);
            free(temp);
        }
    }
    Thanks, Salem. Any suggestions and/or criticism on the rest of the code?

  4. #4
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Would you want to generalize the functions so that other delimiters and/or replacement text might be used?

    Might you want to use passed parameters of destination and size rather than doing dynamic allocation?
    Code:
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    
    int  parse_field(char *dst, size_t size, const char **src, char delim, const char *repl);
    /* implementation elsewhere */
    
    void parse_engine(const char *line, char delim, const char *user)
    {
        char output[16];
        while ( parse_field(output, sizeof output, &line, delim, user) )
        {
            printf("output = \"%s\"\n", output);
        }
    }
    
    int main(void)
    {
        parse_engine("The user is & \t and  &&&'s home is /home/&",
                     '&', "username");
        return 0;
    }
    
    /* my output
    output = "The"
    output = "user"
    output = "is"
    output = "username"
    output = "and"
    output = "username's"
    output = "home"
    output = "is"
    output = "/home/username"
    */
    That is, leave the size to the caller or the engine (as in this case)?
    Last edited by Dave_Sinkula; 09-18-2006 at 07:08 PM. Reason: Oops. Accidentally removed the colored 16.
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    35
    Quote Originally Posted by Dave_Sinkula
    That is, leave the size to the caller or the engine (as in this case)?
    Could you provide that link, so that I can understand how it would actually work?

  6. #6
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by Queue
    Could you provide that link, so that I can understand how it would actually work?
    Did you mean this (that I accidentally munged in an edit)?
    Code:
    char output[16];
    Or my parse_field implementation?
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  7. #7
    Registered User
    Join Date
    Sep 2006
    Posts
    35
    Quote Originally Posted by Dave_Sinkula
    Did you mean this (that I accidentally munged in an edit)?
    Code:
    char output[16];
    Or my parse_field implementation?
    Ah, I misinterpreted your blue highlighting as a link. I still would be interested in how parse_field would be implemented, though.

  8. #8
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Aw, what the heck. I'm pretty sure you may take this in different directions anyway.
    Code:
    int parse_field(char *dst, size_t size, const char **src, char delim, 
                    const char *replacement_text)
    {
       size_t i, replen = strlen(replacement_text);
       int intext = 0;
       for ( --size, i = 0; i < size; ++i )
       {
          if ( **src == '\0' )
          {
             if ( i == 0 )
             {
                return 0; /* nothing more to do */
             }
             break;
          }
          if ( isspace(**src) )
          {
             if ( intext )
             {
                break;
             }
             while ( isspace(**src) )
             {
                ++*src;
             }
          }
          else
          {
             intext = 1;
          }
          if ( **src != delim )
          {
             *dst++ = **src;
             ++*src;
          }
          else
          {
             while ( **src == delim )
             {
                ++*src;
             }
             if ( i + replen < size )
             {
                memcpy(dst, replacement_text, replen);
                dst += replen;
                intext = 1;
             }
             else
             {
                size_t maxlen = size - i;
                fprintf(stderr,
                        "[warning] parse buffer size too small (%d), need %d"
                        " : \"%.*s\" + \"%s\"\n", 
                        (int)(size + 1), (int)(i + replen + 2), 
                        (int)i, dst - i, replacement_text);
                memcpy(dst, replacement_text, maxlen);
                dst += replen;
                break;
             }
          }
       }
       *dst = '\0';
       return 1; /* continue parse engine */
    }
    With this kinda thing, I'm just balking when the input buffer is too small. If the stdout is redirected to a file, you'd get a message for each instance in which this is the case, such as with output having a size of 12. Without the redirect it might show something like this.
    Code:
    /* my output (buffer size = 12)
    [warning] parse buffer size too small (12), need 16 : "/home/" + "username"
    output = "The"
    output = "user"
    output = "is"
    output = "username"
    output = "and"
    output = "username's"
    output = "home"
    output = "is"
    output = "/home/usern"
    */
    Now, instead of doing this, it could perhaps be redone in a way in which malloc is first called for the output buffer. Then if in the "buffer size too small" case is reached the function could return a 2 or something. Then you could use this return value to cause a realloc to, say double, the size of the output buffer. Or something like that. That way you'd be doing less malloc'ing and realloc'ing.

    [edit]BTW, this is an FSM of sorts. The states are just not separated well.
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  9. #9
    Registered User
    Join Date
    Sep 2006
    Posts
    35
    Thanks for posting that, Dave_Sinkula. Of course, now that I've seen another way to accomplish the same goal, I'm curious to know why you've implemented some parts the way that you did.

    1. Instead of using a global pointer to save the location of the parser on the function's exit, you pass in the address of the pointer itself, while incrementing it based on the conditions. Which method is preferable: using the global namespace or passing addresses as such?
    2. You use char *dst (or char *output) to store the characters as they're being copied from char **src. On the other hand, I return the value of char *field; that way, the function calling parse_field can choose not to assign the return value to a variable. Is one method better than the other?
    3. Although you use a FSM-like design, how exactly is it advantageous over my method?

    Thanks again for taking the time to answer my questions.

  10. #10
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    1. I prefer to avoid globals, I live in that hell eternally.
    2. I just like to avoid dynamic allocation, too. I've also found that the standard approach, like with strcpy, etc., is generally a good pattern to follow.
    3. It's just something else to look at, more or less, to help give you other ideas.

      [edit]Making it into a full-fledged FSM is a bit of work, but if things are really more complex than you've shown, it can help simplify things. It depends on the direction the code takes.[/edit]
    Last edited by Dave_Sinkula; 09-18-2006 at 10:06 PM. Reason: Added blather.
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  11. #11
    Registered User
    Join Date
    Sep 2006
    Posts
    35
    Quote Originally Posted by Dave_Sinkula
    Making it into a full-fledged FSM is a bit of work, but if things are really more complex than you've shown, it can help simplify things. It depends on the direction the code takes.
    Unfortunately, the parsing doesn't get much more complex than what I've shown thus far. Still, I'm very curious as to what a full-fledged FSM looks like code-wise (especially without the use of a stack, as every example I've seen so far uses one).

    Just to give you an idea of the code's direction, the final version of parse_line will ideally look somewhat like the code block below. Given the following, would I benefit from designing the parser as a FSM? I might need help implementing the commented part as well.

    Code:
    int state = 0;
    char *field;
    while ((field = parse_field(line))) {
            ++state;
            printf("Field: %s\n", field);
            switch (state) {
                    case 1 : state = parse_user(field);
                             /* if state = 0, continue to case 2; else, continue to case 3 */
                             break;
                    case 2 : state = parse_group(field);
                             /* if state = 0, continue to case 3; else, exit */
                             break;
                    case 3 : parse_foo(field);
                             break;
                    ...
            }
            free(field);
    }

  12. #12
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by Queue
    Still, I'm very curious as to what a full-fledged FSM looks like code-wise (especially without the use of a stack, as every example I've seen so far uses one).
    This might get close, but it's still a hack job.
    Code:
    /* parser.h */
    #ifndef PARSER_H
    #define PARSER_H
    
    void parse_engine(const char *src, char delim, const char *replacement_text);
    
    #endif
    Code:
    /* parse_main.c */
    #include "parser.h"
    
    int main(void)
    {
       parse_engine("The user is * \t and  ***'s home is /home/*", '*', "username");
       return 0;
    }
    Code:
    /* parser.c */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    #include "parser.h"
    
    struct fsm
    {
       int (*state_function)(struct fsm *this_fsm); /* current state function */
       char *dst;
       char destination[12];
       const char **source;
       const char *replacement_text;
       char delimiter;
       size_t size;
    };
    
    static int parse_init(struct fsm *this_fsm);
    static int parse_leading_whitespace(struct fsm *this_fsm);
    static int parse_regular_text(struct fsm *this_fsm);
    static int parse_replacement_text(struct fsm *this_fsm);
    
    void parse_engine(const char *src, char delim, const char *replacement_text)
    {
       struct fsm myfsm = {parse_init};
       myfsm.dst = myfsm.destination;
       myfsm.source = &src;
       myfsm.delimiter = delim;
       myfsm.replacement_text = replacement_text;
       myfsm.size = sizeof myfsm.destination - 1;
       while ( myfsm.state_function(&myfsm) )
       {
          //printf("output = \"%s\"\n", myfsm.destination);
       }
    }
    
    static int parse_init(struct fsm *this_fsm)
    {
       if ( **this_fsm->source == '\0' )
       {
          return 0;
       }
       this_fsm->state_function = parse_leading_whitespace;
       return 1;
    }
    
    static int parse_leading_whitespace(struct fsm *this_fsm)
    {
       while ( isspace(**this_fsm->source) )
       {
          ++*this_fsm->source;
       }
       this_fsm->state_function = parse_regular_text;
       return 1;
    }
    
    static int parse_regular_text(struct fsm *this_fsm)
    {
       if ( **this_fsm->source == this_fsm->delimiter )
       {
          this_fsm->state_function = parse_replacement_text;
       }
       else if ( isspace(**this_fsm->source) )
       {
          *this_fsm->dst = '\0';
          printf("\"%s\"\n", this_fsm->destination);
          this_fsm->state_function = parse_init;
          this_fsm->dst = this_fsm->destination;
       }
       else
       {
          if ( this_fsm->dst - this_fsm->destination <
               sizeof this_fsm->destination / sizeof *this_fsm->destination )
          {
             *this_fsm->dst++ = *(*this_fsm->source)++;
          }
       }
       return 1;
    }
    
    static int parse_replacement_text(struct fsm *this_fsm)
    {
       size_t replen = strlen(this_fsm->replacement_text);
       while ( **this_fsm->source == this_fsm->delimiter )
       {
          ++*this_fsm->source;
       }
       if ( (this_fsm->dst - this_fsm->destination) + replen < this_fsm->size )
       {
          memcpy(this_fsm->dst, this_fsm->replacement_text, replen);
          this_fsm->dst += replen;
       }
       else
       {
          int len = this_fsm->dst - this_fsm->destination;
          size_t maxlen = (this_fsm->size - 1) - len;
          fprintf(stderr,
                  "[warning] parse buffer size too small (%d), need %d"
                  " : \"%.*s\" + \"%s\"\n",
                  (int)sizeof this_fsm->destination,
                  (int)(len + replen + 2), len,
                  this_fsm->destination, this_fsm->replacement_text);
          memcpy(this_fsm->dst, this_fsm->replacement_text, maxlen);
          this_fsm->dst += replen;
          this_fsm->dst = '\0';
       }
       if ( isspace(**this_fsm->source) )
       {
          this_fsm->state_function = parse_regular_text;
       }
       else
       {
          printf("\"%s\"\n", this_fsm->destination);
          this_fsm->state_function = parse_init;
       }
       return 1;
    }
    Quote Originally Posted by Queue
    Just to give you an idea of the code's direction, the final version of parse_line will ideally look somewhat like the code block below. Given the following, would I benefit from designing the parser as a FSM? I might need help implementing the commented part as well.
    Apologies for now, perhaps I'll take a closer look later.

    One thing I thought of about #2 above is that it allows for the parse_engine to choose between static and dynamic allocation of the array.
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  13. #13
    Registered User
    Join Date
    Sep 2006
    Posts
    35
    Quote Originally Posted by Dave_Sinkula
    This might get close, but it's still a hack job.
    I'm curious as to why you consider it a hack job. If the reason is simply the amount of time you put into it, that makes sense. Looking at the code, though, I must say I don't really see how it could be any cleaner. Please correct me if that's not true. Thanks so much for posting the entire FSM.
    Quote Originally Posted by Dave_Sinkula
    One thing I thought of about #2 above is that it allows for the parse_engine to choose between static and dynamic allocation of the array.
    Is that why you have char *dst and char destination[12]?

    To reiterate a concern that I brought up earlier, would it benefit me at all to use a FSM? Honestly though, now that you've shown me how to properly implement a FSM, I will probably end up using it just to keep the logic clean. If I understand it properly, the only real downside to using a FSM compared to my previous method would be the overhead.

  14. #14
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by Queue
    I'm curious as to why you consider it a hack job. If the reason is simply the amount of time you put into it, that makes sense. Looking at the code, though, I must say I don't really see how it could be any cleaner. Please correct me if that's not true. Thanks so much for posting the entire FSM.
    A bit regarding the time. Also the fact that it is only partially separated -- for example, printing should be a separate state; or some of the more complex ones could be broken down. To me the parse_field did the same thing in a simpler and easier-to-follow manner, so I think this overcomplicates it.

    Quote Originally Posted by Queue
    Is that why you have char *dst and char destination[12]?
    I chose pointer math over an index. I suppose it would have been better to leave destination out of the structure.

    Quote Originally Posted by Queue
    To reiterate a concern that I brought up earlier, would it benefit me at all to use a FSM? Honestly though, now that you've shown me how to properly implement a FSM, I will probably end up using it just to keep the logic clean. If I understand it properly, the only real downside to using a FSM compared to my previous method would be the overhead.
    I think I would choose to use parse_field as shown earlier to get the whitespace-separated fields, and then use a FSM for the remaining. But that really depends on what it is the sequence should be. But assuming that there is no relationship of the first two states with the second, because I don't know that I know the conditions, I might choose this sort of direction.
    Code:
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    
    int parse_field(char *dst, size_t size, const char **src, char delim, 
                    const char *replacement_text)
    {
       size_t i, replen = strlen(replacement_text);
       int intext = 0;
       for ( --size, i = 0; i < size; ++i )
       {
          if ( **src == '\0' )
          {
             if ( i == 0 )
             {
                return 0; /* nothing more to do */
             }
             break;
          }
          if ( isspace(**src) )
          {
             if ( intext )
             {
                break;
             }
             while ( isspace(**src) )
             {
                ++*src;
             }
          }
          else
          {
             intext = 1;
          }
          if ( **src != delim )
          {
             *dst++ = **src;
             ++*src;
          }
          else
          {
             while ( **src == delim )
             {
                ++*src;
             }
             if ( i + replen < size )
             {
                memcpy(dst, replacement_text, replen);
                dst += replen;
                intext = 1;
             }
             else
             {
                size_t maxlen = size - i;
                fprintf(stderr,
                        "[warning] parse buffer size too small (%d), need %d"
                        " : \"%.*s\" + \"%s\"\n", 
                        (int)(size + 1), (int)(i + replen + 2), 
                        (int)i, dst - i, replacement_text);
                memcpy(dst, replacement_text, maxlen);
                dst += replen;
                break;
             }
          }
       }
       *dst = '\0';
       return 1; /* continue parse engine */
    }
    
    void parse_user(const char *field)
    {
       for ( ;; )
       {
          size_t size = strcspn(field, ",");
          printf(" [parse_user] field = \"%.*s\"\n", (int)size, field);
          field += size;
          if ( *field == '\0' )
          {
             break;
          }
          ++field;
       }
    }
    
    void parse_group(const char *field)
    {
       for ( ;; )
       {
          size_t size = strcspn(field, ",");
          printf(" [parse_group] field = \"%.*s\"\n", (int)size, field);
          field += size;
          if ( *field == '\0' )
          {
             break;
          }
          ++field;
       }
    }
    
    void parse_other(const char *field)
    {
       printf(" [parse_other] field = \"%s\"\n", field);
    }
    
    void parse(const char *line, const char *user)
    {
       char field[32];
       size_t state;
       for ( state = 0; parse_field(field, sizeof field, &line, '&', user); ++state )
       {
          printf("[parse] field = \"%s\"\n", field);
          switch ( state )
          {
          case 0:  parse_user(field);  break;
          case 1:  parse_group(field); break;
          default: parse_other(field); break;
          }
       }
    }
    
    int main(void)
    {
       parse("user1,user2,&,user4 group1,group2,group3 string1 string2 string3", "user3");
       return 0;
    }
    
    /* my output
    [parse] field = "user1,user2,user3,user4"
     [parse_user] field = "user1"
     [parse_user] field = "user2"
     [parse_user] field = "user3"
     [parse_user] field = "user4"
    [parse] field = "group1,group2,group3"
     [parse_group] field = "group1"
     [parse_group] field = "group2"
     [parse_group] field = "group3"
    [parse] field = "string1"
     [parse_other] field = "string1"
    [parse] field = "string2"
     [parse_other] field = "string2"
    [parse] field = "string3"
     [parse_other] field = "string3"
    */
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  15. #15
    Registered User
    Join Date
    Sep 2006
    Posts
    35
    After rethinking over the design, I believe that I may in fact implement the parser as a FSM. In addition to all the previous functionality, I believe that multiple wildcards may improve its usability. As such, my current ideal solution includes using a hash table so that by hashing a character, I could retrieve the replacement text. However, one downside would be that in order for the hash to work properly, I would have to test each character against the list of reserved wildcards. In conjunction, I would also like to implement escape character parsing via '\'. Would those extra functions solicit the FSM implementation?

Popular pages Recent additions subscribe to a feed