Parsing for Dummies

**MisterWonderful** · 03-07-2004

Okay, yes, I have checked every resource on cprogramming.com, and yet I am still confused about the concept of string parsing and/or separating strings into tokens. It isn't that there isn't enough information; I just haven't found any info that's "dumbed-down" enough for someone who is completely new to parsing.

So I know what parsing is.
But what I'm wondering is: what are the rudimentary basics of parsing and string separation, in terms of programming, and how do they work logically?

Many thanks. I know that this is a bit vague; I tried to focus my question as well as I could.

**/Muad'Dib\** · 03-07-2004

It depends on what you're looking for. If you want to take the string "This is a string. This is another part of this string." and separate it into two strings (one for each sentence), you would iterate until you found a period. If you wanted to parse all the words out of a string into multiple strings, you could use two pointers: one marks the start of a word, the other advances until it finds a non-alphanumeric character. Then you take the difference between the two pointers and copy that many characters into a new string.
There are multiple ways of parsing strings, and all of them rely on some form of iteration.
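
As a rough illustration of the two-pointer idea, here is a minimal sketch in C. The function name next_word and its parameters are made up for this example, and it assumes that "word" means a run of characters for which isalnum() is true. A word longer than the output buffer is truncated rather than reported as an error.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copies the next run of alphanumeric characters
 * found at or after *cursor into out (at most outsz - 1 characters plus
 * a terminator). Returns 1 if a word was found, 0 when the input is
 * exhausted. *cursor is advanced past whatever was consumed.
 */
int next_word(const char **cursor, char *out, size_t outsz)
{
    assert(cursor != NULL && *cursor != NULL && out != NULL && outsz > 0);

    const char *begin = *cursor;

    /* Skip everything that cannot be part of a word. */
    while (*begin != '\0' && !isalnum((unsigned char)*begin))
        ++begin;

    if (*begin == '\0') {
        *cursor = begin;
        return 0;
    }

    /* Walk a second pointer to the first non-alphanumeric character. */
    const char *end = begin;
    while (isalnum((unsigned char)*end))
        ++end;

    /* The difference between the two pointers is the word length. */
    size_t len = (size_t)(end - begin);
    if (len > outsz - 1)
        len = outsz - 1;        /* truncate instead of overflowing */

    memcpy(out, begin, len);
    out[len] = '\0';

    *cursor = end;
    return 1;
}
```

A caller would start with a const char * pointing at the text and call next_word(&p, buf, sizeof buf) in a loop until it returns 0.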

**elad** · 03-07-2004

In my mind, parsing involves taking a given original string and looking for a given set of targets, which may be substrings or single characters. When a target is found, the full string is subdivided, either physically or logically, into resultant substrings (aka tokens) according to some protocol.
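
To make the "logical" subdivision concrete, here is a small sketch in C where the target is a substring and each token is recorded as a start pointer plus a length, so the original string is never modified. The names struct span and split_on are invented for this example, and the protocol assumed here is simply "every occurrence of the target ends one token and begins the next".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A token described logically: where it starts and how long it is. */
struct span {
    const char *start;
    size_t      len;
};

/*
 * Hypothetical helper: splits text on every occurrence of target (a
 * non-empty substring) and stores up to max spans in out. Returns the
 * number of spans stored. The spans point into text; nothing is copied.
 */
size_t split_on(const char *text, const char *target,
                struct span *out, size_t max)
{
    assert(text != NULL && target != NULL && out != NULL);
    assert(target[0] != '\0');  /* an empty target would never advance */

    size_t tlen = strlen(target);
    size_t count = 0;

    while (count < max) {
        const char *hit = strstr(text, target);

        out[count].start = text;
        out[count].len = hit ? (size_t)(hit - text) : strlen(text);
        ++count;

        if (hit == NULL)
            break;              /* no more targets: that was the last token */
        text = hit + tlen;      /* resume just past the target */
    }
    return count;
}
```

Each span can be printed with printf("%.*s\n", (int)s.len, s.start). A "physical" subdivision would instead write a '\0' over the target inside a writable copy of the string, which is how the standard strtok works for single-character delimiters.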

**Sebastiani** · 03-07-2004

Here's an example of two very rudimentary parsing implementations to illustrate the basic concept. I hope it helps.

Both use a pointer to a pointer into the input (i.e., a char**) to keep track of the current position. The first example simply reads in tokens that are separated by one or more occurrences of a certain character (usually a space).

Code:

#include <stdio.h>  /* printf, used by the usage example */
#include <string.h> /* strncpy */

/*
  return values:
     0 : success
    -1 : done
   > 0 : buffer too small, need this
         many bytes for next token.
*/
 int token(
  char * buffer, 
  int max, 
  char ** next, 
  char sep)
{
 const char * ptr, * start = NULL;
  
     for(ptr = *next; *ptr; ++ptr)
    {
         if(*ptr != sep)
        {
             if(start == NULL)
            {
             start = ptr; // first char in token
            } 
        }
         else 
        { 
             if(start != NULL)
            {
             break; // ready to copy
            } 
        }
    }

    if(start == NULL)
   {
    return -1; // done
   }  

 int lcopy = ptr-start;
            
     if(lcopy > max - 1) // max - 1 for null-term
    {
     return lcopy + 1; // need a buffer this big
    }

 strncpy(buffer, start, lcopy);
                 
 buffer[lcopy] = 0;                
                 
 *next = (char*)ptr;
                 
 return 0;
}

Its usage would be like this:

Code:

 int main()
{
 const int bufsz = 1024;
 
 char buf[bufsz]; 

 char data[] = "This    data...needs   to be  parsed  ";

 char * iter = data;
 
     while(0 == token(buf, bufsz, &iter, ' '))
    {
     printf("Token: '%s'.\n", buf);
    }

 return 0;
}

The next one takes a slightly different approach by skipping over anything that doesn't match a certain 'token-type' string. It uses two helper functions, find_first_of and find_first_not_of in order to accomplish that.

Code:

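#include <stdbool.h> /* bool, true, false (needed when compiled as C) */
#include <stdio.h>   /* printf, used by the usage example */
#include <string.h>  /* strlen, strncpy */

 const char * find_first_of(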
  const char * str, 
  const char * find)
{
 const char * ptr;
 
     for( ; *str; ++str)
    {
         for(ptr = find; *ptr; ++ptr)
        {
             if(*str == *ptr)
            {
             return str;
            } 
        }
    }
         
 return NULL;
}


 const char * find_first_not_of(
  const char * str, 
  const char * find)
{
 const char * ptr = find;

 bool found;
 
     for( ; *str; ++str)
    {
     found = false;
    
         for(ptr = find; *ptr; ++ptr)
        {
             if(*str == *ptr)
            {
             found = true;
             break; // no need to check the rest of the set
            } 
        }
        
        if(!found)
       { 
        return str;
       } 
    }
    
 return NULL;
}


/*
  return values:
     0 : success
    -1 : done
   > 0 : buffer too small, need this
         many bytes for next token.
*/
 int token(
  char * buffer, 
  int max, 
  char ** next, 
  const char * match)
{
 const char * start = find_first_of(*next, match);
 
     if(start == NULL)
    {
     return -1;
    } 
 
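 const char * ptr = find_first_not_of(start, match);
 
     if(ptr == NULL)
    {
     // token runs to the end of the string: stop at the terminator
     // so the last character is not dropped
     ptr = start + strlen(start);
    }   
              
 int lcopy = ptr-start;
            
     if(lcopy > max - 1) // max - 1 for null-term
    {
     return lcopy + 1; // need a buffer this big
    }

 strncpy(buffer, start, lcopy);
                 
 buffer[lcopy] = 0;                
                 
 // resume at the first non-matching char (or the terminator)
 *next = (char*)ptr;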
                 
 return 0;
}

The usage for that would be:

Code:

 int main()
{
 const int bufsz = 1024;
 
 char buf[bufsz]; 

 const char * accept = "abcdefghijklmnopq"
                       "rstuvwxyzABCDEFGH"
                       "IJKLMNOPQRSTUVWXYZ";

 char data[] = "This    data...needs   to be  parsed  ";
 
 char * iter = data;
 
     while(0 == token(buf, bufsz, &iter, accept))
    {
     printf("Token: '%s'.\n", buf);
    }

 return 0;
}
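
For comparison, the C standard library already covers both helpers: strpbrk does the same job as find_first_of, and strspn returns the length of the leading run of accepted characters, which can stand in for find_first_not_of. The following is only a sketch of how the second tokenizer might look on top of them; the name token_std is made up here, and it keeps the same return convention as the functions above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Same return convention as token() above:
 *    0 : success
 *   -1 : done
 *  > 0 : buffer too small, need this many bytes for the next token
 */
int token_std(char *buffer, int max, char **next, const char *match)
{
    assert(buffer != NULL && next != NULL && *next != NULL && match != NULL);

    /* First character that belongs to the accepted set, or NULL. */
    char *start = strpbrk(*next, match);
    if (start == NULL)
        return -1;

    /* Length of the run of accepted characters beginning at start. */
    size_t len = strspn(start, match);

    if (max < 1 || len > (size_t)(max - 1))
        return (int)len + 1;

    memcpy(buffer, start, len);
    buffer[len] = '\0';

    /* Resume right after the token; *next is untouched on failure. */
    *next = start + len;
    return 0;
}
```

It can be dropped into the loop above in place of token(). Because strspn reports a length rather than a pointer, a token that runs to the very end of the string needs no special case.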

**MisterWonderful** · 03-08-2004

Woah... thanks... I'm going to have to sit down with a cup of coffee and begin attempting to "translate" all of that in my mind... haha...thanks.
