Parsing for Dummies

Okay, yes, I have checked every resource on cprogramming.com, and yet I am still confused about the concept of string parsing and separating strings into tokens. It isn't that there isn't enough information; I just haven't found any info that's "dumbed-down" enough for someone who is completely new to parsing.

So I know what parsing is.
But what I'm wondering is: what are the rudimentary basics of parsing and string separation in programming, and how do they work logically?

Many thanks. I know that this is a bit vague; I tried to focus my question as well as I could.

It depends on what you're looking for. If you want to take the string "This is a string. This is another part of this string." and separate it into two strings (one for each sentence), you would iterate until you found a period. If you wanted to parse all the words out of a string into multiple strings, you could use two pointers: leave one at the start of a word, advance the other until it finds a non-alphanumeric character, take the difference between the two pointers to get the word's length, and copy that many characters into a new string.
There are multiple ways of parsing strings, and all of them rely on some form of iteration.

In my mind, parsing involves taking a given original string and looking for a given set of targets, which may be substrings or single characters. When a target is found, the full string is either physically or logically subdivided into resultant substrings (aka tokens) according to some protocol.

Here's an example of two very rudimentary parsing implementations to illustrate the basic concept. I hope it helps. :)

Both use a pointer to a pointer to a buffer (i.e., a char**) to keep track of the current position. The first example simply reads in tokens that are separated by one or more of a certain character (usually a space).

Code:

#include <string.h>

/*
    return values:
      0   : success
      -1  : done
      > 0 : buffer too small, need this many bytes for next token.
*/
int token(char * buffer, int max, char ** next, char sep)
{
    const char * ptr, * start = NULL;

    for(ptr = *next; *ptr; ++ptr)
    {
        if(*ptr != sep)
        {
            if(start == NULL)
            {
                start = ptr; // first char in token
            }
        }
        else
        {
            if(start != NULL)
            {
                break; // ready to copy
            }
        }
    }
    if(start == NULL)
    {
        return -1; // done
    }
    int lcopy = (int)(ptr - start);
    if(lcopy > max - 1) // max - 1 for null-term
    {
        return lcopy + 1; // need a buffer this big
    }
    strncpy(buffer, start, lcopy);
    buffer[lcopy] = 0;
    *next = (char*)ptr; // resume from here on the next call
    return 0;
}

Its usage would be like this:

Code:

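#include <stdio.h>

int main()
{
    enum { bufsz = 1024 }; // compile-time constant, so buf is a plain array in C as well as C++
    char buf[bufsz];
    char data[] = "This data...needs to be parsed ";
    char * iter = data;

    // token() returns 0 for each token found, -1 once the data is exhausted
    while(0 == token(buf, bufsz, &iter, ' '))
    {
        printf("Token: '%s'.\n", buf);
    }
    return 0;
}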

The next one takes a slightly different approach by skipping over any character that doesn't appear in a string of accepted 'token-type' characters. It uses two helper functions, find_first_of and find_first_not_of, to accomplish that.

Code:

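#include <stdbool.h> // for bool in C
#include <string.h>

// returns a pointer to the first char in str that appears in find, or NULL
const char * find_first_of(const char * str, const char * find)
{
    const char * ptr;

    for( ; *str; ++str)
    {
        for(ptr = find; *ptr; ++ptr)
        {
            if(*str == *ptr)
            {
                return str;
            }
        }
    }
    return NULL;
}

// returns a pointer to the first char in str that does NOT appear in find, or NULL
const char * find_first_not_of(const char * str, const char * find)
{
    const char * ptr = find;
    bool found;

    for( ; *str; ++str)
    {
        found = false;
        for(ptr = find; *ptr; ++ptr)
        {
            if(*str == *ptr)
            {
                found = true;
            }
        }
        if(!found)
        {
            return str;
        }
    }
    return NULL;
}

/*
    return values:
      0   : success
      -1  : done
      > 0 : buffer too small, need this many bytes for next token.
*/
int token(char * buffer, int max, char ** next, const char * match)
{
    const char * start = find_first_of(*next, match);
    if(start == NULL)
    {
        return -1; // no more token characters
    }
    const char * ptr = find_first_not_of(start, match);
    if(ptr == NULL)
    {
        // token runs to the end of the string; point at the terminator
        // so the last character of the token isn't dropped
        ptr = start + strlen(start);
    }
    int lcopy = (int)(ptr - start);
    if(lcopy > max - 1) // max - 1 for null-term
    {
        return lcopy + 1; // need a buffer this big
    }
    strncpy(buffer, start, lcopy);
    buffer[lcopy] = 0;
    // don't step past ptr: it may be the terminator, and the next call
    // skips any non-matching char anyway
    *next = (char*)ptr;
    return 0;
}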

The usage for that would be:

Code:

#include <stdio.h>

int main()
{
    enum { bufsz = 1024 }; // compile-time constant, so buf is a plain array in C as well as C++
    char buf[bufsz];
    // only these characters can be part of a token
    const char * accept =
        "abcdefghijklmnopq"
        "rstuvwxyzABCDEFGH"
        "IJKLMNOPQRSTUVWXYZ";
    char data[] = "This data...needs to be parsed ";
    char * iter = data;

    while(0 == token(buf, bufsz, &iter, accept))
    {
        printf("Token: '%s'.\n", buf);
    }
    return 0;
}

Woah... thanks... I'm going to have to sit down with a cup of coffee and begin attempting to "translate" all of that in my mind... haha...thanks.