-
Parsing for Dummies
Okay, yes, I have checked every resource on cprogramming.com, and yet I am still confused about string parsing and separating strings into tokens. It isn't that there isn't enough information; I just haven't found anything that's "dumbed down" enough for someone who is completely new to parsing.
So I know what parsing is.
But what I'm wondering is: what are the basics of parsing and string separation in terms of programming, and how do they work logically?
Many thanks. I know this is a bit vague; I tried to focus my question as well as I could.
-
It depends on what you're looking for. If you want to take the string "This is a string. This is another part of this string." and separate it into two strings (one for each sentence), you would iterate until you found a period. If you wanted to parse all the words out of a string into multiple strings, you could use two pointers: advance the first to the start of a word, advance the second until it finds a non-alphanumeric character, take the difference between the two pointers as the word's length, and copy that many characters into a new string.
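To make the sentence case concrete, here is a rough sketch. The name first_sentence and its return convention are made up for this illustration, and it assumes sentences end with a plain '.' and are separated by spaces:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): copies the first sentence of
   `text`, up to and including the first '.', into `out`.
   Returns a pointer to the start of the next sentence, or NULL if `out`
   is too small to hold the sentence plus its terminator. */
const char *first_sentence(const char *text, char *out, size_t out_size)
{
    const char *p = text;
    while (*p != '\0' && *p != '.')   /* iterate until a period (or the end) */
        ++p;
    if (*p == '.')
        ++p;                          /* keep the period with its sentence */

    size_t len = (size_t)(p - text);
    if (len + 1 > out_size)
        return NULL;
    memcpy(out, text, len);
    out[len] = '\0';

    while (*p == ' ')                 /* skip blanks before the next sentence */
        ++p;
    return p;
}
```

Calling it once on the example string should fill out with "This is a string." and hand back a pointer to the second sentence, which you can feed straight into a second call.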
There are multiple ways of parsing strings, and all of them rely on some form of iteration.
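The two-pointer word extraction might look something like this. It is only a sketch: next_word is a name invented here, "word" is taken to mean a run of characters for which isalnum() is true, and a word that doesn't fit the buffer is simply treated as a failure:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): copies the next run of
   alphanumeric characters at or after *pos into `word` and advances *pos
   past it. Returns 1 on success, 0 when no word is left or the word does
   not fit in `word_size` bytes. */
int next_word(const char **pos, char *word, size_t word_size)
{
    const char *begin = *pos;
    while (*begin != '\0' && !isalnum((unsigned char)*begin))
        ++begin;                      /* first pointer: start of the word */
    if (*begin == '\0')
        return 0;

    const char *end = begin;
    while (isalnum((unsigned char)*end))
        ++end;                        /* second pointer: one past the word */

    size_t len = (size_t)(end - begin);   /* difference between the pointers */
    if (len + 1 > word_size)
        return 0;
    memcpy(word, begin, len);
    word[len] = '\0';
    *pos = end;
    return 1;
}
```

You would call it in a loop, starting with a pointer to the beginning of the string, until it returns 0.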
-
In my mind, parsing involves taking a given original string and looking for a given set of targets, which may be substrings or single characters. When a target is found, the full string is either physically or logically subdivided into resultant substrings (aka tokens) according to some protocol.
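One way to picture the physical/logical distinction, as a sketch only: the standard strtok() overwrites each separator with '\0', so the original buffer really is chopped up, while a pointer-plus-length pair can describe a token without touching the string. The names split_in_place, split_as_spans and struct span are invented for this example:

```c
#include <stddef.h>
#include <string.h>

/* Physical subdivision: strtok() writes '\0' over separators, so `text`
   itself is cut into separate C strings. Returns the number of tokens. */
size_t split_in_place(char *text, const char *seps,
                      char *tokens[], size_t max_tokens)
{
    size_t n = 0;
    for (char *t = strtok(text, seps);
         t != NULL && n < max_tokens;
         t = strtok(NULL, seps))
        tokens[n++] = t;
    return n;
}

/* Logical subdivision: `text` is left untouched; each token is described
   by where it starts and how long it is. */
struct span { const char *start; size_t len; };

size_t split_as_spans(const char *text, const char *seps,
                      struct span spans[], size_t max_spans)
{
    size_t n = 0;
    const char *p = text + strspn(text, seps);   /* skip leading separators */
    while (*p != '\0' && n < max_spans) {
        size_t len = strcspn(p, seps);           /* length up to next separator */
        spans[n].start = p;
        spans[n].len = len;
        ++n;
        p += len;
        p += strspn(p, seps);                    /* skip the separator run */
    }
    return n;
}
```

The in-place version needs a writable buffer (not a string literal), whereas the span version also works on const data; printing a span takes something like printf("%.*s", (int)len, start).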
-
Here's an example of two very rudimentary parsing implementations to illustrate the basic concept. I hope it helps. :)
Both use a pointer to a pointer into the buffer (i.e. a char**) to keep track of the current position. The first example simply reads in tokens that are separated by one or more of a certain character (usually a space).
Code:
#include <stddef.h>
#include <string.h>

/*
    return values:
     0  : success
    -1  : done
    > 0 : buffer too small, need this
          many bytes for next token.
*/
int token(
    char * buffer,
    int max,
    char ** next,
    char sep)
{
    const char * ptr, * start = NULL;
    // skip leading separators, then scan to the end of the token
    for(ptr = *next; *ptr; ++ptr)
    {
        if(*ptr != sep)
        {
            if(start == NULL)
            {
                start = ptr; // first char in token
            }
        }
        else
        {
            if(start != NULL)
            {
                break; // ready to copy
            }
        }
    }
    if(start == NULL)
    {
        return -1; // done
    }
    int lcopy = (int)(ptr - start);
    if(lcopy > max - 1) // max - 1 for null-term
    {
        return lcopy + 1; // need a buffer this big
    }
    strncpy(buffer, start, lcopy);
    buffer[lcopy] = 0;
    *next = (char*)ptr; // resume here on the next call
    return 0;
}
Its usage would be like this:
Code:
#include <stdio.h>

int main(void)
{
    enum { bufsz = 1024 }; // compile-time constant
    char buf[bufsz];
    char data[] = "This data...needs to be parsed ";
    char * iter = data;
    while(0 == token(buf, bufsz, &iter, ' '))
    {
        printf("Token: '%s'.\n", buf);
    }
    return 0;
}
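The loop above never exercises the "buffer too small" return value, so here is a hedged sketch of one way a caller might react to it. token_alloc is a made-up wrapper, the 8-byte first attempt is deliberately tiny just to force the retry path, and the token() body is a condensed copy of the one above, repeated only so the sketch compiles on its own:

```c
#include <stdlib.h>
#include <string.h>

/* Condensed copy of the single-separator token() from the post above. */
int token(char *buffer, int max, char **next, char sep)
{
    const char *ptr, *start = NULL;
    for (ptr = *next; *ptr; ++ptr) {
        if (*ptr != sep) {
            if (start == NULL)
                start = ptr;
        } else if (start != NULL) {
            break;
        }
    }
    if (start == NULL)
        return -1;
    int lcopy = (int)(ptr - start);
    if (lcopy > max - 1)
        return lcopy + 1;
    memcpy(buffer, start, (size_t)lcopy);
    buffer[lcopy] = 0;
    *next = (char *)ptr;
    return 0;
}

/* Hypothetical wrapper: returns the next token as a malloc'd string that the
   caller must free(), or NULL when there are no tokens left (or malloc fails). */
char *token_alloc(char **next, char sep)
{
    char small[8];                        /* deliberately small first attempt */
    int rc = token(small, (int)sizeof small, next, sep);
    if (rc < 0)
        return NULL;                      /* done */

    /* rc == 0: the token fit; rc > 0: token() reported the size it needs */
    size_t need = (rc == 0) ? strlen(small) + 1 : (size_t)rc;
    char *out = malloc(need);
    if (out == NULL)
        return NULL;

    if (rc == 0) {
        memcpy(out, small, need);
    } else if (token(out, (int)need, next, sep) != 0) {
        free(out);                        /* retry failed unexpectedly */
        return NULL;
    }
    return out;
}
```

The retry works because token() only advances *next on success, so a second call with a big enough buffer picks up the same token.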
The next one takes a slightly different approach by skipping over any character that isn't in a given set of accepted ('token-type') characters. It uses two helper functions, find_first_of and find_first_not_of, to accomplish that.
Code:
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// returns the first char in str that IS in find, or NULL
const char * find_first_of(
    const char * str,
    const char * find)
{
    const char * ptr;
    for( ; *str; ++str)
    {
        for(ptr = find; *ptr; ++ptr)
        {
            if(*str == *ptr)
            {
                return str;
            }
        }
    }
    return NULL;
}

// returns the first char in str that is NOT in find, or NULL
const char * find_first_not_of(
    const char * str,
    const char * find)
{
    const char * ptr;
    bool found;
    for( ; *str; ++str)
    {
        found = false;
        for(ptr = find; *ptr; ++ptr)
        {
            if(*str == *ptr)
            {
                found = true;
                break;
            }
        }
        if(!found)
        {
            return str;
        }
    }
    return NULL;
}

/*
    return values:
     0  : success
    -1  : done
    > 0 : buffer too small, need this
          many bytes for next token.

    note: same name as the first example, so in C the
    two need separate files (or rename one of them).
*/
int token(
    char * buffer,
    int max,
    char ** next,
    const char * match)
{
    const char * start = find_first_of(*next, match);
    if(start == NULL)
    {
        return -1;
    }
    const char * ptr = find_first_not_of(start, match);
    if(ptr == NULL)
    {
        // token runs to the end of the string;
        // point at the terminator so the last char is kept
        ptr = start + strlen(start);
    }
    int lcopy = (int)(ptr - start);
    if(lcopy > max - 1)
    {
        return lcopy + 1;
    }
    strncpy(buffer, start, lcopy);
    buffer[lcopy] = 0;
    *next = (char*)ptr; // never step past the terminator
    return 0;
}
The usage for that would be:
Code:
#include <stdio.h>

int main(void)
{
    enum { bufsz = 1024 }; // compile-time constant
    char buf[bufsz];
    const char * accept = "abcdefghijklmnopq"
                          "rstuvwxyzABCDEFGH"
                          "IJKLMNOPQRSTUVWXYZ";
    char data[] = "This data...needs to be parsed ";
    char * iter = data;
    while(0 == token(buf, bufsz, &iter, accept))
    {
        printf("Token: '%s'.\n", buf);
    }
    return 0;
}
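For comparison, the standard library already has close relatives of the two helpers: strpbrk() finds the first character that is in a set, and strspn() measures how long a run of characters from a set is. A rough sketch of the same accept-set tokenizer built on them follows. token_accept is a name made up here, and unlike the version above it takes a size_t size and a const char** position:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical variant of the accept-set tokenizer using <string.h>.
   Same return convention as above:
    0 : success, -1 : done, > 0 : bytes needed for the next token. */
int token_accept(char *buffer, size_t max,
                 const char **next, const char *accept)
{
    const char *start = strpbrk(*next, accept);  /* first char that IS accepted */
    if (start == NULL)
        return -1;

    size_t len = strspn(start, accept);          /* length of the accepted run */
    if (len + 1 > max)
        return (int)(len + 1);

    memcpy(buffer, start, len);
    buffer[len] = '\0';
    *next = start + len;                         /* resume right after the token */
    return 0;
}
```

Because strspn() stops at the terminator by itself, a token that runs to the very end of the string needs no special case here.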
-
Woah... thanks... I'm going to have to sit down with a cup of coffee and begin attempting to "translate" all of that in my mind... haha...thanks.