String/token parsing

**Mostly Harmless** · 03-04-2008

I'm attempting to write some code that will extract certain tokens from a string of variable length. The code I have now is long and messy (loaded with switch statements, if/thens, etc), and I'm sure there's a more elegant solution. For the sake of brevity, say this is the list of valid tokens:

Code:

A
AZ
@
:...:        //String between colons to be read as a string and not tokenized
{0-9}    //Any single-digit integer

So, valid strings could be:

Code:

A@5AAZ
@01:Hello, world!:@A
AAAZ:AZ:A1
@7@7AZ

For the most part, the strings will be >255 characters in length, and read from a plain text file. The number of valid tokens is more likely to be several dozen. Some tokens will be followed by a string literal. The code also needs to be able to ignore anything that isn't a valid token. Order counts.

Are there any tricks to make this easier? Or do I really need to brute force my way through the string? As it stands now, my code is pages of this:

Code:

switch(input_char)
{
case 'A':
    if(next_char == 'Z')
    {
        //do stuff
    }
    else
    {
        //do some other stuff
    }
    break;
//etc.
}

It's functional, but it's pretty ugly. And anytime I need to add/change a token, it's not easy. Any help would be greatly appreciated.

**pheres** · 03-04-2008

Is this an academic or a practical question?

If it's practical:

Have a look at boost tokenizer. It does the dirty work for you.
http://www.boost.org/libs/tokenizer/index.html

If you need more advanced parsing, you could try boost spirit
http://spirit.sourceforge.net/

**Mostly Harmless** · 03-04-2008

Originally Posted by pheres

Is this an academic or a practical question?

It's a little bit of each. It's academic in that I like to know how/why things are done (I'm constantly reinventing the wheel). It's practical in that my code is becoming impractical, and I need a better solution (that, and I generally find that my wheel isn't as good as whatever else is out there).

FWIW, this is for a silly project of mine. It's not a homework question.

**pheres** · 03-04-2008

You could introduce a layer of abstraction. Parsing regular languages is the fundamental job of finite state machines. You have to build one that either just accepts your language or acts according to the tokens it reads. Google will probably give you more references about parsing and FSMs than you want to read.
The code for your FSM may be messy as well, but it's hidden inside a class (or a class system with its own classes for actions, states, transitions and so on). Or you could make the FSM data-driven. For example, you could describe the needed states, transitions and actions (which are IDs of functions you register) in XML, and build just a loader and executor inside your app. But that would be a bit of overkill for a little project.
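To make the data-driven idea a bit more concrete, here is a minimal sketch in C++ that keeps the fixed-text tokens in a table instead of nested switch/if blocks. It only covers the example tokens from the first post, and every name in it (`Kind`, `Token`, `Rule`, `rules`, `tokenize`) is made up for illustration. It assumes the longest matching token wins (so `AZ` beats `A`), that unknown characters are skipped, and that an unterminated `:` literal simply ends the scan.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for the example grammar in the first post.
enum class Kind { A, AZ, At, Digit, Literal };

struct Token {
    Kind kind;
    std::string text;  // matched text; for Literal, the text between the colons
};

struct Rule {
    std::string lexeme;  // exact text of the token
    Kind kind;
};

// Fixed-text tokens live in data: adding a token means adding one row.
const std::vector<Rule>& rules() {
    static const std::vector<Rule> table = {
        {"A", Kind::A},
        {"AZ", Kind::AZ},
        {"@", Kind::At},
    };
    return table;
}

std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < input.size()) {
        const char c = input[i];

        // String literal: take everything up to the next colon verbatim.
        if (c == ':') {
            const std::size_t close = input.find(':', i + 1);
            if (close == std::string::npos) break;  // assumption: stop on unterminated literal
            out.push_back({Kind::Literal, input.substr(i + 1, close - i - 1)});
            i = close + 1;
            continue;
        }

        // Single-digit integer.
        if (c >= '0' && c <= '9') {
            out.push_back({Kind::Digit, std::string(1, c)});
            ++i;
            continue;
        }

        // Longest match over the table, so "AZ" is preferred over "A".
        const Rule* best = nullptr;
        for (const Rule& r : rules()) {
            const bool matches = input.compare(i, r.lexeme.size(), r.lexeme) == 0;
            if (matches && (!best || r.lexeme.size() > best->lexeme.size())) {
                best = &r;
            }
        }
        if (best) {
            out.push_back({best->kind, best->lexeme});
            i += best->lexeme.size();
        } else {
            ++i;  // not a valid token: ignore it
        }
    }
    return out;
}
```

Calling `tokenize("AAAZ:AZ:A1")` should give A, A, AZ, the literal "AZ", A, and the digit 1, in that order. With several dozen tokens the linear scan over the table is probably still fine; if it ever isn't, the table could be swapped for a trie without touching the calling code.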
