Thread: String Tokenizer Help

  1. #1
    Registered User
    Join Date
    Jan 2013
    Posts
    3

    String Tokenizer Help

    It's due for a bit, but a little while ago I was assigned a String Tokenizer assignment. Here is what the assignment asks of me:

    Tokenizer (Name: tokenizer.c, tokenizer.h, tokenizerTester.c)

    Using tokenizer.h and tokenizerTester.c as a guide, write two functions startToken and getNextToken that are used to process tokens from a given input file.

    The startToken() function takes as input a single line of text and does any internal workings needed to begin tokenizing it. Each call to getNextToken() returns the next token (as a string) in the line. If the line is complete (there are no more tokens) EOL is returned.

    Tokens are defined as follows:
    • A token is a collection of continuous non-whitespace characters.
    • Whitespace is defined as space (' '), tab ('\t') or newline ('\n').
    Everything else is considered non-whitespace.
    • String Tokens start with a double (") or single (') quote then continue to the matching double or single quote or the end of the line.

    If the end of the line is reached (in a string) before the matching quote is found (an unterminated string), an error should be returned. The Token returned should not include the quotes. The ending quote is considered a delimiter (an end to the token). Thus,

    Hi "there how" are you
    Has 4 tokens: Hi, there how, are, you

    Hi "there how"are you
    Also has 4 tokens: Hi, there how, are, you

    Hi there"how are"you
    Has 3 tokens (odd!!!)
    Hi, there"how, are"you

    Because quotes only mean something at the START of a token!!!
    We take this approach for SIMPLICITY really!

    • The return type is a struct containing two values, the start of the string and the type of token (or eol
    or error).

    NOTE: This is a very very simplistic tokenizer but will do for our basic needs for now. Your code should run with the provided test code: tokenizerTester.c (leave this file untouched). Included in the zip file are some sample input and output pairings. Thus, running: tokenizerTester < tokenInput.0 should output exactly like: tokenOutput.0. Be sure to submit the unmodified tokenizer.h and tokenizerTester.c files as well.
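
    As a rough illustration of the token rules above, here is a sketch that only counts the tokens in a line. The names isBlank and countTokens are hypothetical and not part of the assignment; the sketch assumes a line holds at most one newline, at its end, as fgets would deliver it.

```c
#include <stddef.h>

/* Whitespace as the assignment defines it: space, tab or newline. */
static int isBlank(char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/* Hypothetical helper: count the tokens in line.
 * Returns -1 if a quoted string is never closed. */
int countTokens(const char *line)
{
    int count = 0;
    size_t i = 0;

    while (line[i] != '\0') {
        if (isBlank(line[i])) {            /* separators between tokens */
            i++;
            continue;
        }
        if (line[i] == '"' || line[i] == '\'') {
            char quote = line[i++];        /* a quote at the START opens a string */
            while (line[i] != '\0' && line[i] != '\n' && line[i] != quote)
                i++;
            if (line[i] != quote)
                return -1;                 /* end of line before the matching quote */
            i++;                           /* the closing quote ends the token */
        } else {
            while (line[i] != '\0' && !isBlank(line[i]))
                i++;                       /* quotes inside a token are ordinary */
        }
        count++;
    }
    return count;
}
```

    With this reading of the rules, the three example lines give 4, 4 and 3 tokens respectively.
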
    I understand what the assignment is asking for and what the program should do, but I have very little idea of how to implement it. I've made a handful of attempts, each giving different results. I've started over more times than I can count, and now I'm beyond frustrated. We're given a tester file and a header file, and asked to write the tokenizer.c file.

    tokenizer.h :
    Code:
    /******
     * Christian Duncan
     * Tokenizer:
     *    This collection of functions takes as input a single line of text
     *    and tokenizes that text.  Each call to getNextToken() returns the
     *    next token (as a string) in the line.  If the line is complete EOL
     *    is returned.
     *
     * Tokens are defined as follows:
     *    A collection of continuous non-whitespace characters.
     *
     *    Whitespace:
     *       Is defined as space (' '), tab ('\t') or newline ('\n')
     *       Everything else is considered non-whitespace.
     *
     *    String Tokens:
     *       If the token starts with a double (") or single (') quote then
     *       the token continues to the matching double or single quote or
     *       the end of the line.
     *       If the end of the line is reached (in a string) before the matching
     *       quote is found (unterminated string), an error should be returned.
     *       The Token returned should not include the quotes.
     *       The ending quote is considered a delimiter (an end to the token)
     *       Thus,
     *          Hi "there how" are you
     *             Has 4 tokens: Hi, there how, are, you
     *          Hi "there how"are you
     *             Also has 4 tokens: Hi, there how, are, you
     *          Hi there"how are"you
     *             Has 3 tokens (odd!!!)
     *                Hi, there"how, are"you
     *             Because quotes only mean something at START of token!!!
     *             We take this approach for SIMPLICITY really!
     *       
     * NOTE: 
     *    This is a very very simplistic tokenizer but will do for our basic needs
     *    for now.
     *
     *******/
    
    #ifndef __TOKENIZER_H
    #define __TOKENIZER_H
    /***
     * A token: storing start of the token string
     *  and the type of the token.
     ***/
    typedef struct {
      char *start;
      enum { BASIC, SINGLE_QUOTE, DOUBLE_QUOTE, EOL, ERROR } type;
    } aToken;
    
    /***
     * startToken:
     *    Register the start of a new line to tokenize.
     *    The previous line (if still present) gets ignored.
     *    An error is printed if the line is NULL (but treated as an empty line)
     *
     *    line: A pointer to the start of the null-terminated string for this line.
     *          The string gets stored in a local copy so the string line can change
     *          without affecting the tokenizer.  Also, line is not altered in any way.
     ***/
    void startToken(char *line);
    
    /***
     * getNextToken:
     *    Return the next token in the current line as a struct (aToken).
     *    String tokens are handled as described above.
     *
     *    Returns aToken.type of: 
     *      EOL: If end-of-line reached
     *      ERROR: If some error occurred (namely, unterminated string)
     *      BASIC: If token is a regular token
     *      SINGLE_QUOTE: If token is 'single quoted string'
     *      DOUBLE_QUOTE: If token is "double quoted string"
     *
     *    Returns aToken.start:
     *      If not EOL or ERROR, then start points to start of the string
     *      (and string is null-terminated)
     *
     **********************************************
     *    WARNING: This start string is ONLY temporary.  A subsequent call to 
     *      getNextToken/startToken will possibly erase it.  So caller MUST
     *      make a local copy if further use is needed!
     **********************************************
     ***/
    aToken getNextToken();
    
    #endif  /* __TOKENIZER_H */
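
    The WARNING in the header means a caller has to copy the token text before asking for the next token. A minimal sketch of such a copy, assuming a hypothetical helper named copyTokenText (it is not declared in tokenizer.h):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a token's text so it survives later
 * calls to getNextToken/startToken. The caller must free the result.
 * Returns NULL if the allocation fails. */
char *copyTokenText(const char *start)
{
    size_t n = strlen(start) + 1;   /* + 1 for the terminating '\0' */
    char *copy = malloc(n);

    if (copy != NULL)
        memcpy(copy, start, n);     /* copies the terminator as well */
    return copy;
}
```

    Note the size computation: strlen(start) + 1 bytes are needed, one more than the visible characters.
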
    tokenizerTester.c :
    Code:
    /*******
     * Christian Duncan
     *
     * TokenizerTester
     *   This program simply reads in a bunch of lines from stdin
     *   and calls the tokenizer functions to print out each token
     *   Lines cannot be longer than MAX_LINE_LENGTH
     ********/
    
    #include "tokenizer.h"
    #include <stdio.h>
    
    #define MAX_LINE_LENGTH 200
    
    int main(int argc, char* argv[]) {
      char line[MAX_LINE_LENGTH+1];
    
      while (fgets(line, MAX_LINE_LENGTH+1, stdin) != NULL) {
        // We have our current line
        startToken(line);
        aToken answer;
        answer = getNextToken();
        while (answer.type != EOL) {
          switch (answer.type) {
          case ERROR:
            // Error (for some reason)
            printf("Token: ERROR\n");
            break;
          case BASIC:
            // Regular token
            printf("Token:  BASIC: %s\n", answer.start);
            break;
          case SINGLE_QUOTE:
            printf("Token: SINGLE: %s\n", answer.start);
            break;
          case DOUBLE_QUOTE:
            printf("Token: DOUBLE: %s\n", answer.start);
            break;
          default:
            fprintf(stderr, "Programming Error: Unrecognized type returned!!!\n");
          }
          answer = getNextToken();
        }
      }
    
      // Everything ran smoothly
      return 0;
    }
    Those are not supposed to be modified. However, the tokenizer.c file is up to us to write, and it should only contain two functions (at least, that's how the teacher makes it sound). This is what I have in tokenizer.c :
    Code:
    #include "tokenizer.h"
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    
    char *token;
    int startOfToken = 0;
    
    void startToken(char *line)
    {
        token = malloc(sizeof(char) * strlen(line) + 1);
        int i;
        for (i = 0; i < strlen(line); i++)
        {
            token[i] = line[i];
        }
        token[strlen(line)] = '\0';
    }
    
    aToken getNextToken()
    {
        aToken t;
        t.start = malloc(sizeof(char) * strlen(token) - 1);
        int i;
        if (token[startOfToken] == '\"')
            t.type = DOUBLE_QUOTE;
        else if (token[startOfToken] == '\'')
            t.type = SINGLE_QUOTE;
        else
            t.type = BASIC;
        if (t.type == DOUBLE_QUOTE)
        {
            for (i = startOfToken + 1; i < strlen(token); i++)
            {
                if (token[i] == '\"')
                {
                    t.start[i - startOfToken] = '\0';
                    startOfToken = i + 1;
                    return t;
                }
                else if (token[i] == '\0' || token[i] == '\n')
                {
                    t.type = ERROR;
                    return t;
                }
                else
                {
                    t.start[i - startOfToken] = token[i];
                }
            }
        }
        else if (t.type == SINGLE_QUOTE)
        {
            for (i = startOfToken + 1; i < strlen(token); i++)
            {
                if (token[i] == '\'')
                {
                    t.start[i - startOfToken] = '\0';
                    startOfToken = i + 1;
                    return t;
                }
                else if (token[i] == '\0' || token[i] == '\n')
                {
                    t.type = ERROR;
                    return t;
                }
                else
                {
                    t.start[i - startOfToken] = token[i];
                }
            }
        }
        else
        {
            for (i = startOfToken; i < strlen(token); i++)
            {
                if (token[i] == ' ' || token[i] == '\n' || token[i] == '\t')
                {
                    t.start[i - startOfToken] = '\0';
                    startOfToken = i + 1;
                    return t;
                }
                else if (token[i] == '\0')
                {
                    return t;
                }
                else
                {
                    t.start[i - startOfToken] = token[i];
                }
            }
        }
    
    }
    The input file looks like this:
    A very simple test
    Let us try regular tokens first.
    And then a few with string quotes.
    After, we shall have a different file for more trickier
    tests.
    Bill once said, "There is nothing better I like than p, b, & j sandwiches for lunch."
    But really? Is there "nothing better?"
    Not even a 'veggie quesadilla?'
    A lot of what he is asking confuses me greatly, and I'm not entirely sure how to proceed. The code does return something, but when I compile it (with gcc) and run it, I get an infinite loop that prints the same gibberish token over and over:
    Token: BASIC: 1�I��^H��H���PTI��0@
    Token: BASIC: 1�I��^H��H���PTI��0@
    Token: BASIC: 1�I��^H��H���PTI��0@
    I don't know where it's getting this from. I don't like working with C very much, nor am I extremely well versed in it, but I need some guidance in order to finish the assignment soon. I need to know what I'm doing wrong and how to fix it. I've been at this assignment for close to two weeks now, and I'm getting a bit tired of it.

    EDIT: It is my understanding that the first function works fine (at least that's what the TA told me). It's the second one that has all kinds of issues.
    Last edited by KuuKuu; 01-21-2013 at 02:43 PM.

  2. #2
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,656
    Well you have this
    while (answer.type != EOL)

    But no path through getNextToken() ever sets the EOL state.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
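
    To illustrate the point about EOL: the tester's loop can only stop if some path reports EOL once the rest of the line is empty or whitespace. A minimal sketch of just that decision, assuming a hypothetical helper classifyAt that takes the line and a position (neither is part of the assignment's interface):

```c
#include <stddef.h>

/* Copied from tokenizer.h so the sketch compiles on its own. */
typedef struct {
  char *start;
  enum { BASIC, SINGLE_QUOTE, DOUBLE_QUOTE, EOL, ERROR } type;
} aToken;

/* Hypothetical helper: skip whitespace from *pos, then report what kind
 * of token begins there. Extracting the token text is left out. */
aToken classifyAt(const char *line, size_t *pos)
{
    aToken t;
    t.start = NULL;

    while (line[*pos] == ' ' || line[*pos] == '\t' || line[*pos] == '\n')
        (*pos)++;

    if (line[*pos] == '\0')
        t.type = EOL;            /* nothing left: this ends the tester's loop */
    else if (line[*pos] == '"')
        t.type = DOUBLE_QUOTE;
    else if (line[*pos] == '\'')
        t.type = SINGLE_QUOTE;
    else
        t.type = BASIC;
    return t;
}
```

    Skipping whitespace before the check matters: a line that ends in a newline, as every fgets line does, then reports EOL instead of an empty token.
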

  3. #3
    Registered User
    Join Date
    Jan 2013
    Posts
    3
    Code:
    #include "tokenizer.h"
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    
    char *token;
    int startOfToken = 0;
    
    void startToken(char *line)
    {
        token = malloc(sizeof(char) * strlen(line) + 1);
        int i;
        for (i = 0; i < strlen(line); i++)
        {
            token[i] = line[i];
        }
        token[strlen(line)] = '\0';
    }
    
    aToken getNextToken()
    {
        aToken t;
        t.start = malloc(sizeof(char) * strlen(token) - 1);
        int i;
        if (token[startOfToken] == '\"')
            t.type = DOUBLE_QUOTE;
        else if (token[startOfToken] == '\'')
            t.type = SINGLE_QUOTE;
        else if (token[startOfToken] == '\0' || token[startOfToken] == '\n')
            t.type = EOL;
        else
            t.type = BASIC;
        if (t.type == DOUBLE_QUOTE)
        {
            for (i = startOfToken + 1; i < strlen(token); i++)
            {
                if (token[i] == '\"')
                {
                    t.start[i - startOfToken] = '\0';
                    startOfToken = i + 1;
                    return t;
                }
                else if (token[i] == '\0' || token[i] == '\n')
                {
                    t.type = ERROR;
                    return t;
                }
                else
                {
                    t.start[i - startOfToken] = token[i];
                }
            }
        }
        else if (t.type == SINGLE_QUOTE)
        {
            for (i = startOfToken + 1; i < strlen(token); i++)
            {
                if (token[i] == '\'')
                {
                    t.start[i - startOfToken] = '\0';
                    startOfToken = i + 1;
                    return t;
                }
                else if (token[i] == '\0' || token[i] == '\n')
                {
                    t.type = ERROR;
                    return t;
                }
                else
                {
                    t.start[i - startOfToken] = token[i];
                }
            }
        }
        else
        {
            for (i = startOfToken; i < strlen(token); i++)
            {
                if (token[i] == ' ' || token[i] == '\n' || token[i] == '\t')
                {
                    t.start[i - startOfToken] = '\0';
                    startOfToken = i + 1;
                    return t;
                }
                else if (token[i] == '\0')
                {
                    return t;
                }
                else
                {
                    t.start[i - startOfToken] = token[i];
                }
            }
        }
    
    }
    I added two lines that set the EOL type when the current character is a null or newline character. However, the results came out identical, and I'm not really sure where to set it, or what exactly I'm supposed to do there, especially since it didn't seem to make a difference. :/

  4. #4
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,656
    foo.c:232:1: warning: control reaches end of non-void function [-Wreturn-type]

    > for (i = startOfToken; i < strlen(token); i++)
    If this is false the first time you reach it, the body of the loop never executes, and you return garbage when you fall off the end of the function.
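
    A side effect of the loop condition i < strlen(token) is that token[i] can never be '\0' inside the loop body, so a branch testing for '\0' there is unreachable and the end-of-string case has to be handled after the loop. A small sketch of a scan that returns on every path, using a hypothetical helper basicTokenEnd:

```c
#include <string.h>

/* Hypothetical helper: return the index just past the basic token that
 * starts at from (the index of the separator or of the end of s). */
size_t basicTokenEnd(const char *s, size_t from)
{
    size_t len = strlen(s);      /* measured once, not on every iteration */
    size_t i;

    for (i = from; i < len; i++) {
        if (s[i] == ' ' || s[i] == '\t' || s[i] == '\n')
            return i;            /* stopped at a separator */
    }
    /* Reached when the loop body never ran or ran out of characters.
     * Without this return the function would fall off its end. */
    return i;
}
```

    Compiling with gcc -Wall reports this kind of missing return as the warning quoted above.
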

  5. #5
    Registered User
    Join Date
    Jan 2013
    Posts
    3
    Alright, that clearly made some difference. I added a "return t;" statement at the end of the else block, and it no longer prints out garbage. It does, however, print "Token:  BASIC: " on new lines endlessly. I'm not entirely sure how to get it to stop that.
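
    One likely cause, judging from the code in post #3: startOfToken is never reset in startToken, so it keeps its value from the previous line. On a later, shorter line it can point past the end of the new copy; the loop body then never runs, the index never advances, and the same uninitialized buffer is returned forever. A minimal sketch of a startToken that resets its state, assuming a fixed-size static buffer instead of malloc; MAX_COPY, consume and unread are hypothetical names added only for illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAX_COPY 200                  /* assumption: matches MAX_LINE_LENGTH in the tester */

static char lineCopy[MAX_COPY + 1];   /* private copy of the current line */
static size_t cursor = 0;             /* index of the next unread character */

void startToken(char *line)
{
    const char *src = line;

    if (src == NULL) {
        fprintf(stderr, "startToken: NULL line, treating it as empty\n");
        src = "";
    }
    strncpy(lineCopy, src, MAX_COPY);
    lineCopy[MAX_COPY] = '\0';        /* strncpy does not terminate on truncation */
    cursor = 0;                       /* every new line starts from index 0 */
}

/* Hypothetical helpers, present only so the reset can be observed. */
void consume(size_t n)
{
    size_t left = strlen(lineCopy + cursor);
    cursor += (n < left) ? n : left;  /* never step past the terminator */
}

const char *unread(void)
{
    return lineCopy + cursor;
}
```

    A static buffer also avoids leaking one malloc'd copy per line, which the posted startToken does because it never frees the previous copy.
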

  6. #6
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,656
    Well you could run the code in the debugger, then start single-stepping through the code.
