Thread: An oversimplified tokenizer

  1. #1
    Registered User
    Join Date
    May 2020
    Posts
    23

    An oversimplified tokenizer

    Everyone,

    Since I'm not a C expert, please criticize this program. I emailed it to a friend fluent in that language. He said "Nice!" and told me how to optimize the program slightly by replacing a call to the strlen function with the body of the "nonempty" function.

    The program mimics the strtok function without looking for characters that separate tokens.

    Thanks for your thoughts. I hope the program doesn't seem too amateurish. I'm fluent in ISO Standard Pascal, though I'd rather write in C.

    Code:
    #define MAX 80
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>
    
    bool token_may_contain(const int c)
    {
        return isalnum(c) || index("-'+", c) != NULL;
    }
    
    bool nonempty(const char some_string[])
    {
        return some_string[0] != '\0';
    }
    
    /* Return a token from the line if there's at least one. Otherwise, set the starting point to -1. */
    const char *a_token_from(const char line[], int * start)
    {
        static char  token[MAX + 1] = "";
        register int here = *start, token_length = 0, line_length = strlen(line);
    
        while (here < line_length && !token_may_contain(line[here]))
            here++;
        while (token_may_contain(line[here]))
            token[token_length++] = line[here++];
        if (nonempty(token))
            token[line_length] = '\0';
        *start = (here < line_length - 1) ? here : -1;
        return token;
    }
    
    void tokenize(const char line[])
    {
        register int counter = 0;
        int start = 0;
        char token[MAX + 1];
    
        while (start >= 0)
        {
            puts(a_token_from(line, &start));
            counter++;
        }
    }
    
    int main(void)
    {
        char line[MAX];
    
        while (fgets(line, MAX, stdin) != NULL)
        {
            puts("Please type the string to tokenize.");
            tokenize(line);
        }
    
        return 0;
    }
    Last edited by BillMcEnaney; 12-23-2023 at 02:27 AM.

  2. #2
    Registered User
    Join Date
    Dec 2017
    Posts
    1,664
    Here is some input/output for my first try of your program:
    Code:
    hello there this is a string
    Please type the string to tokenize.
    hello
    there
    thise
    isise
    asise
    string
    The first thing to note is that your prompt doesn't appear until after I enter the line of input!
    Even worse, it doesn't give the correct output.
    Remember to test your program!
    I'll let you try to fix those problems (both fixes are simple), but here are some criticisms of the code:

    The register keyword is pretty much pointless. The compiler will do the right thing without the "hint".

    index is a non-standard function, and you should include <strings.h> to use it. You should have received a warning about that (if you are using gcc, raise the warning level with -Wall -Wextra -Wpedantic). I would just use the commonly available function strchr, which is essentially identical. Also NULL is false while non-NULL is true, so you don't need the explicit comparison to NULL.
    Code:
    bool token_may_contain(int c)
    {
        return isalnum(c) || strchr("-'+", c);
    }
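    One caveat with the strchr version, sketched here under my own assumptions rather than taken from the posts above: strchr treats the terminating '\0' as part of the string it searches, so strchr("-'+", '\0') returns a non-NULL pointer and the function would report that '\0' may appear in a token. A scanning loop that relies on it could then run past the end of a line that has no trailing newline. The variant below guards against that, and it also casts the argument for isalnum, which expects a value representable as unsigned char (or EOF).

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* True if c may appear inside a token: a letter, a digit, or one of - ' + */
bool token_may_contain(int c)
{
    /* strchr counts the terminating '\0' as part of the searched string,
       so '\0' has to be rejected explicitly. */
    if (c == '\0')
        return false;

    /* A plain char holding a byte above 127 may be negative, so convert
       to unsigned char before calling isalnum. */
    return isalnum((unsigned char)c) || strchr("-'+", c) != NULL;
}
```

    With this guard, a loop such as while (token_may_contain(line[here])) stops at the end of the string even when the line was not terminated by a newline.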
    Your tokenize function doesn't use token. You should get a warning about that, too. You should pay attention to warnings and fix them. Also, counter has no point. Maybe you meant to return it?

    MAXLINE is a better name than MAX, and we would generally define it after the includes. A MAXLINE of 80 is pretty stingy.
    All truths are half-truths. - A.N. Whitehead

  3. #3
    Registered User
    Join Date
    May 2020
    Posts
    23
    John.c,

    Thank you for finding my mistakes.

    Your computer prompted you at the wrong time because I called fgets in the while condition when the call belonged in the loop's body instead. I'll delete token from the tokenize function and rename MAX to "MAXLINE".

    You're the second expert to tell me that I don't need to declare register variables. But I still declare them for two reasons. First, I may need the computer to use them if it runs an old C compiler. Second, I want my programs' meanings to be clear to anyone who can understand my code. After all, I'm almost obsessed with writing very readable programs.

    Oops, I thought index was a standard function. I'll replace it with strchr.

    Years ago, I stopped programming professionally, so only I care what programming languages I use. Since functional programming is my favorite kind of programming, someday I'll write only in Haskell and OCaml.

  4. #4
    Registered User
    Join Date
    Apr 2021
    Posts
    148
    You mention being fluent in Pascal, and perhaps that is where some of your style comes from. I have some suggestions unrelated to the actual subject of your question. Consider this snippet of your code:

    Code:
    /* Return a token from the line if there's at least one. Otherwise, set the starting point to -1. */
    const char *a_token_from(const char line[], int * start)
    {
    Remember that code is for humans.

    The declaration style here is very dense, and even with code highlighting (which rarely works in a useful way) it doesn't get much less dense.

    Here are some things that readers want from a declaration or definition:

    1. They want to be able to find the declaration or definition. "Take me to where this function is defined/declared."

    2. They want to know the order of the parameters.

    3. They want to know the return type.

    4. They want to know the name of the function.

    5. They want to know the exact type of one of the parameters.

    6. They want to know the expectations about the parameters.

    7. They want to know the behavior of the function.

    The style you are using is optimized for answering question #3, at the expense of pretty much everything else.

    I suggest you adopt a more readable style, with some objective rules that provide measurably better results:

    - Put the return type on a separate line.
    - Put any non-return-type specifiers on another separate line.
    - Use macros to replace keywords with semantic labels ("PRIVATE" vs. "static")
    - Put the function name at the leftmost column of the line (zero or one, depending on how you count), so that grep "^name" will find it.
    - Put parameters on separate lines.
    - Indent the parameters so they are distinct from the function name.

    Here's an example:
    Code:
    /* Return a token from the line if there's at least one. Otherwise, set the starting point to -1. */
    
    static
    const char *                       // #3
    a_token_from(                      // #1, #4
            const char line[],         // #2, #5
            int * start)
    {
    Notice this addresses #1..#5 nicely. It does not address #6 or #7, but you can do that by expanding your header comment, or adding extra structure to the parameter declarations, or both.
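    As a sketch of how an expanded header comment could cover #6 and #7, here is one possible version. The name next_token, its size_t-based signature, the caller-supplied buffer, and the is_token_char helper are my own assumptions for illustration, not the original poster's code. The body writes the terminator at the token's own length; writing it at the line's length instead is what leaves stale characters behind, as in the "thise" output in post #2.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true for letters, digits, and - ' + (never '\0'). */
static int
is_token_char(
        char c)
{
    return c != '\0'
        && (isalnum((unsigned char)c) || strchr("-'+", c) != NULL);
}

/*
 * next_token (hypothetical name and signature)
 *
 * Behavior:
 *   Copies the next token found in `line` at or after index `*start`
 *   into `token` and advances `*start` to just past that token.
 *
 * Expectations:
 *   line   NUL-terminated text to scan; never modified.
 *   start  in: index where scanning begins (must not exceed strlen(line)).
 *          out: index just past the token, or of the final '\0'.
 *   token  caller-supplied buffer that receives the NUL-terminated token.
 *   size   capacity of `token` in bytes, at least 2; longer tokens
 *          are truncated to size - 1 characters.
 *
 * Returns:
 *   The number of characters stored, or 0 when no token remains.
 */
size_t
next_token(
        const char line[],
        size_t *start,
        char token[],
        size_t size)
{
    size_t here = *start;
    size_t length = 0;

    /* Skip characters that cannot be part of a token. */
    while (line[here] != '\0' && !is_token_char(line[here]))
        here++;

    /* Copy the token, always leaving room for the terminator. */
    while (is_token_char(line[here]))
    {
        if (length + 1 < size)
            token[length++] = line[here];
        here++;
    }

    token[length] = '\0';   /* terminate at the token's length, not the line's */
    *start = here;
    return length;
}
```

    A caller would start with an index of 0 and keep calling next_token until it returns 0, printing token after each successful call.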

    Now, with that said, let's look for bugs! Oh, here's one:
    Code:
    int main(void)
    {
        char line[MAX];
    
        while (fgets(line, MAX, stdin) != NULL)
        {
            puts("Please type the string to tokenize.");
            tokenize(line);
        }
    
        return 0;
    }
    This reads as "get a line of input, then ask the user for input, then tokenize the input."

    What you probably want is "ask the user, then get the line, then tokenize".

    Being fluent in Pascal, you know about repeat ... until, but in C that is called do ... while. Worse, the loop depends on the input, which is only available in the middle of the function. So really it's something like loop ... if no input break; ... forever.

    There's no C (or Pascal, IIRC) syntax for that. We have to jumble up a bunch of control keywords, or "unroll" part of the loop in order to get what we want.

    Code:
        char line[MAX];
    
        puts("Please type the string to tokenize.");
        while (fgets(line, MAX, stdin) != NULL)
        {
            tokenize(line);
            puts("Please type the string to tokenize.");
        }
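    The other option mentioned above, combining control keywords, could look roughly like the sketch below. The names prompt_loop, handle_line, and MAXLINE are placeholders of mine, and handle_line only echoes the line where the real program would call tokenize. Passing the streams as parameters is also my assumption; it just makes the loop easy to try on a file.

```c
#include <stdio.h>

#define MAXLINE 256   /* assumed size; the thread's program uses MAX 80 */

/* Placeholder for the thread's tokenize(); it only echoes the line. */
void handle_line(const char *line, FILE *out)
{
    fputs(line, out);
}

/* Prompt, read, and handle lines until end of input.
   Returns the number of lines handled. */
int prompt_loop(FILE *in, FILE *out)
{
    char line[MAXLINE];
    int count = 0;

    for (;;)                                    /* loop ...                  */
    {
        fputs("Please type the string to tokenize: ", out);
        fflush(out);
        if (fgets(line, sizeof line, in) == NULL)
            break;                              /* ... if no input, break ... */
        handle_line(line, out);
        count++;
    }                                           /* ... forever               */

    return count;
}
```

    Calling prompt_loop(stdin, stdout) from main gives the interactive behavior, and the prompt appears only once in the source.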
    Observing that the prompt string never changes, and that you never use line except to pass it as a parameter, we can easily convert the prompt-and-get-input into a single function call (which you can rename to something shorter!):

    Code:
    char *
    prompt_and_read_line(
        const char *prompt)
    {
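        static char buffer[MAX];    /* static, so the returned pointer stays valid */
    
        fputs(prompt, stdout);
        fflush(stdout);             /* the prompt has no newline, so flush it explicitly */
        return fgets(buffer, sizeof (buffer), stdin);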
    }
    
    int
    main(void)
    {
        const char *msg = "Please type the string to tokenize: ";
        char * line;
    
        while ((line = prompt_and_read_line(msg)) != NULL)
            tokenize(line);
    }

  5. #5
    Registered User
    Join Date
    May 2020
    Posts
    23
    Aghast,

    Thank you for your detailed reply, especially for how to improve readability. Since my computer indented the program with GNU indent, I hope that program can make my machine do what you suggest. Sadly, my vision keeps me searching for the best indentation style for Pascal.

    But I'm happy to agree with you on all but one point. Some professors taught me that a function should do only one job. So I hesitate to coin a function name with "and" in it.
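    One way to keep each function to a single job might look like the rough sketch below. The names show_prompt and read_line, their parameters, and the choice to strip the trailing newline are assumptions for illustration; none of them come from the posts above.

```c
#include <stdio.h>
#include <string.h>

/* Job 1: display the prompt, flushed because it has no newline. */
void show_prompt(const char *prompt, FILE *out)
{
    fputs(prompt, out);
    fflush(out);
}

/* Job 2: read one line into buffer and strip the trailing newline.
   Returns buffer, or NULL at end of input. */
char *read_line(char buffer[], size_t size, FILE *in)
{
    if (fgets(buffer, (int)size, in) == NULL)
        return NULL;
    buffer[strcspn(buffer, "\n")] = '\0';
    return buffer;
}
```

    The caller then decides the order: show_prompt first, read_line second, and tokenize only if read_line did not return NULL.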

    Maybe you wonder why I don't say GNU indent indented my program. Here's my reason. A program is like a recipe. It tells the computer what to do, how to do it and what to do it to. Like a recipe, a program needs someone or something to obey it. So, saying that a program tokenizes a string is like telling me that a recipe bakes a pie. The program controls the computer. But the computer does the work.

    Here's another point you might want to think about. The term "artificial intelligence" misleads people because computers merely simulate intelligence. At least my friend Dr. Michael Covington would agree with me on that point. Before retiring a few years ago, he ran the AI lab at the University of Georgia.

    Don't worry. I'm not a clone of John Searle, the philosopher who argued that computers only manipulate symbols without attaching any meaning to them. Now you know what nonsense you'll hear from a computer science tutor with a philosophy degree. Here ends the digression.

  6. #6
    Registered User
    Join Date
    May 2020
    Posts
    23
    By the way, everyone, I put the program on my floor and squashed the bugs with my wheelchair.
