C program tokenizer.

**Mr.Lnx** · 11-28-2013

Hello to all. I have written a program that tokenizes a whole line. The basic idea is this: suppose the input is ***HELLO%%SIR.

Then you have 8 tokens (2 words + 6 symbols). My algorithm is the following:

BEGIN LOOP

    IF CHARACTER IS NOT AN ALPHABETIC LETTER
        INCREASE THE COUNTER OF ITEMS ONE TIME
        INCREASE THE COUNTER OF ARRAY ONE TIME
    OTHERWISE IF CHARACTER IS AN ALPHABETIC LETTER
        INCREASE THE COUNTER OF WORDS ONE TIME
        INCREASE THE COUNTER OF ARRAY BY THE LENGTH OF THE WORD
    END IF

    IF CHARACTER IS THE LAST
        BREAK FROM THE LOOP
    END IF

END LOOP

PRINT ITEMS_COUNTER + WORD_COUNTER

And here is the implementation :

[C] C Words Tokenizer - Pastebin.com
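The Pastebin code is not reproduced in the thread, so the following is only a minimal sketch of the pseudocode above, not the posted implementation. The function name count_tokens and its parameters are hypothetical. It assumes that every non-letter (digits and spaces included) counts as one symbol and that a trailing newline left by fgets is not a token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts tokens in a line.
 * Each run of letters is one word; every other character is one symbol.
 * Scanning stops at the end of the string or at a newline. */
size_t count_tokens(const char *line, size_t *words, size_t *symbols)
{
    size_t w = 0, s = 0, i = 0;

    while (line[i] != '\0' && line[i] != '\n') {
        if (isalpha((unsigned char)line[i])) {
            w++;
            /* advance the array index by the length of the word */
            while (isalpha((unsigned char)line[i]))
                i++;
        } else {
            s++;
            i++;
        }
    }

    /* the per-kind counts are optional outputs */
    if (words)
        *words = w;
    if (symbols)
        *symbols = s;
    return w + s;
}
```

For the input ***HELLO%%SIR. this reports 2 words and 6 symbols, 8 tokens in total. The casts to unsigned char matter because passing a negative char value to isalpha is undefined behavior.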

First of all, this is not an exercise from university. It is my own idea. What is your opinion about this exercise? Is it useful? Must tokens be only the words between delimiters? I am analyzing the whole line.

Secondly, I want to print the symbols and the words. I don't want to use strtok for this, because I think the function line_tokenizer would not be reusable: strtok modifies the line by writing a '\0' after the word each time it is called. Is there any other way to get this OUTPUT:

Code:

 

Give the sentence: ***HELLO%%SIR.

          Analyzing...

The line has 8 token(s).

Symbols : ***%%.
Words : HELLO  SIR

Any other suggestions for the program or the algorithm are welcome.

Thank you in advance
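One possible way to get that output without strtok is to walk the line once and copy characters into two separate buffers, leaving the input untouched. This is a sketch under assumptions, not the poster's line_tokenizer: the name split_line and the two output buffers are hypothetical, and words are separated by a single space rather than the two shown in the sample output.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies every non-letter of `line` into `symbols`
 * and every run of letters into `words`, with one space between words.
 * The input is only read, never modified (unlike strtok).
 * Each output buffer must hold at least strlen(line) + 1 bytes. */
void split_line(const char *line, char *symbols, char *words)
{
    size_t si = 0, wi = 0;
    const char *p = line;

    while (*p != '\0' && *p != '\n') {
        if (isalpha((unsigned char)*p)) {
            if (wi > 0)
                words[wi++] = ' ';      /* separator before 2nd, 3rd, ... word */
            while (isalpha((unsigned char)*p))
                words[wi++] = *p++;     /* copy the whole word */
        } else {
            symbols[si++] = *p++;       /* one symbol per non-letter */
        }
    }
    symbols[si] = '\0';
    words[wi] = '\0';
}
```

With ***HELLO%%SIR. as input, symbols ends up as ***%%. and words as HELLO SIR. Because the line is taken as const char *, the same string can be passed to other functions afterwards, which is the reusability concern raised about strtok.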

**Adak** · 11-28-2013

What about numbers (digits)?

**Mr.Lnx** · 11-28-2013

Originally Posted by Adak

What about numbers (digits)?

So far the program treats digits as symbols.

**Adak** · 11-28-2013

So the band Blink123 becomes just "Blink", and 3 symbols??

How about this: if the number is a prefix or suffix of a word, then it's part of the name, else it's a symbol?
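A rough sketch of that rule, assuming a "word" is any run of letters and digits that contains at least one letter, while a run made only of digits stays a sequence of one-character symbols. The name count_tokens_alnum is hypothetical, and digits in the middle of a word (such as a1b) are also treated as part of the word, which goes slightly beyond the prefix/suffix case.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: digits attached to letters belong to the word;
 * digits standing alone are counted as one symbol each. */
size_t count_tokens_alnum(const char *line, size_t *words, size_t *symbols)
{
    size_t w = 0, s = 0, i = 0;

    while (line[i] != '\0' && line[i] != '\n') {
        if (isalnum((unsigned char)line[i])) {
            size_t len = 0;
            int has_letter = 0;

            /* measure the alphanumeric run and note whether it has a letter */
            while (isalnum((unsigned char)line[i + len])) {
                if (isalpha((unsigned char)line[i + len]))
                    has_letter = 1;
                len++;
            }
            if (has_letter)
                w++;            /* e.g. Blink123 or 123Blink: one word */
            else
                s += len;       /* digits only: each digit is a symbol */
            i += len;
        } else {
            s++;
            i++;
        }
    }

    if (words)
        *words = w;
    if (symbols)
        *symbols = s;
    return w + s;
}
```

Under this rule Blink123 and 123Blink both count as a single word, while a bare 123 still counts as 3 symbols.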

**Mr.Lnx** · 11-28-2013

Originally Posted by Adak

So the band Blink123 becomes just "Blink", and 3 symbols??

How about this: if the number is a prefix or suffix of a word, then it's part of the name, else it's a symbol?

Hmmm, good test. I didn't know about this input issue. According to the documentation, 123Blink and Blink123 must give the same results. I will see what goes wrong with this. :/

Maybe I should fix the program so that it handles numbers too. For example:

Code:

 Hello_How_are_you?123 

OUTPUT : 11 tokens, or 9 if the whole 123 counts as one token

Do you think this exercise is useful enough to continue with?
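The two counts can be compared with a small sketch in which a flag selects the policy. The name count_tokens_digits and the group_digits parameter are hypothetical: with the flag set to 0 every digit is its own symbol, with a nonzero flag a run of digits is a single token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts tokens, optionally treating a run of
 * digits as a single token instead of one symbol per digit. */
size_t count_tokens_digits(const char *line, int group_digits)
{
    size_t tokens = 0, i = 0;

    while (line[i] != '\0' && line[i] != '\n') {
        unsigned char c = (unsigned char)line[i];

        if (isalpha(c)) {
            /* skip the whole word */
            while (isalpha((unsigned char)line[i]))
                i++;
        } else if (group_digits && isdigit(c)) {
            /* skip the whole number */
            while (isdigit((unsigned char)line[i]))
                i++;
        } else {
            /* any other character is a one-character symbol */
            i++;
        }
        tokens++;   /* exactly one token was consumed above */
    }
    return tokens;
}
```

For Hello_How_are_you?123 this gives 11 with group_digits set to 0 and 9 with it set to 1, matching the two counts in the post.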

**Adak** · 11-28-2013

Any little problem or puzzle that you find interesting and challenging should at least be given a try. There isn't enough time to go through things that don't interest or challenge you.

Make sure the "juice" is worth the "squeeze", but you need to "squeeze" things, or you'll never develop adequate hand strength in programming.
