Thread: Critique my StringTokenizer please...

  1. #1
    Registered User
    Join Date
    Jun 2004
    Posts
    124

    Critique my StringTokenizer please...

    Hi all, this is a string tokenizer that I wrote in C++ quite a while ago, I wrote it because I hate strtok, I hate strtok because it requires me to make multiple calls to get all the tokens. The point of this function is that it returns an array of strings each containing a token. The code is fully documented, what i'm looking for here is basically a code critique, if anyone can spot any flaws or possible memory leaks I would love to hear about them, hope someone can help me out, thanks all!

    Code:
    char** StringTokenizer(const char* pString, const char* pDelimiters, int& nTokenCount)
    {
    	char** pSuperString;	//the array of strings that holds everything
    	int nInputLength = StringLength(pString);	//get length of string to be tokenized
    	int nNumDelimeters = StringLength(pDelimiters);	//get how many delimiters there are
    	int nSuperString = 0;	//used as a counter
    	int nString = 0;	//used as a counter
    	char* pMyInput = new char[nInputLength];	//just a copy of pString so as not to modify original string
    	
    	StringCopy(pMyInput, pString);
    	
    	for (int nInput = 0; nInput < nInputLength; nInput++)	//insert spaces where the tokens are (acts as a placeholder)
    	{
    		for (int nToken = 0; nToken < nNumDelimeters; nToken++)	
    		{
    			if (pMyInput[nInput] == pDelimiters[nToken]) 
    			{ 
    				pMyInput[nInput] = ' ';
    			}
    		}
    		nToken = 0;
    	}
    
    	for (int nCount = 0; nCount < nInputLength; nCount++)	//count number of tokens (spaces)
    	{
    		if (pMyInput[nCount] == ' ' && pMyInput[nCount+1] != ' ') 
    		{ 
    			nTokenCount++;
    		}
    	}
    	
    	nTokenCount++;
    	pSuperString = new char*[nTokenCount];	//number of tokens (+1) = number of strings
    
    	for (int nSString = 0; nSString < nTokenCount; nSString++)	
    	{
    		pSuperString[nSString] = new char[50];	//go through an assign memory to each string within the superstring
    		StringCopy(pSuperString[nSString], "");
    	}
    
    	for (int nCounter = 0; nCounter < nInputLength; nCounter++)
    	{
    		if (pMyInput[nCounter] != ' ')
    		{
    			StringCopy(pSuperString[nSuperString][nString], pMyInput[nCounter]);	//copy substrings into the superstring
    			nString++;
    		}
    		else if (pMyInput[nCounter+1] != ' ')
    		{
    			nSuperString++;
    			nString = 0;
    		}
    	}
    	
    	return pSuperString;
    }

  2. #2
    Its not rocket science vasanth's Avatar
    Join Date
    Jan 2002
    Posts
    1,683
    ok

    Code:
         nToken = 0;
    Is out of scope....


    Code:
         nTokenCount++;
    what is it initilized to.. What happens if the user passes this after initilizin this to say 1000 it will screw up the below part

    Code:
         	for (int nSString = 0; nSString < nTokenCount; nSString++)


    and the way it tokenizes dosent seem right... if the input it
    "how are you" the output is
    "how are you"
    "are you"
    "you"

    but maybe thats what you wanted...


    Code:
         	char* pMyInput = new char[nInputLength];
    what happens to this.. where is this memory freed... and this cannot be freed outside the function because there is no visibility from outside...

    and where is your custom StringCopy function...
    I am sure that might have some memory leaks...

    andwhen you allocate memory using new.. what happens if the system run of memory and returns NULL instead...

    and what happens if I do not want to tokenize the string based on space but on tab... your function delibrately introduces space and considers all spaces as delimeters

    Code:
       for (int nCount = 0; nCount < nInputLength; nCount++)	//count number of tokens (spaces)
       	{
       		if (pMyInput[nCount] == ' ' && pMyInput[nCount+1] != ' ')
    now you are checking nCount+1 which at the end of the for loop will be nInputLength so you are accessing
    pMyInput[nCount+1] which at the end is actually pMyInput[nInputLength] whose memory space does not belong to your function... this might cause the program to crash in the worst case...


    Code:
      	nTokenCount++;
     	pSuperString = new char*[nTokenCount];	//number of tokens (+1) = number of strings
    what happens if the string contained no delimeter character to tokenize... you still allocate memory making it inefficient..

    and not to talk about the inefficiency of the algorithm.. we will keep that for another session after you correct your code
    Last edited by vasanth; 11-24-2004 at 04:48 PM.

  3. #3
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    Just a thought: Since this is C++, why not have it return a std::vector<std::string>? Or, depending on what you need, you could have it return a std::list or std::deque or whatever (make it a template function).

    I haven't tested this so I don't know if it works, but:
    Code:
    template <class T>
    T<std::string> tokenize(char* str, const char* delims)
    {
       T<std::string> container;
       char* tok = strtok(str, delims);
    
       while(tok != NULL)
       {
          container.push_back(tok);
          tok = strtok(NULL, delims);
       }
       return container;
    }
    Or, of course, you could modify it to do the tokenizing manually rather than just wrapping around the strtok() function.
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  4. #4
    Registered User
    Join Date
    Jun 2004
    Posts
    124
    You're right, the first line is not needed.
    The second problem you mentioned, the token count variable is just used as a reference so that the caller of the function can get the number of tokens.
    As for the spaces thing, I think you misunderstand. What the tokenizer tokenizes around is passed in as a variable called const char* pDelimiters, the spaces are used in place of the delimeters so that it's easier to process further down the line, you can tokenize on whatever you want, you can even tokenize on multiple tokens if you pass in more than one, for instance you could pass in "at" as a token, and the function would tokenize on 'a' and on 't'.
    If the function is invoked without any delimters....well....why was it invoked in the first place?

  5. #5
    Registered User
    Join Date
    Jun 2004
    Posts
    124

    StringCopy and StringLength

    Here are my StringCopy and StringLength functions which are used in my tokenizer

    Code:
    void StringCopy(char* pDestination, const char* pSource)
    {
    	int nCounter = 0;
    	int nNewLength = StringLength(pSource);
    		
    	while (nCounter < nNewLength)
    	{
    		pDestination[nCounter] = pSource[nCounter];
    		nCounter++;
    	}
    
    	pDestination[nCounter] = '\0';
    }
    Code:
    int StringLength(const char* pString)
    {
    	int nCounter = 0;
    	while (pString[nCounter] != NULL) { nCounter++;	}
    	return nCounter;
    }

  6. #6
    Its not rocket science vasanth's Avatar
    Join Date
    Jan 2002
    Posts
    1,683
    how about if i pass the delimeter as @ and the string as
    "hello my mail is is [email protected]"

    your function takes those spaces are delim as well thought it was not in the list of delimeters i sent..

    and regarding tokencount
    Code:
      nTokenCount++;
     	pSuperString = new char*[nTokenCount];	//number of tokens (+1) = number of strings
     
     	for (int nSString = 0; nSString < nTokenCount; nSString++)	
     	{
     		pSuperString[nSString] = new char[50];	//go through an assign memory to each string within the superstring
     		StringCopy(pSuperString[nSString], "");
     	}
    Well youshould initilize it to 0 at the begining of the function.... when you say

    Code:
       nTokenCount++;
    what is it incrementing from.... if the use had initilized the variable to 1000 before passing it to your function then
    Code:
      for (int nSString = 0; nSString < nTokenCount; nSString++)	
     	{
    		pSuperString[nSString] = new char[50];	//go through an assign memory to each string within the superstring
     		StringCopy(pSuperString[nSString], "");
     	}
    you will be trying to access pSuperString[1000] etc leading to segmentation faults...
    Last edited by vasanth; 11-24-2004 at 04:57 PM.

  7. #7
    Registered User
    Join Date
    Jun 2004
    Posts
    124
    Hrmm....you're right, I failed to notice that, spaces will be delimters whether they are wanted or not...

    Is there a better way to tokenize? Can someone give me some example code to work from? I wrote this from scratch.

  8. #8
    Its not rocket science vasanth's Avatar
    Join Date
    Jan 2002
    Posts
    1,683
    Quote Originally Posted by maxhavoc
    Hrmm....you're right, I failed to notice that, spaces will be delimters whether they are wanted or not...

    Is there a better way to tokenize? Can someone give me some example code to work from? I wrote this from scratch.
    why dont you have a look how strtok works internally... you can write a wrapper function arounf it which would call it as many times as required and return the array you need with all the tokenized string...

  9. #9
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    >>Here are my StringCopy and StringLength functions which are used in my tokenizer
    I'd just like to point out, there are standard library functions that do the exact same thing

    Code:
    #include <cstring>
    
    int len;
    //len = StringLength(theString);
    len = strlen(theString);
    
    //StringCopy(newString, theString);
    strcpy(newString, theString);
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  10. #10
    Carnivore ('-'v) Hunter2's Avatar
    Join Date
    May 2002
    Posts
    2,879
    Here, I typed this up just now though I haven't tested it extensively.
    Code:
    bool isDelim(char c, const std::string& delims)
    {
       for(std::string::const_iterator it = delims.begin(); it != delims.end(); ++it)
       {
    	  if(*it == c)
    		 return true;
       }
       return false;
    }
     
    std::vector<std::string> tokenize(const std::string& string, const std::string& delims)
    {
       std::vector<std::string> container;
       int lastPos = 0;
       int pos = 0;
     
       while(pos < string.size())
       {
    	  while(pos < string.size() && isDelim(string[pos], delims))
    		 ++pos;
    	  if(pos == string.size())
    		 break;
    	  lastPos = pos;
    	  while(pos < string.size() && !isDelim(string[pos], delims))
    		 ++pos;
      
    	  container.push_back(string.substr(lastPos, pos - lastPos));
       }
       return container;
    }
    It should act in the same way that strtok() does, except that it doesn't modify the original string and it returns all tokens in a single vector of strings.

    The concept is, you ignore all of the delimiters in the front; once you've found the first non-delimiter, mark the position, and loop until you find a delimiter. Then you copy all characters from the initial marker up to your current position into the vector of tokens. Now you go back the first part, and keep going until you find the first non-delimiter. Mark it, find the next delimiter and copy everything in between, etc.

    Note that I took the tokenCount argument out, since tokenCount is simply equal to the vector's size. Also note that I added tests for the end of the string in each of the nested loops, to prevent array out-of-bounds.
    Last edited by Hunter2; 11-24-2004 at 05:52 PM. Reason: Bloody editor keeps killing my indentation...
    Just Google It. √

    (\ /)
    ( . .)
    c(")(") This is bunny. Copy and paste bunny into your signature to help him gain world domination.

  11. #11
    Registered User
    Join Date
    Jun 2004
    Posts
    722
    Quote Originally Posted by Hunter2
    >>Here are my StringCopy and StringLength functions which are used in my tokenizer
    I'd just like to point out, there are standard library functions that do the exact same thing

    Code:
    #include <cstring>
    
    int len;
    //len = StringLength(theString);
    len = strlen(theString);
    
    //StringCopy(newString, theString);
    strcpy(newString, theString);
    And in most cases already compiled, and assembly optimized.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Linked List question from a newb(Need code Critique)
    By Gatt9 in forum C++ Programming
    Replies: 2
    Last Post: 01-08-2006, 03:11 AM
  2. Critique my program
    By Jason Spaceman in forum C Programming
    Replies: 6
    Last Post: 10-26-2004, 05:38 PM
  3. Critique my code?
    By bigbadbowlindud in forum C++ Programming
    Replies: 20
    Last Post: 06-24-2004, 12:54 PM
  4. could ya'll critique this code
    By linuxdude in forum C Programming
    Replies: 0
    Last Post: 02-19-2004, 09:55 PM
  5. Seeking Critique about my prog!
    By Yevmaster in forum C++ Programming
    Replies: 3
    Last Post: 05-26-2003, 01:04 PM