Thread: Split string up into single words

  1. #1
    Registered User
    Join Date
    Apr 2008
    Posts
    122

    Split string up into single words

    As some of you well know (since I have been asking so many questions in the forum) I am creating an HTML book given a simple text file as input. I am also creating an index of certain terms in the book. Right now I have the book output implemented and am now starting on the index portion. I have this loop right here that takes each line and adds the line to a vector:

    Code:
    Page current;
          current.setPageNum(PageNo);
    
          for(int lineNumber = 0; lineNumber < MAX_LINES_PER_PAGE; lineNumber++)
          {
              string line = book.readLine();
              current.addLine(line);
          }
    
          current.output(out, book.theTitle, pageNo, nextPageNo, prevPageNo);
    Now I need to put a function in here that basically splits the "line" variable up into single words then takes those words and passes them to my index function named "addWord." I have tried using substr and doing line.find(' ') and it gives runtime errors. Any help would be appreciated.

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    It would help quite a bit if you posted your code that actually does the splitting and failing...

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Registered User
    Join Date
    Apr 2008
    Posts
    122
    Okay here is the failing code:

    Code:
    Page current;
          Index create;
          current.setPageNum(PageNo);
    
          for(int lineNumber = 0; lineNumber < MAX_LINES_PER_PAGE; lineNumber++)
          {
              string line = book.readLine();
              current.addLine(line);
              position = line.find(' ');
              create.addWord(line.substr(position), PageNo);
          }
    
          current.output(out, book.theTitle, pageNo, nextPageNo, prevPageNo);

  4. #4
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    What if a page is shorter than MAX_LINES_PER_PAGE? Does readLine fail in a nice way?
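
One plausible source of the runtime error in the failing code above, offered as a guess since readLine and the input file are not shown: when a line contains no space (including an empty line returned after a short page runs out), find(' ') returns string::npos, and substr(npos) throws std::out_of_range. The sketch below reproduces that and shows a checked variant. unsafeTail and safeTail are hypothetical names used only for illustration.

```cpp
#include <stdexcept>
#include <string>

// Mirrors the call in the failing loop: throws std::out_of_range when the
// line has no space, because find() then returns std::string::npos.
std::string unsafeTail(const std::string& line)
{
    std::string::size_type position = line.find(' ');
    return line.substr(position);
}

// Checked variant: returns an empty string when there is no space,
// otherwise everything after the first space.
std::string safeTail(const std::string& line)
{
    std::string::size_type position = line.find(' ');
    if (position == std::string::npos)
        return std::string();
    return line.substr(position + 1);
}
```

With these, unsafeTail("hello") throws, while safeTail("hello") returns an empty string and safeTail("hello world") returns "world".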

  5. #5
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Guessing time!
    Code:
    string line;
    int start = line.find(' ', 0) + 1;
    int end = line.find(' ', start);
    string word = line.substr(start, end - start);
    to get a word from the beginning. My guess is that you forgot the +1? Or did you pass end as the second parameter to substr? It takes a length, not an end position.

    Here is a more C way (if you prefer):
    Code:
    char* cline = line.c_str();
    string word("");
    vector<string> words;
    bool start = false;
    for (int i = 0; i < line.size; ++i)
    {
        if (cline[i] == ' ') {
                  if (word != "") 
                       words.push_back(word);
                  word = "";
             }
        }
        if (cline[i] == \n || cline == EOF) {
             if (word != "") 
                 words.push_back(word);
             break;
        }
        word += cline[i];
    }
    Don't know if the above is 100% correct, but you get the idea. Just scan through the string and do exactly what you want.

  6. #6
    The larch
    Join Date
    May 2006
    Posts
    3,573
    Exactly what word do you want to index? The first? Entire line except first word? All words?

    If it is the first word that you want, perhaps you need something like:

    Code:
    create.addWord(line.substr(0, position), PageNo);
    If you want to split the string entirely up into words, then perhaps boost::split can be of help.
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  7. #7
    Registered User
    Join Date
    Apr 2008
    Posts
    122
    Alright, that works, but how do I move through the line instead of just getting the start of every line?

  8. #8
    Registered User
    Join Date
    Apr 2008
    Posts
    122
    This also prints out the same letter about 50 times.

  9. #9
    Registered User
    Join Date
    Apr 2008
    Posts
    122
    Quote Originally Posted by anon View Post
    Exactly what word do you want to index? The first? Entire line except first word? All words?

    I want to index the entire line if the word is > 3 letters.
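
Reading the requirement as "index every word of the line that is longer than 3 letters" (an assumption; the post could be read other ways), a minimal sketch could look like this. The Index struct here is a hypothetical stand-in that only records its arguments, since the real Index class and addWord are not shown in the thread, and indexLine is an illustrative name.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the poster's Index class; the real addWord()
// is not shown in the thread, so this one just records its arguments.
struct Index
{
    std::vector<std::pair<std::string, int> > entries;

    void addWord(const std::string& word, int pageNo)
    {
        entries.push_back(std::make_pair(word, pageNo));
    }
};

// Splits line on whitespace and indexes every word longer than 3 characters.
void indexLine(const std::string& line, int pageNo, Index& index)
{
    std::istringstream words(line);
    std::string word;
    while (words >> word)
    {
        if (word.size() > 3)
            index.addWord(word, pageNo);
    }
}
```

In the page loop this would be called once per line, for example indexLine(line, PageNo, create). Punctuation stays attached to words here, so "dog," counts as four characters.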

  10. #10
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    OK. Here is working code (I compiled and ran it with Notepad and the command prompt, just for you):
    Code:
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;
    
    int main(){
        string line("hey dude dad");
        const char* cline = line.c_str();
        string word("");
        vector<string> words;
        // No loop bound: the scan stops at the '\0' that c_str() guarantees.
        for (int i = 0; ;++i)
        {
            if (cline[i] == ' ') {
                // A space ends the current word; repeated spaces give empty words, which are skipped.
                if (word != "")
                    words.push_back(word);
                word = "";
                continue;
            }
            if (cline[i] == '\n' || cline[i] == '\0') {
                // End of line or end of string: store the last word and stop.
                if (word != "")
                    words.push_back(word);
                break;
            }
            word += cline[i];
        }
        cout << "WORDS" << endl;
        for (vector<string>::size_type i = 0; i < words.size(); ++i)
            cout << words[i] << endl;
    }
    The previous one had a lot of bugs, some obvious, some not. But I did say it might not be correct.

    A C++ version:

    Code:
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;
    
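    int main(){
        string line("hey dude dad ");
        string word;
        vector<string> words;
        // size_type, not int: find() returns string::npos, which does not fit in an int.
        string::size_type start = 0, end;
        while ((end = line.find(' ', start)) != string::npos)
        {
            // substr takes a start position and a length.
            word = line.substr(start, end - start);
            if (word != "")
                words.push_back(word);
            start = end + 1;
        }
        // Whatever follows the last space is the final word (possibly empty).
        word = line.substr(start, line.size() - start);
        if (word != "")
            words.push_back(word);
        cout << "WORDS" << endl;
        for (vector<string>::size_type i = 0; i < words.size(); ++i)
            cout << words[i] << endl;
    }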

  11. #11
    Registered User
    Join Date
    Apr 2008
    Posts
    122
    YOU ARE THE MAN!!!! Thanks so much!

  12. #12
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    hmm... but if your definition of "word" just means "a group of consecutive non-whitespace characters", then a stringstream will suffice:
    Code:
    #include <iostream>
    #include <string>
    #include <vector>
    #include <sstream>
    
    int main()
    {
        using namespace std;
    
        string line("hey dude dad ");
        vector<string> words;
    
        stringstream ss(line);
        string word;
        while (ss >> word)
        {
            words.push_back(word);
        }
    
        for (vector<string>::size_type i = 0, size = words.size(); i != size; ++i)
        {
            cout << words[i] << endl;
        }
    }
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  13. #13
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Look at this also if you want http://www.cprogramming.com/faq/cgi-...&id=1044780608

    So:
    1) stringstreams for whitespace-separated words
    2) string functions for other separators
    3) a char-by-char scan (good old C style)

    In your case you will probably want option 2 or 3, since you need something more general. Whichever seems easier to you.

    I can tell you right away that if you also want words to be separated by characters such as , . ! < and so on, you could add some || conditions to my C-style code. Then you can read a whole file (NOT including the EOF) and end up with a vector of all the words.
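
A rough sketch of the char-by-char idea for general separators: instead of listing every punctuation character with ||, it treats any character that std::isalnum rejects as a separator. That is an assumption about what counts as a word, not the code from the posts above, and splitOnNonAlnum is a made-up name.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Collects runs of letters and digits; every other character ends a word.
std::vector<std::string> splitOnNonAlnum(const std::string& text)
{
    std::vector<std::string> words;
    std::string word;
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        // Cast to unsigned char: passing a negative char to isalnum is undefined.
        if (std::isalnum(static_cast<unsigned char>(text[i])))
        {
            word += text[i];
        }
        else if (!word.empty())
        {
            words.push_back(word);
            word.clear();
        }
    }
    // The text may end in the middle of a word.
    if (!word.empty())
        words.push_back(word);
    return words;
}
```

One consequence of this rule is that apostrophes and hyphens also split words, so "don't" comes out as "don" and "t". Whether that matters depends on what the index should contain.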

  14. #14
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by C_ntua
    I can tell you right away that if you want words to be separated also with , ., !, < etc etc then you could add some || to my C style code.
    Something similar can be done with your "more" C++ version by using find_first_of() instead of find().
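
A sketch of that suggestion, assuming the set of separators is passed in as a string. splitOnAny is an illustrative name, not something from the thread.

```cpp
#include <string>
#include <vector>

// Splits line at any character found in delimiters, dropping empty pieces.
std::vector<std::string> splitOnAny(const std::string& line,
                                    const std::string& delimiters)
{
    std::vector<std::string> words;
    std::string::size_type start = 0;
    std::string::size_type end;
    while ((end = line.find_first_of(delimiters, start)) != std::string::npos)
    {
        // end == start means two delimiters in a row; nothing to store.
        if (end > start)
            words.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    // Anything after the last delimiter is the final word.
    if (start < line.size())
        words.push_back(line.substr(start));
    return words;
}
```

For example, splitOnAny("hey, dude! dad", " ,!") gives the three words hey, dude and dad.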

  15. #15
    Sweet
    Join Date
    Aug 2002
    Location
    Tucson, Arizona
    Posts
    1,820
    You can also use getline on the stream to build a series of words.
    Code:
    #include <iostream>
    #include <vector>
    #include <string>
    #include <sstream>
    
    int main()
    {
    	//Will be used to determine how to split words up.
    	const char WordDelimiter = ' ';
    
    	//
    	//Build the word string and load it into the string stream
    	//
    	std::string wordString = "Hello how are you today?";
    	std::stringstream wordStream(wordString);
    
    	//Will be used to hold all the words.
    	std::vector<std::string> wordContainer;
    
    	//
    	//Use getline with the stream to get all words separated by space. Add the words to the vector.
    	//
    	std::string word;
    	while(std::getline(wordStream, word, WordDelimiter)){
    		wordContainer.push_back(word);
    	}//while
    
    	//
    	//Loop through and output all the words.
    	//
    	for(std::vector<std::string>::size_type i = 0; i < wordContainer.size(); i++){
    		std::cout<<wordContainer[i]<<std::endl;
    	}//for
    
    	std::cin.get();
    	return 0;
    }
    Woop?
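
Note that getline with a single delimiter character produces an empty string for each pair of adjacent delimiters, so input with double spaces yields empty entries in the vector. A small variation that skips them, with splitOnDelimiter as an illustrative name:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text at each occurrence of delimiter, dropping empty pieces.
std::vector<std::string> splitOnDelimiter(const std::string& text, char delimiter)
{
    std::vector<std::string> words;
    std::istringstream stream(text);
    std::string word;
    while (std::getline(stream, word, delimiter))
    {
        // Adjacent delimiters give an empty word; skip it.
        if (!word.empty())
            words.push_back(word);
    }
    return words;
}
```

Unlike the stream extraction operator, this treats only the one given character as a separator, so it also works for things like comma-separated values.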
