Thread: Generate list of tokens for a query

  1. #1
    Registered User
    Join Date
    Aug 2012
    Posts
    78

    Generate list of tokens for a query

    Hello there!


    I'm trying to solve the following problem:

    Given a query, find all possible tokens, i.e. query-splits

    Example:
    query = New York Hotel
    tokens = {New, York, Hotel, New York, New Hotel, ...., New Hotel York, Hotel New York, ...}
    Given a query with word count n, the total number of tokens is:
    n + n*(n-1) + n*(n-1)*(n-2) + ... + n! (any explicit formula available for this sum?)

    Now, I have came across various code snippets to generate permutations for a string, but never for a sentence.

    Any ideas on how to proceed?
    Last edited by in_ship; 06-09-2013 at 12:48 AM.

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    So you know how to permute "abc", but not "New York Hotel" ?

    How about permuting 0,1,2 and use them as indices into an array of words containing New York Hotel ?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Aug 2010
    Location
    Poland
    Posts
    733
    Quote Originally Posted by in_ship View Post
    Now, I have came across various code snippets to generate permutations for a string, but never for a sentence.
    Sentence is also a string. `std::string` doesn't care if it holds a word, a single character, or a whole sentence. You need to split the query string into tokens:

    Code:
    std::string query; // your sentence
    
    std::getline(std::cin, query); // read the sentence
    
    // split sentence from `query` into separate words
    std::istringstream ss(query);
    std::vector<std::string> words;
    while (ss)
    {
        std::string word;
        ss >> word;
        if (!word.empty())
        {
            words.push_back(std::move(word));
        }
    }
    
    /* Here, generate permutations for the strings in `words`. */

  4. #4
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    I found another way of doing it, using next_permutation.

    I have extended the problem a bit: I read a list of queries from a file, compute all its tokens/permutations and lookup whether and how many times these tokens are contained in the same file itself.
    Then, I compute a range of statistics (mean, median, mode).

    My code is as follows:

    1) First I define 2 help functions, which are used inside main for basic calculations (count # words, compute factorial)

    Code:
    inline unsigned CountWords( const string &s )
    {
    
        string x = s;
        replace_if( x.begin(), x.end(), std::ptr_fun <int, int> ( std::isspace ), ' ' );
        x.erase( 0, x.find_first_not_of( " " ) );
    
        if (x.empty()) return 0;
        return count( x.begin(), std::unique( x.begin(), x.end() ), ' ' ) + !std::isspace( *s.rbegin() );
    
    }
    
    
    
    int factorial(int number) {
    
        if(number <= 1) return 1;
    
        int tmp;
        tmp = number * factorial(number - 1);
        return tmp;
    
    }

    2) Process input data and store it into vector list of queries

    Code:
    int main( int argc, char * args[] )
    {
    
        // Declare input/output files
        ifstream input("dummy.txt");
        string token_stats = "TokenStats.txt";
        ofstream fcoclickout(token_stats.c_str());
    
    
        // Read query-by-query and insert into list of queries
        vector<string> querySet;
        string query, dummy;
        while (getline(input, query, '\t'))
        {
            querySet.push_back(query);
            getline(input, dummy, '\n');                        // discard remaining part of each line (after query)
        }

    3) Compute and process tokens

    Code:
        // Traverse each query
        vector<string>::iterator it;
        vector<int> tokenCount;
        vector<double> isContainedTokenPercentage;
        string q;
        int n, tokens;
        for (it = querySet.begin(); it != querySet.end(); it++)
        {
    
            // Count number of words in query
           q = *it;
           n = CountWords(q);
    
    
           // Write query words into new array
           char * queryArray[n];
           char str[q.size()+1];
           strcpy(str, q.c_str());
           int j = 0;
           char * pch = strtok(str, " ");
           while (pch != NULL)
           {
    //         printf ("%s\n", pch);
             pch = strtok (NULL, " ");
             queryArray[j] = pch;
             j++;
           }
    
    
           // Count number of tokens for each query
           tokens = factorial(n);
           int sum = 0;
           for (int i = 1; i <= n; i++)
           {
               sum = sum + factorial(i);
           }
           tokens = tokens * sum;
           tokenCount.push_back(tokens);                            // save total # of tokens for overall statistics
    
    
           // Write all query tokens into new array
           string queryTokens[tokens];
           string tmp = "";
           int t = 0;
    
           sort(queryArray, queryArray + n);
           do    {
    
                      for (int l = 0; l < n; l++)
                      {
                          tmp = tmp + queryArray[l];
                          cout << "tmp: " << tmp << endl;
                      }
    
                      queryTokens[t] = tmp;
                      t++;
    
           }    while ( next_permutation(queryArray, queryArray + n) );

    4) Look up how many of those tokens are contained in the vector list and compute statistics from it

    Code:
           // look up tokens in the dictionary (discard token = original query - is it index = 0? find out!)
           // Count # of tokens contained in dictionary
           int count = 0;
           for (int i = 0; i < tokens; i++)
           {
    
                vector<string>::iterator it_query;
                for (it_query = querySet.begin(); it_query != querySet.end(); it_query++)
                {
                   const char * a = (*it_query).c_str();
                   const char * b = queryTokens[i].c_str();
                   if (strcmp(a, b) == 0)            count++;
                }
    
           }
    
           // save fraction of tokens contained in dictionary/total # of tokens
           isContainedTokenPercentage.push_back(count/tokens);
    
    
        }
    
    
        // Compute total # of tokens
        int totalTokens = 0;
        vector<int>::iterator it_tokenCount;
        for (it_tokenCount = tokenCount.begin(); it_tokenCount != tokenCount.end(); it_tokenCount++)
        {
                totalTokens = totalTokens + *it_tokenCount;
        }
    
    
        // Compute mean, median and mode for # of query-tokens
    
    
    
        // Compute mean, median and mode for # of isContainedTokenPercentage
    
    
    
    
    
    
        // Print statistics
        fcoclickout << "Total (# tokens): " << totalTokens << endl;
    
    
    
    
    
        cout << "done";
        return 0;
    
    }
    However, the output file Tokenstats.txt is empty. I am rather unsure where the error occurred.
    Last edited by in_ship; 06-10-2013 at 02:54 AM.

  5. #5
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    I've used next_perm to perform permutations. However, I only obtain permutations of same length as the string itself (not of substrings)

    queryArray contains the entire query split into words, where each word is one element of the array.
    queryTokens contains the entire list of permutations, where each permutation is stored as a string in one element of the array.

    Code:
           string queryTokens[tokens];
           string tmp = "";
           int t = 0;
    
           do    {
    
                      for (int l = 0; l < n; l++)
                      {
                          tmp = tmp + " " + queryArray[l];
                      }
    
                      queryTokens[t] = tmp;
                      cout << "perm " << t << ": "<< tmp << endl;
                      t++;
                      tmp = "";
    
           }    while ( next_permutation(queryArray, queryArray + n) );
    Sample output for query iphone for multiple users
    perm 0: iphone for multiple people
    perm 1: iphone for people multiple
    perm 2: iphone multiple for people
    perm 3: iphone multiple people for
    perm 4: iphone people for multiple
    perm 5: iphone people multiple for
    perm 6: for iphone multiple people
    perm 7: for iphone people multiple
    perm 8: for multiple iphone people
    perm 9: for multiple people iphone
    perm 10: for people iphone multiple
    perm 11: for people multiple iphone
    perm 12: multiple iphone for people
    perm 13: multiple iphone people for
    perm 14: multiple for iphone people
    perm 15: multiple for people iphone
    perm 16: multiple people iphone for
    perm 17: multiple people for iphone
    perm 18: people iphone for multiple
    perm 19: people iphone multiple for
    perm 20: people for iphone multiple
    perm 21: people for multiple iphone
    perm 22: people multiple iphone for
    perm 23: people multiple for iphone

    Substrings, which are being left out, are:
    iphone, for, multiple, people, iphone for, for iphone, multiple people, people multiple for, .....
    How to get the remaining 40 permutations (24 done, 64 should be there in total)?

  6. #6
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    So, I have a vague idea of what can be done here:
    I can use my function next_perm as before.
    However, I need to now apply it not to the full array, but sub-parts of the array (with length = 1, ... , length(queryArray)).

    But how do I create these sub-parts? Let's again assume the query is New York Hotel.

    Length = 1: Index 0, 1, 2 --> permutations: New, York, Hotel
    Length = 2: Index 01, 02, 12 --> permutations: New York (->York New), York Hotel (-> Hotel York), New Hotel (-> Hotel New)
    Length = 3: Index 012 --> permutations: New York Hotel (discard this token, since it's identical to original query), New Hotel York, York Hotel New, York New Hotel, Hotel New York, Hotel York New

  7. #7
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    You need to handle all the combinations of n things taken 1 at a time, n things taken 2 at a time, ... , n things taken n at a time, then generate the permutations for all of the possible combinations.

  8. #8
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    Quote Originally Posted by rcgldr View Post
    You need to handle all the combinations of n things taken 1 at a time, n things taken 2 at a time, ... , n things taken n at a time, then generate the permutations for all of the possible combinations.
    Yes, this is exactly what I mentioned in my previous post. The question is now - how to create these n different combinations, which are then input to the permutation black-box.

  9. #9
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    I tried using the following piece of code (inspired by stackoverflow), but I get a completely random output now (no words, just characters).

    Code:
       
    
    int main(){
    
     list<char> lt;
           subset(queryArray, 5, 3, 0, lt);
    
        return 0; }
    void print( list<char> l){
        for(list<char>::iterator it=l.begin(); it!=l.end() ; ++it)
                cout << " " << *it;
        cout<<endl;
    }
    
    void subset(char * arr[], int size, int left, int index, list<char> &l){
        if(left==0){
            print(l);
            return;
        }
        for(int i=index; i<size;i++){
            l.push_back(*arr[i]);
            subset(arr,size,left-1,i+1,l);
            l.pop_back();
        }
    
    }


    i f m
    i f p
    i f
    i m p
    i m
    i p
    f m p
    f m
    f p
    m p

  10. #10
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    You need to handle all the combinations of n things taken 1 at a time, n things taken 2 at a time, ... , n things taken n at a time, then generate the permutations for all of the possible combinations.
    Quote Originally Posted by in_ship View Post
    Yes, this is exactly what I mentioned in my previous post. The question is now - how to create these n different combinations, which are then input to the permutation black-box.
    One way is nested loops:

    Code:
    /* n things 1 at a time */
        for(i = 0; i < n; i++){
    
    /* n things 2 at a time */
        for(i = 0; i < n-1; i++){
            for(j = i+1; j < n; j++){
    
    /* n things 3 at a time */
        for(i = 0; i < n-2; i++){
            for(j = i+1; j < n-1; j++){
                for(k = j+1; k < n; k++){
    There's probably a more generic recursive or recursive like way to do this. Another alternative would be to use an array of indices instead of i, j, k, ... .
    Last edited by rcgldr; 06-11-2013 at 01:18 AM.

  11. #11
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    Quote Originally Posted by rcgldr View Post
    One way is nested loops:

    Code:
    /* n things 1 at a time */
        for(i = 0; i < n; i++){
    
    /* n things 2 at a time */
        for(i = 0; i < n-1; i++){
            for(j = i+1; j < n; j++){
    
    /* n things 3 at a time */
        for(i = 0; i < n-2; i++){
            for(j = i+1; j < n-1; j++){
                for(k = j+1; k < n; k++){
    Nested loops work only if your know the number of nested loops you want. However, this number varies, depending on the word count of a query and thus, these loops cannot be hard-coded.

    What do you mean by using array indices?

  12. #12
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    This example program generates all combinations and permutations for n things taken 1 to n at a time.

    Code:
    /*----------------------------------------------------------------------*/
    /*      combperm.c  create a list combinations and permuations for      */
    /*                  n integers taken 1 to n at a time                   */
    /*                  where integers range from 0 to n-1                  */
    /*----------------------------------------------------------------------*/
    #include <stdio.h>
    #include <malloc.h>
    
    /* set to 0 to do only combinations */
    #define DOPERM 1
    
    static int n = 5;                       /* n integers */
    static int k;                           /* taken k at a time */
    
    static int *c;                          /* combinations */
    static int *p;                          /* permutations */
    
    static void doc(void);
    static void dop(void);
    static void swap(int, int);
    static void showp(void);
    
    /*----------------------------------------------------------------------*/
    /*      main                                                            */
    /*----------------------------------------------------------------------*/
    int main(int argc, char **argv)
    {
        c = malloc(n *  sizeof(int));       /* allocate arrays */
        p = malloc(n *  sizeof(int));
    
        for(k = 1; k <= n; k++)             /* n things k at a time */
            doc();
     
        free(p);                            /* free arrays */
        free(c);
    
        return(0);                          /* return */
    }
    
    /*----------------------------------------------------------------------*/
    /*      doc         do combinations                                     */
    /*----------------------------------------------------------------------*/
    static void doc()
    {
    int i, j;
    
        for (i = 0; i < k;  i++)            /* generate initial combination */
            c[i] = i;
    
        dop();                              /* do permutations */
    
        while(1)                            /* generate next combination */
        {       
            i = k-1;                        /* find next element to increment */
            while(c[i] == (n-k+i))
                --i;
            if (i < 0)                      /* return if done */
                return;
            ++c[i];                         /* increment element */
    
            for(j = i+1; j < k; j++)        /* create increasing string */
                c[j] = c[i]+j-i;
    
            dop();                          /* do permutations */
        }
    }
    
    /*----------------------------------------------------------------------*/
    /*      dop         do permutations                                     */
    /*----------------------------------------------------------------------*/
    static void dop()
    {
    int i;
    #if DOPERM
    int j;
    #endif
    
        for (i = 0; i < k;  i++)            /* initial perm = comb */
            p[i] = c[i];
        printf("\n");
        showp();                            /* show permutation */
    
    #if DOPERM
    
        if(k < 2)                           /* return if only 1 element */
            return;
    
        while(1)                            /* generate next permutation */
        {
    
    /*      scan backwards as long as p[] order is reversed */
        
            i = k-2;                        /* scan backwards until */
            while(p[i] >= p[i+1])           /* p[i] < p[i+1] */
                i--;
            if (i < 0)                      /* return if done (all reversed) */
                return;
    
    /*      p[i] is left most integer to swap */
    /*      scan from right end until p[j] > p[i] is found */
    
            j = k-1;                        /* scan backwards until */
            while (p[j] <= p[i])            /* p[j] > p[i] */
                j--;
    
    /*      p[j] is right most integer to swap */
    
            swap(i, j);                     /* swap elements */
    
    /*      reverse p[] from p[i+1] to p[k-1] */
    
            i++;                            /* reverse elements */
            j = k-1;
            while (i < j)
            {
                swap(i, j);
                i++;
                j--;
            }
    
            showp();                        /* show permutation */
        }
    #endif                                  /* #if DOPERM */
    }
    
    /*----------------------------------------------------------------------*/
    /*      swap two elements of p                                          */
    /*----------------------------------------------------------------------*/
    static void swap(int i, int j)
    {
    int temp;
    
        temp = p[i];
        p[i] = p[j];
        p[j] = temp;
    }
    
    
    /*----------------------------------------------------------------------*/
    /*      showp       show permutation                                    */
    /*----------------------------------------------------------------------*/
    static void showp(void)
    {
    int i;
    
        for (i = 0; i < k; i++)
            printf(" %1d", p[i]);
        printf("\n");
    }
    Last edited by rcgldr; 06-11-2013 at 06:45 AM.

  13. #13
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Explanation for the permutation code:

    Code:
    /* this searches the indexes backwards to find the last index that is not at it's maximum value */
    /* usually the last index is not at it's maximum value so the while() does not loop */
    
            i = k-1;                        /* find next element to increment */
            while(c[i] == (n-k+i))
                --i;
    /* if all the indexes are at their max value, the process is complete */
    
            if (i < 0)                      /* return if done */
                return;
    
    /* this increments the last index not at it's maximum value */
    
            ++c[i];                         /* increment element */
    
    /* if the last index is not at it's maximum value (the usual case), the for() will not loop */
    /* if one or more of the indexes were at their maximum value, */
    /* then the last index not at it's maximum value was incremented, */
    /* and the indexes after the incremented index need to be reset to an incrementing sequence */
    /* c[i+1] = c[i] + 1 ... c[i+2] = c[i] + 2 ... c[i+3] = c[i] + 3 ... */
    
            for(j = i+1; j < k; j++)        /* create increasing string */
                c[j] = c[i]+j-i;
    Last edited by rcgldr; 06-11-2013 at 03:16 PM.

  14. #14
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    Explanation for the permutation code:
    The previous post is an explanation of the permutation code. To further explain this, take the case of n things taken 3 at a time, using 3 nested loops, using <= for compare instead of < :

    Code:
    /* ... */
        for(i = 0; i <= n-3; i++){
            for(j = i+1; j <= n-2; j++){
                for(k = j+1; k <= n-1; k++){
    /* ... */
    This set of nested loops starts off with (i, j, k) = (0, 1, 2) and ends after the loop where (i, j, k) == (n-3, n-2, n-1). The inner most loop exits after the loop with k == (n-1), then j is incremented and k set to j + 1. The middle loop exits after the loop with j == (n-2), then i is incremented, j set to i + 1, and k set to j + 1 (or i + 2). The looping process is complete after the loop where i == (n-3). The permutation code follows this same logic, but in a generic manner.
    Last edited by rcgldr; 06-11-2013 at 05:01 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Structures and tokens
    By chetah in forum C Programming
    Replies: 4
    Last Post: 04-10-2013, 12:40 PM
  2. Strings and cin.get/tokens
    By Kayoss in forum C++ Programming
    Replies: 2
    Last Post: 02-28-2006, 06:51 PM
  3. Trouble with Tokens
    By DarrenY in forum C Programming
    Replies: 4
    Last Post: 05-21-2005, 02:25 PM
  4. Tokens?
    By trenzterra in forum C++ Programming
    Replies: 12
    Last Post: 05-03-2005, 08:14 AM
  5. any use for tokens?
    By tetraflare in forum C++ Programming
    Replies: 1
    Last Post: 01-03-2003, 10:34 AM