Thread: encryption algorithm for finding the most frequent word

  1. #1
    Registered User
    Join Date
    Nov 2009
    Posts
    46

    Smile encryption algorithm for finding the most frequent word

    Hello folks!
    Today my task is to implement the program which inputs the name of the file and outputs the most frequent word in the textfile.
    Limits: max - 50000 word, each word max - 30 chars.
    So here is how im trying to implement it.
    1) count the number of words (N)
    2) malloc the integer array with the size of N
    3) read the word by word
    4) encrypt the word so that the encryption algorithm generates the unique integer which is less than N
    5) increase the value of array in index of the generated integer
    for example generated integer = 1234... then a[1234]++
    6) at the end, decrypt the index which has the greatest value
    7) output it

    So, does anyone know such an algorithm? or any other suggestions are welcome!

    Thanks in advance!

  2. #2
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    A tree, a hash (sounds like what you're suggesting), might work fine. I would write out each word of the file (one line per word), into a file. Then use mergesort to sort all the words.

    Now finding the most common word is trivial - just count which word has the most repeats.

  3. #3
    Registered User
    Join Date
    Nov 2009
    Posts
    46
    ok thanx, i'll try to find hashing function

  4. #4
    Registered User jeffcobb's Avatar
    Join Date
    Dec 2009
    Location
    Henderson, NV
    Posts
    875
    Quote Originally Posted by r00t View Post
    ok thanx, i'll try to find hashing function
    Google for "histogram". Add in "C source" for examples. This is easy though...as for the hashing, what I think you want is a simple string hash, not an encryption method (which means to obfuscate. Snippets.org is your friend. Also look up "string CRC" for other helpers....
    C/C++ Environment: GNU CC/Emacs
    Make system: CMake
    Debuggers: Valgrind/GDB

  5. #5
    Registered User
    Join Date
    Nov 2009
    Posts
    46
    im not sure that hashing works in reverse algorithm, i.e i can get the actual string from that.. if not then i need encryption algorithm, so then i can decrypt it.

  6. #6
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    This will work with a slight modification. You cannot "decrypt" a hash value, usually. For the hash to be reversible, it must not lose information, which needs it needs a bit content at least as long as the total information available in all the words -- this would require an array of astounding size.

    Instead, store the original string along with the count in the hash table. Then, when you find the largest entry in the table, you just extract the corresponding word.

    What are you going to do when two strings both hash to the same value? If the number of entries in your array is less than the number of strings you are examining, this is guaranteed to happen.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  7. #7
    Registered User
    Join Date
    Nov 2009
    Posts
    46
    What are you going to do when two strings both hash to the same value?
    that's the thing man, i need an algorithm that generates a unique integer, for every different string

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by r00t View Post
    that's the thing man, i need an algorithm that generates a unique integer, for every different string
    Don't you already have a unique integer for every string, in the set of the ascii values, for each letter in the word?

    Char's are just small int's, with an offset of 48.

    So aren't the letters themselves, the unique "key" that you need? Unless you're trying to keep text unreadable, letters are just as good as digits - indeed, they are made of digits.

    A fast easy encryptor? XOR the letters ascii value with the next digit in a number* you have as your key. To decrypt it, you just repeat that XOR, with the same key.


    *and for that key number, I would use a word or short phrase with N-1 letters or so, and use the ascii value of it's letters, as my number key.
    Last edited by Adak; 01-27-2010 at 08:29 PM.

  9. #9
    Registered User
    Join Date
    Nov 2009
    Posts
    46
    the maximum number of characters in a word is 30, so if asscii value of every char is 2 digits then, i would have 60 digit integer. am i right? i dont think this way going to work.
    please correct me if im getting things wrong, thanx!

  10. #10
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    OK, you're wrong.

    You can't encrypt our alphabet in just 1 digit numbers (in base 10). You must have at least 26 digits for a range large enough to handle the alphabet.

    Now, you could encrypt the word into other char's - e.g.: a = # , etc. But if you need an integer to represent uniquely, a word, then you can't do it with just a range of -9 to +9, and that's all the one digits numbers we have.

    From my limited knowledge of encryption, two digits are customary to represent a letter. We could make all the most common letters, represented by 1 digit numbers, however. That doesn't leave a lot of two digit letters. If security is not a concern, then we should consider it?

  11. #11
    Registered User
    Join Date
    Nov 2009
    Posts
    46
    yeah i c what u mean, but it wont work with the arrays right? i cant use that integer as an index of an element of an array. would u recommend any other way to solve my problem?
    thanx

  12. #12
    Registered User
    Join Date
    Jan 2010
    Posts
    412
    Here's a quick example of how it could be done.
    The code could be optimized quite a bit, and loading words from a file is not included. (Don't want to finish the whole assignment for you)
    There's also an intentional bug somewhere in there

    Code:
    #include <stdio.h>
    #include <string.h>
    
    typedef struct
    {
    	char* word;
    	int count;
    	void* next;
    } wordlist;
    
    // if the list contains the word to scan for return pointer to that node
    // Otherwise return NULL
    wordlist* find_word(wordlist* list, char* word)
    {
    	wordlist* current = list;
    	while(current)
    	{
    		if(strcmp(current->word, word) == 0)
    			return current;
    		current = current->next;
    	}
    	return NULL;
    }
    
    // adds word to the list
    // returns a pointer to the newly added word
    wordlist* add_word(wordlist* list, char* word)
    {
    	wordlist *current=list, *tmp=NULL;
    	while(current)
    	{
    		tmp = current;
    		current = current->next;
    	}
    	current = malloc(sizeof(*current));
    	current->word = word;
    	current->count = 1;
    	current->next = NULL;
    	if(tmp)
    		tmp->next = current;
    	return current;
    }
    
    // frees allocated memory
    void delete_list(wordlist* list)
    {
    	wordlist *current = list;
    	while(current)
    	{
    		current = current->next;
    		free(current);
    	}
    }
    
    int main()
    {
    	char* words[] = { "this", ".", "is", ".", "a", ".",
    		"test", ".", "of", ".", "a", ".", "word", ".", "list" };
    
    	wordlist *list=NULL, *current=NULL, *max=NULL;
    	int i, numwords=15;
    
    	for(i=0; i<numwords; i++)
    	{
    		wordlist* tmp = find_word(list, words[i]);
    		if(tmp)
    		{
    			tmp->count++;
    			if(max->count < tmp->count)
    				max = tmp;
    		}
    		else
    		{
    			current = add_word(current, words[i]);
    			if(!list) list = current;
    			if(!max) max = current;
    		}
    	}
    	printf("most frequent word is: %s (%d times)\n", max->word, max->count);
    	delete_list(list);
    
    	return 0;
    }

  13. #13
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by r00t View Post
    Hello folks!
    Today my task is to implement the program which inputs the name of the file and outputs the most frequent word in the textfile.
    Limits: max - 50000 word, each word max - 30 chars.
    So here is how im trying to implement it.
    1) count the number of words (N)
    2) malloc the integer array with the size of N
    3) read the word by word
    4) encrypt the word so that the encryption algorithm generates the unique integer which is less than N
    5) increase the value of array in index of the generated integer
    for example generated integer = 1234... then a[1234]++
    6) at the end, decrypt the index which has the greatest value
    7) output it

    So, does anyone know such an algorithm? or any other suggestions are welcome!

    Thanks in advance!
    I see (finally!) what this is -

    This is a distribution count (aka bucket sort), straight up. The only thing different is the need to encrypt the words into integers which must be less than N.

    D'oh!

    Let me poke around.

  14. #14
    Registered User
    Join Date
    Jan 2010
    Posts
    412
    Quote Originally Posted by Adak View Post
    The only thing different is the need to encrypt the words into integers which must be less than N.
    I don't think that's required, unless I misunderstood the OP. But the way I read it was that the only requirement is
    Quote Originally Posted by r00t View Post
    Today my task is to implement the program which inputs the name of the file and outputs the most frequent word in the textfile.
    Limits: max - 50000 word, each word max - 30 chars.
    And the rest of the post is just describing how he had tried to solve it.

  15. #15
    Registered User
    Join Date
    Nov 2009
    Posts
    46
    This is a distribution count (aka bucket sort), straight up
    yep, it should be called bucket sort, but i dont remember the exact name.
    and yes, the whole point of this thread was to discover the right algorithm to encrypt it

    thanx for the code _Mike, i'll have a look at it

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. encryption algorithm
    By jtullo in forum C Programming
    Replies: 3
    Last Post: 04-21-2007, 10:48 PM
  2. encryption algorithm
    By xddxogm3 in forum C++ Programming
    Replies: 3
    Last Post: 05-21-2005, 04:30 PM
  3. abt encryption algorithm
    By purIn in forum C Programming
    Replies: 9
    Last Post: 12-22-2003, 10:16 PM
  4. Word Count
    By simple in forum C Programming
    Replies: 12
    Last Post: 10-04-2002, 10:47 PM
  5. Using 'if' with char arrays or string objects
    By c++_n00b in forum C++ Programming
    Replies: 36
    Last Post: 06-06-2002, 09:04 PM