# encryption algorithm for finding the most frequent word

Show 80 post(s) from this thread on one page
Page 1 of 2 12 Last
• 01-25-2010
r00t
encryption algorithm for finding the most frequent word
Hello folks!
Today my task is to implement the program which inputs the name of the file and outputs the most frequent word in the textfile.
Limits: max - 50000 word, each word max - 30 chars.
So here is how im trying to implement it.
1) count the number of words (N)
2) malloc the integer array with the size of N
3) read the word by word
4) encrypt the word so that the encryption algorithm generates the unique integer which is less than N
5) increase the value of array in index of the generated integer
for example generated integer = 1234... then a[1234]++
6) at the end, decrypt the index which has the greatest value
7) output it

So, does anyone know such an algorithm? or any other suggestions are welcome!

• 01-25-2010
A tree, a hash (sounds like what you're suggesting), might work fine. I would write out each word of the file (one line per word), into a file. Then use mergesort to sort all the words.

Now finding the most common word is trivial - just count which word has the most repeats.
• 01-27-2010
r00t
ok thanx, i'll try to find hashing function
• 01-27-2010
jeffcobb
Quote:

Originally Posted by r00t
ok thanx, i'll try to find hashing function

Google for "histogram". Add in "C source" for examples. This is easy though...as for the hashing, what I think you want is a simple string hash, not an encryption method (which means to obfuscate. Snippets.org is your friend. Also look up "string CRC" for other helpers....
• 01-27-2010
r00t
im not sure that hashing works in reverse algorithm, i.e i can get the actual string from that.. if not then i need encryption algorithm, so then i can decrypt it.
• 01-27-2010
brewbuck
This will work with a slight modification. You cannot "decrypt" a hash value, usually. For the hash to be reversible, it must not lose information, which needs it needs a bit content at least as long as the total information available in all the words -- this would require an array of astounding size.

Instead, store the original string along with the count in the hash table. Then, when you find the largest entry in the table, you just extract the corresponding word.

What are you going to do when two strings both hash to the same value? If the number of entries in your array is less than the number of strings you are examining, this is guaranteed to happen.
• 01-27-2010
r00t
Quote:

What are you going to do when two strings both hash to the same value?
that's the thing man, i need an algorithm that generates a unique integer, for every different string
• 01-27-2010
Quote:

Originally Posted by r00t
that's the thing man, i need an algorithm that generates a unique integer, for every different string

Don't you already have a unique integer for every string, in the set of the ascii values, for each letter in the word?

Char's are just small int's, with an offset of 48.

So aren't the letters themselves, the unique "key" that you need? Unless you're trying to keep text unreadable, letters are just as good as digits - indeed, they are made of digits.

A fast easy encryptor? XOR the letters ascii value with the next digit in a number* you have as your key. To decrypt it, you just repeat that XOR, with the same key.

*and for that key number, I would use a word or short phrase with N-1 letters or so, and use the ascii value of it's letters, as my number key. :)
• 01-27-2010
r00t
the maximum number of characters in a word is 30, so if asscii value of every char is 2 digits then, i would have 60 digit integer. am i right? i dont think this way going to work.
please correct me if im getting things wrong, thanx!
• 01-27-2010
OK, you're wrong. :)

You can't encrypt our alphabet in just 1 digit numbers (in base 10). You must have at least 26 digits for a range large enough to handle the alphabet.

Now, you could encrypt the word into other char's - e.g.: a = # , etc. But if you need an integer to represent uniquely, a word, then you can't do it with just a range of -9 to +9, and that's all the one digits numbers we have.

From my limited knowledge of encryption, two digits are customary to represent a letter. We could make all the most common letters, represented by 1 digit numbers, however. That doesn't leave a lot of two digit letters. If security is not a concern, then we should consider it?
• 01-28-2010
r00t
yeah i c what u mean, but it wont work with the arrays right? i cant use that integer as an index of an element of an array. would u recommend any other way to solve my problem?
thanx :)
• 01-28-2010
_Mike
Here's a quick example of how it could be done.
The code could be optimized quite a bit, and loading words from a file is not included. (Don't want to finish the whole assignment for you) :)
There's also an intentional bug somewhere in there :D

Code:

```#include <stdio.h> #include <string.h> typedef struct {         char* word;         int count;         void* next; } wordlist; // if the list contains the word to scan for return pointer to that node // Otherwise return NULL wordlist* find_word(wordlist* list, char* word) {         wordlist* current = list;         while(current)         {                 if(strcmp(current->word, word) == 0)                         return current;                 current = current->next;         }         return NULL; } // adds word to the list // returns a pointer to the newly added word wordlist* add_word(wordlist* list, char* word) {         wordlist *current=list, *tmp=NULL;         while(current)         {                 tmp = current;                 current = current->next;         }         current = malloc(sizeof(*current));         current->word = word;         current->count = 1;         current->next = NULL;         if(tmp)                 tmp->next = current;         return current; } // frees allocated memory void delete_list(wordlist* list) {         wordlist *current = list;         while(current)         {                 current = current->next;                 free(current);         } } int main() {         char* words[] = { "this", ".", "is", ".", "a", ".",                 "test", ".", "of", ".", "a", ".", "word", ".", "list" };         wordlist *list=NULL, *current=NULL, *max=NULL;         int i, numwords=15;         for(i=0; i<numwords; i++)         {                 wordlist* tmp = find_word(list, words[i]);                 if(tmp)                 {                         tmp->count++;                         if(max->count < tmp->count)                                 max = tmp;                 }                 else                 {                         current = add_word(current, words[i]);                         if(!list) list = current;                         if(!max) max = current;                 }         }         printf("most frequent word is: %s (%d times)\n", max->word, max->count);         delete_list(list);         return 0; }```
• 01-28-2010
Quote:

Originally Posted by r00t
Hello folks!
Today my task is to implement the program which inputs the name of the file and outputs the most frequent word in the textfile.
Limits: max - 50000 word, each word max - 30 chars.
So here is how im trying to implement it.
1) count the number of words (N)
2) malloc the integer array with the size of N
3) read the word by word
4) encrypt the word so that the encryption algorithm generates the unique integer which is less than N
5) increase the value of array in index of the generated integer
for example generated integer = 1234... then a[1234]++
6) at the end, decrypt the index which has the greatest value
7) output it

So, does anyone know such an algorithm? or any other suggestions are welcome!

I see (finally!) what this is - :)

This is a distribution count (aka bucket sort), straight up. The only thing different is the need to encrypt the words into integers which must be less than N.

D'oh!

Let me poke around.
• 01-28-2010
_Mike
Quote:

The only thing different is the need to encrypt the words into integers which must be less than N.

I don't think that's required, unless I misunderstood the OP. But the way I read it was that the only requirement is
Quote:

Originally Posted by r00t
Today my task is to implement the program which inputs the name of the file and outputs the most frequent word in the textfile.
Limits: max - 50000 word, each word max - 30 chars.

And the rest of the post is just describing how he had tried to solve it.
• 01-28-2010
r00t
Quote:

This is a distribution count (aka bucket sort), straight up
yep, it should be called bucket sort, but i dont remember the exact name.
and yes, the whole point of this thread was to discover the right algorithm to encrypt it :)

thanx for the code _Mike, i'll have a look at it
Show 80 post(s) from this thread on one page
Page 1 of 2 12 Last