hash tables

I am writing a langauge filter for network chat system im writing, i want to search the message for any disallowed words. At the moment each word is hard coded in an array, though efficient it doesnt allow modifications, i need to use them in a file. I have been recommended to use hash tables, can someone explain what these are and how i might go about using them
thanks in adv.

Okay, a hash table is essentially an efficient way to store data in an array according to a unique identifier number. The efficient part about it is that searching a hash table is incredibly fast. Ready to have some fun? I prefer to use a dynamic array template for the table:

Code:

template <class T> class HashTable{ public: HashTable(int n) : size(n) { array = new T[n]; for( int i = 0; i < n; ++i) array[i] = T(); int size(){return size;} T& operator[](int x){return array[x];} protected: T *array; int size; };

Let's say that the table will consist of the first and last names of people, just to keep it simple. You would use a struct similar to this:

Code:

struct Person{ bool isNull(){return bool(last.length() == 0);} string last; string first; };

Then there's the part of the main program that actually creates the hash table and puts stuff in it.

Code:

int main(void){ ifstream IN("people.txt"); Person person; HashTable<Person> table(SIZE); while(get(person, IN)){ int i = hash(person.last + person.first); while(!table[i].isNull()) i = (i+1) % SIZE; table[i] = person; } print(table); }

Then you define the actual hash function that turns the two concatenated names into a number.

Code:

int hash(string s){ int h = 0; for(int i = 0; i < s.length(); ++i) h += int(s[i]); return h % SIZE; }

And then of course you need the rest of the functions for the program

Code:

bool get(Person& person, ifstream IN){ char buffer[BUFSIZE], temp[BUFSIZE]; IN.getline(buffer, BUFSIZE); if(IN.fail()) return false; istrstream s(buffer); s.getline(temp, BUFSIZE, '\t'); //names separated by tab person.last = temp; s.getline(temp, BUFSIZE, '\t'); person.first = temp; s.getline(temp, BUFSIZE, '\t'); //eat the tab :) return true; } void print(Person& person){ cout<<person.last<<", "<<person.first<<endl; } template <class T> void print(HashTable<T>& t){ for(int i = 0; i < t.size(); ++i){ cout<<setiosflags(ios::right)<<setw(4)<<i<<". "; //print index # if(!t[i].isNull()) printf(t[i]); else cout<<endl; } }

Having fun yet? Be sure to run it and toy with it, I'm not sure if there are errors because I wrote this off the top of my head, so be ready for debugging and please accept my appologies. You'll need to include <fstream>, <iomanip>, <iostream>, and <sstream>.

-Prelude

Thanks Prelude, thats great - i understand now :)

It needs a destructor to get rid of the allocated memory.

Also, a more efficient implementation of a hash table for containing x uses an array of x uses an array of linked lists of x rather than an array of x.

This hash tables performance degenerates very quickly, even with a good hash function, as the number of elements in the table approaches the size of the array. When the number of elements is the size, a linear search of half the table is required on average, when there is one less element in the list than a the size of the array, a linear search of a quarter of the array is employed on average. This is because things that hash to 7 can effect the search for things that hash to 6.

For theoritical example, consider adding the following
blah hashes to 6, at index 6
word hashes to 7, at index 7
etc hashes to 6, at index 8

Search for etc
start at index 6
does blah equal etc? no, go to 7
does word equal etc? no, go to 8
does etc equal etc? yes, return etc

If this was implemented using a linked list, searching for items at index 6 requires searching the list for 6, and nothing else. So, even with a large amount of insertions at links slightly greater than 6, the speed of searching for an element at 6 will not slow down.

Then again, surely the guys who wrote the freely availible (SGI) STL hash_map have me beat in performance ;).