-
hash tables
I am writing a langauge filter for network chat system im writing, i want to search the message for any disallowed words. At the moment each word is hard coded in an array, though efficient it doesnt allow modifications, i need to use them in a file. I have been recommended to use hash tables, can someone explain what these are and how i might go about using them
thanks in adv.
-
Okay, a hash table is essentially an efficient way to store data in an array according to a unique identifier number. The efficient part about it is that searching a hash table is incredibly fast. Ready to have some fun? I prefer to use a dynamic array template for the table:
Code:
template <class T>
class HashTable{
public:
HashTable(int n) : size(n) {
array = new T[n];
for( int i = 0; i < n; ++i)
array[i] = T();
int size(){return size;}
T& operator[](int x){return array[x];}
protected:
T *array;
int size;
};
Let's say that the table will consist of the first and last names of people, just to keep it simple. You would use a struct similar to this:
Code:
struct Person{
bool isNull(){return bool(last.length() == 0);}
string last;
string first;
};
Then there's the part of the main program that actually creates the hash table and puts stuff in it.
Code:
int main(void){
ifstream IN("people.txt");
Person person;
HashTable<Person> table(SIZE);
while(get(person, IN)){
int i = hash(person.last + person.first);
while(!table[i].isNull())
i = (i+1) % SIZE;
table[i] = person;
}
print(table);
}
Then you define the actual hash function that turns the two concatenated names into a number.
Code:
int hash(string s){
int h = 0;
for(int i = 0; i < s.length(); ++i)
h += int(s[i]);
return h % SIZE;
}
And then of course you need the rest of the functions for the program
Code:
bool get(Person& person, ifstream IN){
char buffer[BUFSIZE], temp[BUFSIZE];
IN.getline(buffer, BUFSIZE);
if(IN.fail()) return false;
istrstream s(buffer);
s.getline(temp, BUFSIZE, '\t'); //names separated by tab
person.last = temp;
s.getline(temp, BUFSIZE, '\t');
person.first = temp;
s.getline(temp, BUFSIZE, '\t'); //eat the tab :)
return true;
}
void print(Person& person){
cout<<person.last<<", "<<person.first<<endl;
}
template <class T>
void print(HashTable<T>& t){
for(int i = 0; i < t.size(); ++i){
cout<<setiosflags(ios::right)<<setw(4)<<i<<". "; //print index #
if(!t[i].isNull()) printf(t[i]);
else cout<<endl;
}
}
Having fun yet? Be sure to run it and toy with it, I'm not sure if there are errors because I wrote this off the top of my head, so be ready for debugging and please accept my appologies. You'll need to include <fstream>, <iomanip>, <iostream>, and <sstream>.
-Prelude
-
Thanks Prelude, thats great - i understand now :)
-
It needs a destructor to get rid of the allocated memory.
Also, a more efficient implementation of a hash table for containing x uses an array of x uses an array of linked lists of x rather than an array of x.
This hash tables performance degenerates very quickly, even with a good hash function, as the number of elements in the table approaches the size of the array. When the number of elements is the size, a linear search of half the table is required on average, when there is one less element in the list than a the size of the array, a linear search of a quarter of the array is employed on average. This is because things that hash to 7 can effect the search for things that hash to 6.
For theoritical example, consider adding the following
blah hashes to 6, at index 6
word hashes to 7, at index 7
etc hashes to 6, at index 8
Search for etc
start at index 6
does blah equal etc? no, go to 7
does word equal etc? no, go to 8
does etc equal etc? yes, return etc
If this was implemented using a linked list, searching for items at index 6 requires searching the list for 6, and nothing else. So, even with a large amount of insertions at links slightly greater than 6, the speed of searching for an element at 6 will not slow down.
Then again, surely the guys who wrote the freely availible (SGI) STL hash_map have me beat in performance ;).