What is hashing?

• 04-16-2002
Nutshell
What is hashing?
Hi,

What is hashing? What are they used for?

Pls explain in a bit more detail coz i really dont' know what it really is, purpose, and usage.

thnx a lot
• 04-16-2002
shtarker
Firstly isn't the Genreal Discussion board for "Non-programming related topics"?

Secondly hashing is a sorting algorithm. There are two types of hashing algorithms:

Open Hashing:
To quote a text book: "The essential idea is that the (possibly infinate) set of potential set members is partitioned into a finite number of classes."
In english this means that if you have a stream of data (this could be anything from a short array to the constant stream of telephone billing information) you have a few set ranges (called buckets) that each element in the stream can fall into. As you process the stream you simply place each element into one of these buckets and move on.
For example sorting a huge dictionary file into alphabetic order, you can pass through the first time just testing the first letter of each word and then placing them into different files based on that. This means you would end up with all the A's in one place, all the B's in another and so on. That can dramaticly reduce the future number of operations you need to perform.

Closed Hashing:
Closed hashing basicly means that each bucket can only store one element. In text book language "A closed hash keeps the members of the dictionary in the bucket table itself, rather than using the table to store list headers. As a consequance, it appears we can only put one element in any bucket. However, associated with closed hashing is a rehash strategy. If we try to place x into bucket h(x) and find it already holds an element, a situation called a collision, the rehash stratgy choses a sequance of alternate locations, h1(x), h2(x) . . . within the bucket table, in which we could place x. We try each of these location untill we find an empty one. If none is empty then the table is full and we cannot place x."
This basicly just means that each bucket is now limited to one element. If you try to put another element into an already full bucket then there is a logical way that you go about searching the other buckets until you find an empty bucket. When you can nolonger find any empty buckets your whole thing is full and you have to start deleting elements.
• 04-16-2002
Nick
Here's a fast example I did which uses hashing with seperate
chaining. It's in c++ but there's no classes or templates.
It works basically works by hashing a string which returns a number n and then it inserts it into A[n]. When it trys to find a string it hashes a string and then it looks for the string
in A[n]. For seperate chaining having the table size being
the smallest prime number greater than the table size is good. The code to hash a string was from weiss's book, basically
it uses horner's algorithm.

Code:

```#include <cstdio> #include <cstdlib> #include <cstring> #include <iostream> #include <fstream> #include <string> using namespace std; const int TABLE_SIZE = 45007; struct Node {     char* s;     Node* next; }; struct Hash_table {     Node* buckets[TABLE_SIZE]; }; int hash(const char* s, int tbl_size); void hash_init(Hash_table* h); void hash_insert(Hash_table* h, const char* s); Node* hash_find(Hash_table* h, const char* s); void build_dictionary(Hash_table* h, char* filename); int main(void) {     Hash_table ht;     string s;     hash_init(&ht);     build_dictionary(&ht, "/usr/share/dict/words");     while(getline(cin, s)) {         Node* n = hash_find(&ht, s.c_str());         if (n != 0)             cout << "the string = " << n->s << endl;         else             cout << "not in the dictionary" << endl;     }     return 0; } int hash(const char* s, int tbl_size) {     int r = 0;     for (int i = 0; s[i] != '\0'; ++i)         r = 37 * r + s[i];     r %= tbl_size;     if (r < 0)         r += tbl_size;     return r; } void hash_init(Hash_table* h) {     for (int i = 0; i < TABLE_SIZE; ++i)         h->buckets[i] = 0; } void hash_insert(Hash_table* h, const char *s) {     int hash_nr = hash(s, TABLE_SIZE);     Node** lst = &h->buckets[hash_nr];     Node* curr = *lst;     bool found = false;     while(curr != 0 && !found) {         if (!strcmp(curr->s, s))             found = true;         curr = curr->next;     }     // if not found insert at the beginning of the list   if (!found) {         Node* tmp = new Node;         tmp->s = new char[strlen(s) + 1];         strcpy(tmp->s, s);         tmp->next = *lst;         *lst = tmp;     } } Node* hash_find(Hash_table* h, const char* s) {     int hash_nr = hash(s, TABLE_SIZE);     Node* curr = h->buckets[hash_nr];     Node* n = 0;     while(curr != 0 && n == 0) {         if (!strcmp(curr->s, s))             n = curr;         curr = curr->next;     }     return n; } void build_dictionary(Hash_table* h, char* filename) {     ifstream inf(filename);     string s;     if (!inf) {         cout << "file not found" << endl;         exit(0);     }     while(getline(inf, s))         hash_insert(h, s.c_str()); }```
• 04-16-2002
Nick
Code:

`Secondly hashing is a sorting algorithm. There are two types of hashing algorithms:`
Sorting is when you have a sequence
(a1, a2, a3, a4 ...) and you permute the sequence so
that you have
(b1, b2, b3, b4 ...) with
b1 <= b2 <= b3 <= b4 so hashing really isn't
a sorting algorithm. I think you mean searching.
• 04-16-2002
shtarker
>>a sorting algorithm. I think you mean searching.

Thanks, the words never seem to come out propperly at 3 in the morning.
• 04-17-2002
Unregistered
Hashing produces an ideal situation where the speed of the search is the same for everything you might search for. Of course this is not easy, and collisions have to be resolved through the use of hashing functions.
• 04-17-2002
SilentStrike
I don't think these are ever deallocated...

Code:

```         Node* tmp = new Node;         tmp->s = new char[strlen(s) + 1];```
• 04-17-2002
Nutshell
How do u test for algorithms that produces no or very little collisions? Everybody uses their own version of hashing functions?

thnx
• 04-17-2002
Nick
Quote:

I don't think these are ever deallocated...
When the program exits, I wrote it kind of fast.

You can count the average number of collisions
for each insert. It should be close to one. Other
than that, the hash algorithm should usually use every bit of what your hashing and generate an integer [0, table_size - 1]. each key should be equally likely. Once you have the hash
algorithm you have several techniques for what
happens when you have collisions such as seperate chaining,
linear probing etc.