Thread: dispel hash mystique

  1. #1
    Registered User
    Join Date
    Mar 2008
    Posts
    82

    dispel hash mystique

    There seems to be an aura about the hash function and table, and I was trying to come up with a succinct, though informal, description which might say what it is and where generally it might be of use.

    I'll admit that it's also for clearing it up for myself, as I haven't used them much. I have read about them (Wikipedia etc.), but the shorter descriptions always seem too formal, and the long ones, well, end up being, yes, long.

    So, here goes:

    "a hash function allows you to convert a string into a unique integer identifier which you can then subsequently use as a representation of that string in a much more convenient format: namely the integer."

    "You don't want to do the conversion each time the same string pops up, so you generate a look-up table of these strings to their integer representation, and that is your hash table."

    "In many ways it's just an encoding of your set of strings into a set of integers."

    Any corrections, helpful comments or wholesale deletions :-D are welcome, thanks.

  2. #2
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    Here's my most recent attempt:

    Hash Tables
    Hashing Algorithms

    I suppose those would be among the "long" ones.

    Super short dummies version:

    A hash table is an array where anything can be used as an index by filtering that "anything" through a hash function. A hash function turns "anything" into an integer suitable for use as an array index.

    Of course, once you get into collisions and collision resolution, dummies will have a harder time.
    My best code is written with the delete key.

  3. #3
    Registered User
    Join Date
    Mar 2008
    Posts
    82
    Thanks, Prelude, for the answer. It looks and sounds good!

    Shorter than mine too, and given your post count, easily more authoritative, so that's great.

    Yes, I was only aiming for conceptual understanding ... so implementation problems such as collisions I would leave for later.

    It's important to transmit the power of a concept early on, so when the difficulties of implementing start appearing you can refer back (often repeatedly) to how worthwhile it all will be.

    Having got such a great answer, I really shouldn't be an opportunist, but ... would you be able to mention any salient examples of applications where the hash table does some major work?

    I think it's very prevalent in filesystems, but that's a little too deep. Databases too, I think? You can render a whole record as an "array index", couldn't you?

  4. #4
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by stabu View Post
    There seems to be an aura about the hash function and table, and I was trying to come up with a succinct, though informal, description which might say what it is and where generally it might be of use.
    glad to hear somebody (else) making this observation. Most of the hash table "descriptions" I've seen are totally atrocious examples of elliptical and self-referential prose. Even the wikipedia page for "Hash Tables" is a total disaster along these lines -- it totally skips an explanation of what a hash table is and proceeds with a discussion of its parts, which only makes sense if you already know what it is.

    I'm getting a server error for both Prelude's links, unfortunately.

    There are some problems with your definition, I will write another post about that in a minute...
    Last edited by MK27; 04-28-2010 at 09:14 AM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Transposition tables in chess programs!!

  6. #6
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Some problems with your definition:
    "a hash function allows you to convert a string into a unique integer identifier which you can then subsequently use as a representation of that string in a much more convenient format: namely the integer."
    This is not true. You are not converting to a unique identifier at all. Or, at least, not necessarily. That is the significance of collisions. Collisions are an intrinsic aspect of most hash tables.

    "You don't want to do the conversion each time the same string pops up, so you generate a look-up table of these strings to their integer representation, and that is your hash table."
    This is somewhat confused. A hash table is not a look-up table in this sense, and you do need to do the string -> key conversion every time. The purpose of the hash table is not to save time looking up an integer key for a word; it is to save time iterating through a corresponding array.

    "In many ways it's just an encoding of your set of strings into a set of integers."
    No. This implies you could take a key and deduce what the string was just based on the integer. That would be a form of encoding (encryption, even); it is not a hash table.

    Different words can have the same key, for example:
    Code:
    #include <stdio.h>
    #include <string.h>
    
    #define TABLE_SIZE 12
    
    /* Toy key function: sum the characters of the word, then wrap
       the total into the table range with a modulus. */
    int getKey (char *word) {
    	int i, len = strlen(word), total = 0;
    	for (i = 0; i < len; i++) total += word[i];
    	return total % TABLE_SIZE;
    }
    
    
    int main(void) {
    	char *words[] = { "word", "static", "etc" };
    	int i;
    
    	for (i = 0; i < 3; i++) {
    		printf("%10s %3d\n", words[i], getKey(words[i]));
    	}
    
    	return 0;
    }
    "word" and "static" here both have the same key. They both go in the first bucket of the hash table.

    You might want to read these:
    String hashing
    Hash tables - open addressing for collision resolution

    With regard to usefulness: yeah, hash tables can be used in (and are used in) databases. However, they also have a lot of "internal" usefulness. For example, you have a list of files and sizes, and you want to know the size of a particular file without iterating all the way through an array or list.
    Last edited by MK27; 04-28-2010 at 10:25 AM.

  7. #7
    Registered User
    Join Date
    Jan 2009
    Posts
    1,485
    MK, I think your definition is a bit strict. If the requirements and design allow for it, it is possible to construct a hash table without collisions; from a performance standpoint it is also superior.

  8. #8
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Subsonics View Post
    MK, I think your definition is a bit strict. If the requirements and design allow for it, it is possible to construct a hash table without collisions; from a performance standpoint it is also superior.
    First, I think the circumstances where that will be true are much more limited than ones where it is not. That is to say, it is not by definition superior from a performance standpoint; it only seems like it might be that way.* And in general, hash tables use collisions. This is particularly true for generic libraries. Finally, a hash table without collisions would almost certainly need to be completely static, which significantly reduces its usefulness. Anyway, I edited that post slightly to indicate that you do not "always" need a unique key, rather than saying you never use unique keys.

    Second, if you do not understand how collisions work in a hash table, then you do not understand hashing, which is why I was drawing attention to this.

    * For example, a hash table with a unique key for every word in the dictionary would require a rather expensive key generating algorithm, and you would have to overlook this to consider a 1:1 key/element ratio a "performance advantage". It could easily be that using two layers of hash with binary trees on the end will give better performance, and provide a WAY more dynamic table. Which is why, in general, hash tables use collisions.
    Last edited by MK27; 04-28-2010 at 10:29 AM.

  9. #9
    Registered User
    Join Date
    Jan 2009
    Posts
    1,485
    Quote Originally Posted by MK27 View Post
    First, I think the circumstances where that will be true are much more limited than ones where it is not. That is to say, it is not by definition superior from a performance standpoint; it only seems like it might be that way.* And in general, hash tables use collisions. This is particularly true for generic libraries. Finally, a hash table without collisions would almost certainly need to be completely static, which significantly reduces its usefulness.
    Of course it's superior by definition, if part of the definition is that it's implemented or chosen in a context where no collisions occur. You make it sound like collisions are a feature of hash tables when in fact they're something of an evil, and clever implementations try hard to get rid of them as much as possible.

    Quote Originally Posted by MK27 View Post
    Second, if you do not understand how collisions work in a hash table, then you do not understand hashing.
    I have not said that I don't understand how collisions work in hash tables; it's simply you drawing the wrong conclusion based on assumptions. It's easy to see that it's impossible to come up with a way to guarantee a unique number for an infinite number of possible strings.

    Quote Originally Posted by MK27 View Post
    * for example, a hash table with a unique key for every word in the dictionary would require a rather expensive key generating algorithm, which you would have to overlook this to consider a 1:1 key/element a "performance advantage".
    Well, that's why I said if the requirements and implementation allow for it; clearly you did not read that part.


    Anyway, my last post was just a comment on the fact that a hash table without collisions does not stop being a hash table.
    Last edited by Subsonics; 04-28-2010 at 10:36 AM.

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Subsonics View Post
    Of course it's superior by definition, if part of the definition is that it's implemented or chosen in a context where no collisions occur.
    You mean if you define "mymethod" as superior, then "mymethod" is superior by definition. Okay, but that's just a tautology. You were talking about performance, and I was pointing out that a hash table implemented with no collisions does not necessarily provide better performance (memory usage, speed) than one which does, because you will require a more expensive key generating algorithm.

    You make it sound like collisions are a feature of hash tables when in fact they're something of an evil, and clever implementations try hard to get rid of them as much as possible.
    They are a feature, because they are what permit hash tables to be dynamic. A clever implementation tries to deal with collisions cleverly, which means minimizing them to an extent, not eliminating them.

    I have not said that I don't understand how collisions work in hash tables; it's simply you drawing the wrong conclusion based on assumptions.
    I didn't mean you, I meant the OP, who clearly does not understand the significance of collisions.

    It's easy to see that it's impossible to come up with a way to guarantee a unique number for an infinite number of possible strings.
    Right, this is why it is misguided and naive to suggest that "eliminating all collisions" is a good approach to implementing a hash table. In fact "a way to guarantee a unique number for an infinite number of possible strings" is not impossible, but it is not worthwhile, either. It's much better to focus on how collisions are dealt with than to pursue such an alternative.

    Well, that's why I said if the requirements and implementation allow for it; clearly you did not read that part.
    Sure I did, that's what I meant by "circumstances", and I said the circumstances where a collisionless hash table is the best solution are much more limited. Eg, it requires a nearly or completely static table.

    Anyway, my last post was just a comment on the fact that a hash table without collisions does not stop being a hash table.
    Okay. That's not what the OP asserted, however; stabu's initial definition states that the key to an entry is unique for each word, which would mean a hash table never has collisions. I'm just trying to point out the (very important) fact that most hash tables involve collisions.
    Last edited by MK27; 04-28-2010 at 11:08 AM.

  11. #11
    Registered User
    Join Date
    Jan 2009
    Posts
    1,485
    Quote Originally Posted by MK27 View Post
    You mean if you define "mymethod" as superior, then "mymethod" is superior by definition.
    No, that's not what I meant, but you seem to imply that a definition changes if a condition is rare.

    Quote Originally Posted by MK27 View Post
    You were talking about performance, and I was pointing out that a hash table implemented with no collisions does not necessarily provide better performance (memory usage, speed) than one which does, because you will require a more expensive key generating algorithm.
    I'm talking about choosing the right implementation for the circumstance: if your data allows for it, and you have a chance of having unique keys, then it's faster. Say, in a lookup table to avoid some expensive calculation; of course, if the amount of data is limited, it might be questionable to use a hash table to begin with.

    Quote Originally Posted by MK27 View Post
    They are a feature, because they are what permit hash tables to be dynamic. A clever implementation tries to deal with collisions cleverly, which means minimizing them to extent, not eliminating them.
    Well, a hash table does not always have to be dynamic; it again depends on the requirements.
    You are just playing with words here: minimizing them but not getting rid of them, you say. So part of the purpose is to make sure they are minimized, but not made unique? Clearly, it might be impossible to avoid collisions, but that does not mean that it wouldn't be desirable.


    Quote Originally Posted by MK27 View Post
    Right, this is why it is misguided and naive to suggest that "eliminating all collisions" is a good approach to implementing a hash table. In fact "a way to guarantee a unique number for an infinite amount of possible strings" is not impossible, but is it not worthwhile, either. It's much better to focus on how collisions are dealt with rather than pursue such an alternative.
    I agree with this, for a general purpose implementation meant to be used as a default datatype of sorts. But different problems present different opportunities to choose an implementation; it might be that your data is limited, static, and has qualities that allow for unique key generation. That should not be dismissed by default either, in my opinion.

    Quote Originally Posted by MK27 View Post
    Sure I did, that's what I meant by "circumstances", and I said the circumstances where a collisionless hash table is the best solution are much more limited. Eg, it requires a nearly or completely static table.
    Yes, or you can predict your input, or your input has a limited range, and so on.


    Quote Originally Posted by MK27 View Post
    Okay. That's not what the OP asserted however; stabu's initial definition states the key to an entry is unique for each word, which would mean a hash table never has collisions. I'm just trying to point out the (very important) fact that most hash tables involve collisions.
    Agreed.
    Last edited by Subsonics; 04-28-2010 at 11:42 AM.

  12. #12
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    I was mostly concerned about leading the OP astray, because I've seen other people get caught up in this idea of using a unique key *before* they try a method of resolving collisions, because they think that will be better, when in most cases, it is not.

    Quote Originally Posted by Subsonics View Post
    I'm talking about choosing the right implementation for the circumstance, if your data allows for it, and you have a chance of having unique keys, then it's faster.
    That's a lot of ifs. It is probably a good idea to draw a line, such as the number of items in a linked list or b-tree bucket you would consider "slow enough" to merit a change. I don't think that limit should be 1, or anything less than 10, and probably closer to 100 for a b-tree. That means "collision free" is not a meaningful performance goal, but having a maximum bucket size might be.

    Even with a completely static table -- all input pre-ordained -- I bet you a two-layer approach (eg, a hash table of hash tables: the first table obviously has collisions, and the table for each node uses a different hash and has no collisions) is a way better approach than trying to come up with a means of generating all unique keys for the primary table, because all it costs is one more modulus operation.

    Well, a hash table does not always have to be dynamic it again depends on the requirements.
    I think most (actually, nearly all) uses for a hash table involve dynamic content.

    You are just playing with words here, minimizing them but not get rid of them you say, so, part of the purpose is to make sure they are minimized but not made unique? Clearly, it might be impossible to avoid collisions, but that does not mean that it wouldn't be desirable.
    I was concerned about the word unique, yeah, because generating unique integer IDs for strings is not the goal of a hash table (read the OP). But you're right, saying they should never be unique is a bit strict (I did edit that post, btw).
    Last edited by MK27; 04-28-2010 at 01:17 PM.

  13. #13
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    You can think of a dictionary as being a hash table. The key is the first letter of the word. In the English language, that means you get 26 buckets. Now, there will be a lot of collisions, since there are more than 26 words in the English language, so you have to resolve them somehow.

    We do that by sorting them alphabetically. Think about if you didn't have a hash table for your dictionary. What if you wanted to look up the word 'cat'? How would you do it? Well, assuming the dictionary was just one big long list of words, you'd start at the beginning and read through every word until you found cat. But since we've split them up into groups based on the first letter, you can jump right to the C section, and start reading through from there.

    There, hash tables explained simply.


    Quzah.
    Hope is the first step on the road to disappointment.

  14. #14
    Registered User
    Join Date
    Sep 2007
    Posts
    1,012
    would you be able to mention any salient examples of applications where the hash table does some major work?
    I can give you a very simple example of where I used a hash table to massively speed up a particular piece of code. This code traverses a filesystem, and needs to keep track of which inodes it has seen. On my system, inodes are 64-bit, which means that an inode might have a value even greater than 1.8e19.

    One simple way of keeping track of inodes is to push them onto a stack as they're seen. A basic linked list can take care of that. But here's the problem: each time an inode is found, it needs to be checked against all others to see if it's already been recorded. So as the list grows, you need to potentially traverse each element; what if I'm scanning a million files? By the end, you're running through a linked list with a million elements. If you've ever wondered what O(n) meant, this is it.

    A fast option would be to use an array with the inode as the index. Arrays are quick access (just an addition and dereference). But with 1.8e19 possible entries, I really don't think I could create an array that big. So I combined the two approaches. There are other possibilities, but this method was extremely simple and is quite fast.

    First, I created an array of 8192 elements, each of which is (potentially) the head of a linked list. Every time I find an inode, I take the remainder of it divided by 8192, and use that to index my array. I then create a linked list at that location, storing all inodes that hash to that particular value (so inode 10000 and inode 18192 are both linked to the entry 1808).

    The division/remainder is fast, and the indexing into the array is fast. The only slow part would be the linear search through the inodes at each array entry (I don't sort them because it doesn't really matter for my application). How slow? Well, assuming one million inodes as previously, on average each bucket of the hash will contain around 120 inodes. Scanning 120 elements each time you find an inode is much more efficient than scanning a million.

    Now, my hashing algorithm (if you can even call it that) is very, very simple, and not completely uniform; but when I tested it, it distributed inodes (at least on an XFS filesystem) quite well, and was certainly fast enough for my needs. Much, much faster than the single linked list (and practical, unlike the single array).

    This shows not only how a hash might be faster than other algorithms, but also how you needn't necessarily use a string to do a key lookup, as Prelude mentioned. I'm simply turning a large number into a smaller number.

  15. #15
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    My view of a hash table is that it is an encoding of data elements into a more compact form.
    That's it.

    The presumption is that there will be a significant advantage in manipulating (usually searching) the compacted elements. Faster speed and smaller storage, just to name the most obvious.

    Yet most fast schemes which convert one representation into a smaller fixed-size one are akin to lossy compression. Hence all the talk about collisions: resolving cases where multiple original elements mapped to a single destination element, which would render them ambiguous without additional stack/chain/linked-list/tree structures.

