Thread: Hash function for two numbers

1. Hash function for two numbers

Hello,

I am writing a program that generates random graphs given the number of nodes and the number of edges. Right now I just generate two random numbers x and y and add the edge (x, y) if it doesn't already exist. Too check for existence, I use a hash table.

My first idea was to have a hash function that takes x and y as parameters, computes c = x + y then adds an entry in the hash table at position c % hash_table_size (a prime number, I chose 666013). This obviously generates a lot of collisions, and wouldn't generate different values for edges made of nodes that add up to the same value ((1, 3) would have the same hash as (2, 2)).

This idea generated an edge list and printed it to a file of a graph with 10 000 nodes and 1 000 000 edges in about 10 seconds. This is just to make an idea, I realise it's nowhere near a good benchmark.

Then, I quickly came up with this simple hash function:

Code:
```unsigned int getkey(unsigned int x, unsigned int y)
{
return ((x << 19) | (y << 7));
}```
This sped things up a lot. For the same input, the graph was generated and printed to file in about 3.6 seconds. Almost three times faster. Without printing to file, this approach takes about 1.8 seconds.

Now, I realise there are a lot more things involved in the performance of the program, like how I generate the random numbers (I just use x = 1 + (rand() * rand()) % n and y = same until the x and y generated are not present in the hash, then print that edge and hash it...). I just want to know if there's anything that I can improve about the hash function that might speed things up. I did some testing, and it seems my function generates mostly very high numbers, so I think maybe a lot of the 666013 keys go unused. For the same test I mentioned before, the longest chain seems to be around 12 on average, at the end of the program.

So, is there any good hash function for what I am trying to do? Any other place that I could optimize in?

2. I tried the hash functions listed there, but they don't seem to make any difference. Some even seem to be slower.

I guess I'll have to look into other areas I can optimze in.

Still, maybe someone knows a hash function that works best for two numbers?

3. What's the largest graph size you need to generate? If it's not that large, I would just store the graph as an adjacency table and forget the whole hash table idea. I just wrote a program that does this and it generated a 10,000 node / 1,000,000 edge graph in .36 seconds.

The only thing that I can think of off the top of my head that might work slightly better as a hash function would be:
Code:
```unsigned int getkey(unsigned int x, unsigned int y)
{
return (x << 16) | (y && 0xFFFF);
}```
which creates the key using the 16 lowest bits of each number.
This guarantees that all combinations of x and y values under 2^16 hash to a different key, which is great for any graph with less than 2^16 nodes. Of course, after modulus is applied they might still end up in the same bucket.

4. Hash functions usually are used on larger data items where the processing cost of caluclting the hash is significantly lower than the cost to compare the object with the target. In this case I don;t see the calculation as takign much less than actually comparing the vales unless you are using the hash to jump into the database thus skipping a lot of obvious mismatches.

5. I guess it wouldn't get much bigger than 10 000 nodes / 1 000 000 edges. Can you detail how you generated it using an adjacency table (did you mean adjacency lists?)? How did you make sure no edge is added twice? If you iterate a node's adjacency list for each edge you attempt to add, it's excruciatingly slow... so how did you do it?

The hash function you suggested has the same performance as mine as far as i can tell...

Maybe I'm missing something. I just generate two random numbers, consider them as defining an edge, checking if that edge was already added (this is where the hash table comes in), if not, add it to the hash, if yes, repeat.

6. Originally Posted by Sfel
My first idea was to have a hash function that takes x and y as parameters, computes c = x + y then adds an entry in the hash table at position c % hash_table_size (a prime number, I chose 666013). This obviously generates a lot of collisions, and wouldn't generate different values for edges made of nodes that add up to the same value ((1, 3) would have the same hash as (2, 2)).
Modding the hash with a prime number has always been a terrible idea. It's like putting on tons of deoderant after doing lots of sweaty exercise, instead of having a shower. You're mod with a prime number only attempts to make up for, and hide, the fact that the hash function is utter crap.
Code:
```unsigned int getkey(unsigned int x, unsigned int y)
{
return ((x << 19) | (y << 7));
}```
And here we have the crappy hash function.
Assuming 32 bits ints, we have:
Lowest 7 bits are always all zero.
26 of the 64 bits are thrown away completely.
13 of the bits are ored together, increasing the chance that those bits are ones, to 75%.

I absolutely always recommend using a CRC algorithm for the hash. I can't stress enough how good of a job it does as a hash function for a hash table. I've used it numerous times and the results are nothing short of excellent. It is also extremely fast using a lookup table.
Just treat the integers as a buffer of 8 bytes and hash all those bytes.
Also, use a hash table size that is always a power of two so that you can do a fast bit-masking with the resulting hash to find the bin for the item.

7. Originally Posted by Sfel
I guess it wouldn't get much bigger than 10 000 nodes / 1 000 000 edges. Can you detail how you generated it using an adjacency table (did you mean adjacency lists?)? How did you make sure no edge is added twice? If you iterate a node's adjacency list for each edge you attempt to add, it's excruciatingly slow... so how did you do it?

The hash function you suggested has the same performance as mine as far as i can tell...

Maybe I'm missing something. I just generate two random numbers, consider them as defining an edge, checking if that edge was already added (this is where the hash table comes in), if not, add it to the hash, if yes, repeat.
I did NOT mean an adjacency list. There are 3 main ways to store a graph:

2. Edge List

If your graph has N nodes and E edges then:

Using adjacency lists you would have an array of N node lists. The list at index i would contain all the nodes adjacent to node i.

Using an edge list you would just have a list of E edges, where each edge is just a pair of nodes.

Using an adjacency matrix you store an NxN grid of boolean values. If grid[a][b] is true, then there is an edge between nodes a and b. This is by far the most common graph representation I have ever seen because in general, among all the operations you can perform on a graph, it tends to be the fastest. However using O(n^2) space is very wasteful for large sparse graphs.

Generating a random graph with an adjacency matrix is very simple:
Code:
```graph = calloc(N*N, sizeof(int));  // create a large enough zero-filled graph

for (i = 0; i < E; i++)
node1 = pick a random node
node2 = pick a random node
if there isn't an edge between node1 and node2
add the edge between node1 and node2
else
decrement i```
Keep in mind that if you want bi-directional edges, then adding an edge between node 3 and node 5 would require 2 operations:

graph[2][5] = 1;
graph[5][2] = 1;

8. Oh, you meant adjacency matrix. I'm more used to this term, that's why I didn't figure that's what you meant when I read table, though I should've. Yes, this should be the fastest time-wise, although not memory wise.

I could get the memory down to N*N/32 by using one bit to store a 1/0, but that's still around 12 megs for 10 000 nodes. I don't know if this can be improved any further. If it could, this would definitely beat a hash table... Otherwise, I guess it depends on what kind of graph is needed.

Originally Posted by iMalc
I absolutely always recommend using a CRC algorithm for the hash. I can't stress enough how good of a job it does as a hash function for a hash table. I've used it numerous times and the results are nothing short of excellent. It is also extremely fast using a lookup table.
Just treat the integers as a buffer of 8 bytes and hash all those bytes.
Also, use a hash table size that is always a power of two so that you can do a fast bit-masking with the resulting hash to find the bin for the item.
Could you perhaps recommend a CRC algorithm that is relatively easy to implement? I'm not at all familiar with these. I tried some of the ones on the eternallyconfuzzled webpage, I changed the hash table size to a power of two, I hashed the 8 bytes separately and I got the same performance as my original, admittedly crappy, function. I'm surprised myself, I didn't expect that function to do so well.