Thread: Hash function for two numbers

  1. #1
    Registered User
    Join Date
    May 2006
    Posts
    50

    Hash function for two numbers

    Hello,

    I am writing a program that generates random graphs given the number of nodes and the number of edges. Right now I just generate two random numbers x and y and add the edge (x, y) if it doesn't already exist. Too check for existence, I use a hash table.

    My first idea was to have a hash function that takes x and y as parameters, computes c = x + y then adds an entry in the hash table at position c % hash_table_size (a prime number, I chose 666013). This obviously generates a lot of collisions, and wouldn't generate different values for edges made of nodes that add up to the same value ((1, 3) would have the same hash as (2, 2)).

    This idea generated an edge list and printed it to a file of a graph with 10 000 nodes and 1 000 000 edges in about 10 seconds. This is just to make an idea, I realise it's nowhere near a good benchmark.

    Then, I quickly came up with this simple hash function:

    Code:
    unsigned int getkey(unsigned int x, unsigned int y)
    {
       return ((x << 19) | (y << 7));
    }
    This sped things up a lot. For the same input, the graph was generated and printed to file in about 3.6 seconds. Almost three times faster. Without printing to file, this approach takes about 1.8 seconds.

    Now, I realise there are a lot more things involved in the performance of the program, like how I generate the random numbers (I just use x = 1 + (rand() * rand()) % n and y = same until the x and y generated are not present in the hash, then print that edge and hash it...). I just want to know if there's anything that I can improve about the hash function that might speed things up. I did some testing, and it seems my function generates mostly very high numbers, so I think maybe a lot of the 666013 keys go unused. For the same test I mentioned before, the longest chain seems to be around 12 on average, at the end of the program.

    So, is there any good hash function for what I am trying to do? Any other place that I could optimize in?

  2. #2
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  3. #3
    Registered User
    Join Date
    May 2006
    Posts
    50
    I tried the hash functions listed there, but they don't seem to make any difference. Some even seem to be slower.

    I guess I'll have to look into other areas I can optimze in.

    Still, maybe someone knows a hash function that works best for two numbers?

  4. #4
    Registered User
    Join Date
    Jan 2008
    Posts
    290
    What's the largest graph size you need to generate? If it's not that large, I would just store the graph as an adjacency table and forget the whole hash table idea. I just wrote a program that does this and it generated a 10,000 node / 1,000,000 edge graph in .36 seconds.

    The only thing that I can think of off the top of my head that might work slightly better as a hash function would be:
    Code:
    unsigned int getkey(unsigned int x, unsigned int y)
    {
       return (x << 16) | (y && 0xFFFF);
    }
    which creates the key using the 16 lowest bits of each number.
    This guarantees that all combinations of x and y values under 2^16 hash to a different key, which is great for any graph with less than 2^16 nodes. Of course, after modulus is applied they might still end up in the same bucket.

  5. #5
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Hash functions usually are used on larger data items where the processing cost of caluclting the hash is significantly lower than the cost to compare the object with the target. In this case I don;t see the calculation as takign much less than actually comparing the vales unless you are using the hash to jump into the database thus skipping a lot of obvious mismatches.

  6. #6
    Registered User
    Join Date
    May 2006
    Posts
    50
    I guess it wouldn't get much bigger than 10 000 nodes / 1 000 000 edges. Can you detail how you generated it using an adjacency table (did you mean adjacency lists?)? How did you make sure no edge is added twice? If you iterate a node's adjacency list for each edge you attempt to add, it's excruciatingly slow... so how did you do it?

    The hash function you suggested has the same performance as mine as far as i can tell...

    Maybe I'm missing something. I just generate two random numbers, consider them as defining an edge, checking if that edge was already added (this is where the hash table comes in), if not, add it to the hash, if yes, repeat.

  7. #7
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by Sfel View Post
    My first idea was to have a hash function that takes x and y as parameters, computes c = x + y then adds an entry in the hash table at position c % hash_table_size (a prime number, I chose 666013). This obviously generates a lot of collisions, and wouldn't generate different values for edges made of nodes that add up to the same value ((1, 3) would have the same hash as (2, 2)).
    Modding the hash with a prime number has always been a terrible idea. It's like putting on tons of deoderant after doing lots of sweaty exercise, instead of having a shower. You're mod with a prime number only attempts to make up for, and hide, the fact that the hash function is utter crap.
    Code:
    unsigned int getkey(unsigned int x, unsigned int y)
    {
       return ((x << 19) | (y << 7));
    }
    And here we have the crappy hash function.
    Assuming 32 bits ints, we have:
    Lowest 7 bits are always all zero.
    26 of the 64 bits are thrown away completely.
    13 of the bits are ored together, increasing the chance that those bits are ones, to 75%.

    I absolutely always recommend using a CRC algorithm for the hash. I can't stress enough how good of a job it does as a hash function for a hash table. I've used it numerous times and the results are nothing short of excellent. It is also extremely fast using a lookup table.
    Just treat the integers as a buffer of 8 bytes and hash all those bytes.
    Also, use a hash table size that is always a power of two so that you can do a fast bit-masking with the resulting hash to find the bin for the item.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  8. #8
    Registered User
    Join Date
    Jan 2008
    Posts
    290
    Quote Originally Posted by Sfel View Post
    I guess it wouldn't get much bigger than 10 000 nodes / 1 000 000 edges. Can you detail how you generated it using an adjacency table (did you mean adjacency lists?)? How did you make sure no edge is added twice? If you iterate a node's adjacency list for each edge you attempt to add, it's excruciatingly slow... so how did you do it?

    The hash function you suggested has the same performance as mine as far as i can tell...

    Maybe I'm missing something. I just generate two random numbers, consider them as defining an edge, checking if that edge was already added (this is where the hash table comes in), if not, add it to the hash, if yes, repeat.
    I did NOT mean an adjacency list. There are 3 main ways to store a graph:

    1. Adjacency List
    2. Edge List
    3. Adjacency Matrix (Table)

    If your graph has N nodes and E edges then:

    Using adjacency lists you would have an array of N node lists. The list at index i would contain all the nodes adjacent to node i.

    Using an edge list you would just have a list of E edges, where each edge is just a pair of nodes.

    Using an adjacency matrix you store an NxN grid of boolean values. If grid[a][b] is true, then there is an edge between nodes a and b. This is by far the most common graph representation I have ever seen because in general, among all the operations you can perform on a graph, it tends to be the fastest. However using O(n^2) space is very wasteful for large sparse graphs.

    Generating a random graph with an adjacency matrix is very simple:
    Code:
    graph = calloc(N*N, sizeof(int));  // create a large enough zero-filled graph
    
    for (i = 0; i < E; i++)
       node1 = pick a random node
       node2 = pick a random node
       if there isn't an edge between node1 and node2
          add the edge between node1 and node2
       else
          decrement i
    Keep in mind that if you want bi-directional edges, then adding an edge between node 3 and node 5 would require 2 operations:

    graph[2][5] = 1;
    graph[5][2] = 1;

  9. #9
    Registered User
    Join Date
    May 2006
    Posts
    50
    Oh, you meant adjacency matrix. I'm more used to this term, that's why I didn't figure that's what you meant when I read table, though I should've. Yes, this should be the fastest time-wise, although not memory wise.

    I could get the memory down to N*N/32 by using one bit to store a 1/0, but that's still around 12 megs for 10 000 nodes. I don't know if this can be improved any further. If it could, this would definitely beat a hash table... Otherwise, I guess it depends on what kind of graph is needed.

    Quote Originally Posted by iMalc
    I absolutely always recommend using a CRC algorithm for the hash. I can't stress enough how good of a job it does as a hash function for a hash table. I've used it numerous times and the results are nothing short of excellent. It is also extremely fast using a lookup table.
    Just treat the integers as a buffer of 8 bytes and hash all those bytes.
    Also, use a hash table size that is always a power of two so that you can do a fast bit-masking with the resulting hash to find the bin for the item.
    Could you perhaps recommend a CRC algorithm that is relatively easy to implement? I'm not at all familiar with these. I tried some of the ones on the eternallyconfuzzled webpage, I changed the hash table size to a power of two, I hashed the 8 bytes separately and I got the same performance as my original, admittedly crappy, function. I'm surprised myself, I didn't expect that function to do so well.

  10. #10
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  11. #11
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. In over my head
    By Shelnutt2 in forum C Programming
    Replies: 1
    Last Post: 07-08-2008, 06:54 PM
  2. Undefined Reference Compiling Error
    By AlakaAlaki in forum C++ Programming
    Replies: 1
    Last Post: 06-27-2008, 11:45 AM
  3. We Got _DEBUG Errors
    By Tonto in forum Windows Programming
    Replies: 5
    Last Post: 12-22-2006, 05:45 PM
  4. Contest Results - May 27, 2002
    By ygfperson in forum A Brief History of Cprogramming.com
    Replies: 18
    Last Post: 06-18-2002, 01:27 PM
  5. Interface Question
    By smog890 in forum C Programming
    Replies: 11
    Last Post: 06-03-2002, 05:06 PM