Thread: Help with a hashing algorithm!

  1. #1
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665

    Help with a hashing algorithm!

    Okay, so here's what I want out of the algorithm :

    I want to hash 3 integers into one unique key. They input should be degenerate as in a, b, c hash to the same value as c, b , a is the same as b, c, a, etc.

    I have a prototype now but I'm getting collisions.

    Basically, I googled and thought that I should reverse the bits of the operands and add them all together assuming that each int would hash to a unique value.

    Here's my algorithm now :
    Code:
    unsigned long reverse(unsigned x)
    {
        const int prime_factor = 17;
        const int prime_offset = 3;
        unsigned long x_copy = x * prime_factor + prime_offset;
        assert(x_copy <= UINT_MAX);
        x = x_copy;
    
    
        x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
        x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
        x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
        x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
        return((x >> 16) | (x << 16));
    }
    
            // hash faces
            unsigned long hash[4] = { 0 };
    
    
            hash[0] = reverse(t.w) + reverse(t.z) + reverse(t.y); // 321
            hash[1] = reverse(t.x) + reverse(t.z) + reverse(t.w); // 023
            hash[2] = reverse(t.x) + reverse(t.w) + reverse(t.y); // 031
            hash[3] = reverse(t.x) + reverse(t.y) + reverse(t.z); // 012
    Here t is a structure with members, x, y, z, w.

    Here is some example of collisions :
    Code:
    Found bad tuple!
    4 => 58 2 45 13 => 3279945728
    5 => 42 45 1 29 => 3279945728
    11 => 0 58 45 13 => 3279945728
    Found bad tuple!
    5 => 42 45 1 29 => 3279945728
    11 => 0 58 45 13 => 3279945728
    15 => 0 45 42 29 => 3279945728
    How do I fix this?
    Last edited by MutantJohn; 01-30-2015 at 07:38 PM.

  2. #2
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    By definition, you can not have hashing without collisions.
    Your hash table has to account for those.

    (And inventing your own hash function is not a good idea, as I understand it.)

    Here is the approach I use, slightly modified for your question.
    Code:
    #include <iostream>
    #include <algorithm>
    #include <functional>
    #include <array>
    int pair(int x, int y, int mod = 99721 ) // put a much larger value here
    {
        int result = (x + y)*(x + y + 1)/2 + y;
        return result % mod;
    }
    int hash(std::array<int, 3> input)
    {
        std::sort(input.begin(), input.end());
        std::transform(input.begin(), input.end(), input.begin(), std::hash<int>());
        return pair(input[0], pair(input[1], input[2])); 
    }
    int main()
    {
        std::cout << hash({1,2,3})<<' '<< hash({2,3,1});
    }
    (Everyone welcome to point out flaws)
    Last edited by manasij7479; 01-30-2015 at 09:50 PM.

  3. #3
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Hey, thanks for the reply!

    I'll try out your algorithm but unfortunately, I can't use std::hash. I know, very lame. But I looked it up and I guess the C++ library uses MurmurHash as its crux algorithm so I'm going to try setting this up first.

    Edit : The code doesn't seem to be behaving for me properly. I'll get some matches but they're few and way too far inbetween for it to be accurate for me. Hmm... Idk what I did.

    In my code, I'm making lots of sets of three. The problem I have now is that while the hash of 3 ints will be degenerate, I get random combinations that sum up to the same stuff when not all the numbers match.
    Last edited by MutantJohn; 01-31-2015 at 12:58 AM.

  4. #4
    Lurker
    Join Date
    Dec 2004
    Posts
    296
    Quote Originally Posted by MutantJohn View Post
    Hey, thanks for the reply!

    I'll try out your algorithm but unfortunately, I can't use std::hash. I know, very lame. But I looked it up and I guess the C++ library uses MurmurHash as its crux algorithm so I'm going to try setting this up first.

    Edit : The code doesn't seem to be behaving for me properly. I'll get some matches but they're few and way too far inbetween for it to be accurate for me. Hmm... Idk what I did.

    In my code, I'm making lots of sets of three. The problem I have now is that while the hash of 3 ints will be degenerate, I get random combinations that sum up to the same stuff when not all the numbers match.
    We don't know what you did either, because you didn't show us the new code...

    As manasij7479 said, if you hash something, then you are bound to get collisions. If your hash function is bad, then everything can hash to the same key. A good hash algorithm will spread your data uniformly across the space, but you will still get collisions.

  5. #5
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Let's start at the beginning of the problem.

    You obviously have N points, and you are forming triplets out of them (avoiding degenerate triplets, i.e. no triplets with the same point more than once). The problem is that you need to know if the triplet has already been defined. Since there are N(N-1)(N-2) possible triplets, this is an efficiency issue; a bottleneck.

    A simple way to find out if a triplet i1, i2, i3 is unique, is to sort the points so that i1 < i2 < i3. You maintain a sorted array of the i1 that occur in currently known triplets, with each entry referring to an array of sorted i2 that occurred in triplets with that i1 , each entry in that second array referring to an array of sorted i3 occurring in the triplets with the parent i1, i1. The data structures you need could be something similar to
    Code:
    struct inner_array {
        size_t  elements;
        int *id;
    };
    
    struct middle_array {
        size_t  elements;
        int *id;
        struct inner_array *array;
    };
    
    struct outer_array {
        size_t  elements;
        int  *id;
        struct middle_array *array;
    };
    The idea being that you keep the indexes in id sorted, with each element having a corresponding pointer in array, so you can look for the specific index using a binary search. To make that binary search as efficient as possible, you keep the pointers separate, to increase the cache locality during the binary search. (A binary search is already nasty cache-access wise, and chunking one to group levels to cacheline sizes leads to write-only code that is hard for compilers to optimize; in my opinion, keeping the arrays separate is optimization enough here.)

    Doing at least one binary search for each triplet is now the efficiency-limiting bottleneck. Here, a hash table can be used to usually avoid the binary search.

    Note: You use the hash table to address the question "Is it possible we've already seen this triplet?"

    If you do not need to know the identities of the triplets for anything besides avoiding repeating them, and you're not too tight on memory, you can just use the hash table entry to list all the triplets that hashed to that entry. Then, you don't need the nested-array structure above at all!


    We've now come a full circle, let's look at the properties of the hash needed.

    If we don't know anything useful about the distribution of the indexes hashed, and especially if we don't know the N, we're best off with a hash function with a good avalanche effect -- any bit changing in the input would change half the bits in the hash --, and for uniform random inputs, have an uniform per-bit result distribution.

    One quirk here is that sorting the three indexes is actually slow. (It requires either hardware min/max support, or conditional jumps. SSE on x86 and x86-64 does have integer min/max, but the setup to move the indexes to a vector register, then back, makes it probably not worth the effort.)
    This means that a commutative hash function ( hash(a,b,c) == hash(a,c,b) == hash(b,a,c) == hash(b,c,a) == hash(c,a,b) == hash(c,b,a)) will let you postpone sorting the indexes. (You'll still have to sort them if the hash matches, to compare against the other hashes.)

    In C, ^ | & + operators are all fast and commutative. * is commutative, but if both terms are variables, tends to be comparatively slow. Personally, I'd investigate the following type of hash function:
    Code:
    static inline unsigned int hash3(const unsigned int i1, const unsigned int i2, const unsigned int i3)
    {
        return (i1 ^ i2 ^ i3) * FACTOR1 + (i1 | i2 | i3) * FACTOR2 + (i1 & i2 & i3) * FACTOR3;
    }
    where the constant factors are chosen to have about half the bits set, so that any change will change the resulting hash enough. You can also experiment with adding fourth term ((i1 + i2 + i3) * FACTOR4) and experiment changing the three additions in between the four multiplications to xor or or, too.

    Another option is to take the four expressions -- i1^i2^i3, i1|i2|i3, i1&i2&3,and i1+i2+i3 --and look at doing a few rounds of a cryptographic hash function on them (bit shifts and adds and xors to mix up the term better than a simple multiplication).

    The rate of collisions in this function is pretty critical; in particular, for uniformly random inputs, you'll want the collisions to be rather uniform, too, and not cluster on specific results.

    The data structure you could use could be
    Code:
    typedef struct {
        unsigned int hash;
        int          i1;
        int          i2; /* i1 < i2 < i3 */
        int          i3;
    } triplet;
    
    typedef struct {
        unsigned int num_allocated;
        unsigned int num_triplets;
        triplet triplets[];
    } hash_entry;
    
    typedef struct {
        unsigned int  size;
        hash_entry **entry;
    } hash_table;
    The idea above is that you always store the hash for each triplet, but use hash % size to pick the hash table entry. You initialize all entries to NULL, and allocate memory for each entry as it grows. When the entry grows too large (remember cache architecture; having something like say up to 8 triplets in a single entry should not be appreciably slower than having just one entry), you can regenerate the hash table without recalculating any of the hashes (because you stored the original ones), just by creating a new hash_table with a suitable number of entries, adding all the old hashes to the new one, then discarding the old one.

    Questions?

  6. #6
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Okay, I tried your advice Nominal and it seems like the algorithm is working much better now.

    But I'm still getting a collision though it seems at a much lower rate right now.

    Here's my current code :
    Code:
    // brute force sort (optimize later)
    __device__
    void sort(int *arr)
    {
        for (int i = 0; i < 3; ++i)  
        {  
            for (int j = 0; j < 3 - i - 1; ++j)
            {
                if (arr[j] > arr[j + 1])  
                {  
                    const int temp = arr[j];  
                    arr[j] = arr[j + 1];  
                    arr[j + 1] = temp;  
                }
            }
        } 
    }
    
    // this passes the assert
    __device__
    unsigned hash3(unsigned i1, 
                   unsigned i2, 
                   unsigned i3)
    {
        assert(i1 <= i2 && i2 <= i3);
    
    
        i1 = __brev(i1); // reverse bits
        i2 = __brev(i2);
        i3 = __brev(i3);
     
        return (i1 ^ i2 ^ i3) * __popc((i1 ^ i2 ^ i3)) +  // here __popc returns the number of set bits
               (i1 | i2 | i3) * __popc((i1 | i2 | i3)) + 
               (i1 & i2 & i3) * __popc((i1 & i2 & i3)) +
               (i1 + i2 + i3) * __popc((i1 + i2 + i3)) +
               i1 % 99751 + i2 % 37 + i3 % 23;
    }
    
    
    
           //load in point values
            int pts[4][3] = { { t.z, t.z, t.y },
                              { t.x, t.z, t.w },
                              { t.x, t.w, t.y },
                              { t.x, t.y, t.z } };
    
    
            sort(pts[0]);
            sort(pts[1]);
            sort(pts[2]);
            sort(pts[3]);
        
            hash[0] = hash3(pts[0][0], pts[0][1], pts[0][2]);
            hash[1] = hash3(pts[1][0], pts[1][1], pts[1][2]);
            hash[2] = hash3(pts[2][0], pts[2][1], pts[2][2]);
            hash[3] = hash3(pts[3][0], pts[3][1], pts[3][2]);
    Basically, I'm trying to come up with a massively parallel mesh linking technique in CUDA. My idea was simple, hash each face of the tetrahedron and then sort the output and if two adjacent values are equal, we can link the tetrahedra through that face. This is also a fantastic error check because it will catch any instances of duplicated tetrahedra in the mesh.

    As of now, it's these two triples that are giving me errors :
    Code:
    { 0, 40, 45 }
    /* and */
    {0, 51, 67  }
    Granted, the rate of collision is much lower now so I might just settle with dealing with these collisions manually.

  7. #7
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Okay, maybe someone can help me with this. Why is MurmurHash3 giving me different answers for the same data?

    Code:
    //-----------------------------------------------------------------------------
    // MurmurHash3 was written by Austin Appleby, and is placed in the public
    // domain. The author hereby disclaims copyright to this source code.
    
    
    // Note - The x86 and x64 versions do _not_ produce the same results, as the
    // algorithms are optimized for their respective platforms. You can still
    // compile and run any of them on any platform, but your performance with the
    // non-native version will be less than optimal.
    
    
    #include <stdint.h>
    #include <stdlib.h>
    
    
    #include <iostream>
    
    
    inline uint32_t rotl32 ( uint32_t x, int8_t r )
    {
      return (x << r) | (x >> (32 - r));
    }
    
    
    inline uint64_t rotl64 ( uint64_t x, int8_t r )
    {
      return (x << r) | (x >> (64 - r));
    }
    
    
    #define ROTL32(x,y)     rotl32(x,y)
    #define ROTL64(x,y)     rotl64(x,y)
    
    
    #define BIG_CONSTANT(x) (x##LLU)
    
    
    
    
    
    
    //-----------------------------------------------------------------------------
    // Block read - if your platform needs to do endian-swapping or can only
    // handle aligned reads, do the conversion here
    
    
    uint32_t getblock ( const uint32_t * p, int i )
    {
      return p[i];
    }
    
    
    uint64_t getblock ( const uint64_t * p, int i )
    {
      return p[i];
    }
    
    
    //-----------------------------------------------------------------------------
    // Finalization mix - force all bits of a hash block to avalanche
    
    
    uint32_t fmix ( uint32_t h )
    {
      h ^= h >> 16;
      h *= 0x85ebca6b;
      h ^= h >> 13;
      h *= 0xc2b2ae35;
      h ^= h >> 16;
    
    
      return h;
    }
    
    
    //----------
    
    
    uint64_t fmix ( uint64_t k )
    {
      k ^= k >> 33;
      k *= BIG_CONSTANT(0xff51afd7ed558ccd);
      k ^= k >> 33;
      k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
      k ^= k >> 33;
    
    
      return k;
    }
    
    
    
    
    void MurmurHash3_x64_128 ( const void * key, const int len,
                               const uint32_t seed, void * out )
    {
      const uint8_t * data = (const uint8_t*)key;
      const int nblocks = len / 16;
    
    
      uint64_t h1 = seed;
      uint64_t h2 = seed;
    
    
      uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
      uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
    
    
      //----------
      // body
    
    
      const uint64_t * blocks = (const uint64_t *)(data);
    
    
      for(int i = 0; i < nblocks; i++)
      {
        uint64_t k1 = getblock(blocks,i*2+0);
        uint64_t k2 = getblock(blocks,i*2+1);
    
    
        k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
    
    
        h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;
    
    
        k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
    
    
        h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
      }
    
    
      //----------
      // tail
    
    
      const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
    
    
      uint64_t k1 = 0;
      uint64_t k2 = 0;
    
    
      switch(len & 15)
      {
      case 15: k2 ^= uint64_t(tail[14]) << 48;
      case 14: k2 ^= uint64_t(tail[13]) << 40;
      case 13: k2 ^= uint64_t(tail[12]) << 32;
      case 12: k2 ^= uint64_t(tail[11]) << 24;
      case 11: k2 ^= uint64_t(tail[10]) << 16;
      case 10: k2 ^= uint64_t(tail[ 9]) << 8;
      case  9: k2 ^= uint64_t(tail[ 8]) << 0;
               k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
    
    
      case  8: k1 ^= uint64_t(tail[ 7]) << 56;
      case  7: k1 ^= uint64_t(tail[ 6]) << 48;
      case  6: k1 ^= uint64_t(tail[ 5]) << 40;
      case  5: k1 ^= uint64_t(tail[ 4]) << 32;
      case  4: k1 ^= uint64_t(tail[ 3]) << 24;
      case  3: k1 ^= uint64_t(tail[ 2]) << 16;
      case  2: k1 ^= uint64_t(tail[ 1]) << 8;
      case  1: k1 ^= uint64_t(tail[ 0]) << 0;
               k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
      };
    
    
      //----------
      // finalization
    
    
      h1 ^= len; h2 ^= len;
    
    
      h1 += h2;
      h2 += h1;
    
    
      h1 = fmix(h1);
      h2 = fmix(h2);
    
    
      h1 += h2;
      h2 += h1;
    
    
      ((uint64_t*)out)[0] = h1;
      ((uint64_t*)out)[1] = h2;
    }
    
    
    
    
    int main(void)
    {
        unsigned key1[4] = { 24, 45, 1, 0 };
        uint64_t hash1[2] = { 0 };
    
    
        MurmurHash3_x64_128(key1, 128, 37, hash1);
        std::cout << hash1[0] << hash1[1] << std::endl;
    
    
        MurmurHash3_x64_128(key1, 128, 37, hash1);
        std::cout << hash1[0] << hash1[1] << std::endl;
    
    
        return 0;
    }
    Sample output :
    Code:
    // trial 1
    118762093629132207883948936087444433227
    157199471080621039609401770861441920942
    
    // trial 2
    58400437854147964114156455005320558265
    364902726158230048418441744983965199973
    
    // trial 3
    1696412492942147158513939515530920709014
    134167512960876324493927275459590902960

  8. #8
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by MutantJohn View Post
    Code:
        unsigned key1[4] = { 24, 45, 1, 0 };
        MurmurHash3_x64_128(key1, 128, 37, hash1);
    The len parameter is in bytes, not bits. You declare a key of 4*32 bits (assuming a typical architecture with 32-bit unsigned ints), i.e. 16 bytes, but tell your hash function it should hash 128 bytes. So, you have a classic buffer overrun. Because you use values from the stack, and those change after every call, you get different results. If you fix the length from 128 to 16, you'll get consistent results (by not using a semirandom stack-garbage key).

  9. #9
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Omg, you were so right. That's funny because as I was downloading the code I was thinking to myself what that parameter was. Was it bits or bytes? And I didn't see anything that said either way so I just assumed bits. Thank you for clearing that up!

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. string hashing algorithm
    By cyberfish in forum Tech Board
    Replies: 4
    Last Post: 04-11-2009, 12:25 PM
  2. Replies: 8
    Last Post: 09-11-2006, 11:26 AM
  3. simple hashing algorithm??
    By iori in forum C Programming
    Replies: 7
    Last Post: 04-14-2003, 05:18 PM
  4. hashing
    By condorx in forum C++ Programming
    Replies: 2
    Last Post: 01-31-2003, 09:36 PM
  5. Hashing Help
    By Unregistered in forum C Programming
    Replies: 2
    Last Post: 06-05-2002, 01:14 PM