Help with a hashing algorithm!

**MutantJohn** · 01-30-2015

Okay, so here's what I want out of the algorithm :

I want to hash 3 integers into one unique key. They input should be degenerate as in a, b, c hash to the same value as c, b , a is the same as b, c, a, etc.

I have a prototype now but I'm getting collisions.

Basically, I googled and thought that I should reverse the bits of the operands and add them all together assuming that each int would hash to a unique value.

Here's my algorithm now :

Code:

unsigned long reverse(unsigned x)
{
    const int prime_factor = 17;
    const int prime_offset = 3;
    unsigned long x_copy = x * prime_factor + prime_offset;
    assert(x_copy <= UINT_MAX);
    x = x_copy;


    x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
    x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
    x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
    x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
    return((x >> 16) | (x << 16));
}

        // hash faces
        unsigned long hash[4] = { 0 };


        hash[0] = reverse(t.w) + reverse(t.z) + reverse(t.y); // 321
        hash[1] = reverse(t.x) + reverse(t.z) + reverse(t.w); // 023
        hash[2] = reverse(t.x) + reverse(t.w) + reverse(t.y); // 031
        hash[3] = reverse(t.x) + reverse(t.y) + reverse(t.z); // 012

Here t is a structure with members, x, y, z, w.

Here is some example of collisions :

Code:

Found bad tuple!
4 => 58 2 45 13 => 3279945728
5 => 42 45 1 29 => 3279945728
11 => 0 58 45 13 => 3279945728
Found bad tuple!
5 => 42 45 1 29 => 3279945728
11 => 0 58 45 13 => 3279945728
15 => 0 45 42 29 => 3279945728

How do I fix this?

**manasij7479** · 01-30-2015

By definition, you can not have hashing without collisions.
Your hash table has to account for those.

(And inventing your own hash function is not a good idea, as I understand it.)

Here is the approach I use, slightly modified for your question.

Code:

#include <iostream>
#include <algorithm>
#include <functional>
#include <array>
int pair(int x, int y, int mod = 99721 ) // put a much larger value here
{
    int result = (x + y)*(x + y + 1)/2 + y;
    return result % mod;
}
int hash(std::array<int, 3> input)
{
    std::sort(input.begin(), input.end());
    std::transform(input.begin(), input.end(), input.begin(), std::hash<int>());
    return pair(input[0], pair(input[1], input[2])); 
}
int main()
{
    std::cout << hash({1,2,3})<<' '<< hash({2,3,1});
}

(Everyone welcome to point out flaws)

**MutantJohn** · 01-31-2015

Hey, thanks for the reply!

I'll try out your algorithm but unfortunately, I can't use std::hash. I know, very lame. But I looked it up and I guess the C++ library uses MurmurHash as its crux algorithm so I'm going to try setting this up first.

Edit : The code doesn't seem to be behaving for me properly. I'll get some matches but they're few and way too far inbetween for it to be accurate for me. Hmm... Idk what I did.

In my code, I'm making lots of sets of three. The problem I have now is that while the hash of 3 ints will be degenerate, I get random combinations that sum up to the same stuff when not all the numbers match.

**Jimmy** · 01-31-2015

Originally Posted by MutantJohn

Hey, thanks for the reply!

I'll try out your algorithm but unfortunately, I can't use std::hash. I know, very lame. But I looked it up and I guess the C++ library uses MurmurHash as its crux algorithm so I'm going to try setting this up first.

Edit : The code doesn't seem to be behaving for me properly. I'll get some matches but they're few and way too far inbetween for it to be accurate for me. Hmm... Idk what I did.

In my code, I'm making lots of sets of three. The problem I have now is that while the hash of 3 ints will be degenerate, I get random combinations that sum up to the same stuff when not all the numbers match.

We don't know what you did either, because you didn't show us the new code...

As manasij7479 said, if you hash something, then you are bound to get collisions. If your hash function is bad, then everything can hash to the same key. A good hash algorithm will spread your data uniformly across the space, but you will still get collisions.

**Nominal Animal** · 01-31-2015

Let's start at the beginning of the problem.

You obviously have N points, and you are forming triplets out of them (avoiding degenerate triplets, i.e. no triplets with the same point more than once). The problem is that you need to know if the triplet has already been defined. Since there are N(N-1)(N-2) possible triplets, this is an efficiency issue; a bottleneck.

A simple way to find out if a triplet i₁, i₂, i₃ is unique, is to sort the points so that i₁ < i₂ < i₃. You maintain a sorted array of the i₁ that occur in currently known triplets, with each entry referring to an array of sorted i₂ that occurred in triplets with that i₁ , each entry in that second array referring to an array of sorted i₃ occurring in the triplets with the parent i₁, i₁. The data structures you need could be something similar to

Code:

struct inner_array {
    size_t  elements;
    int *id;
};

struct middle_array {
    size_t  elements;
    int *id;
    struct inner_array *array;
};

struct outer_array {
    size_t  elements;
    int  *id;
    struct middle_array *array;
};

The idea being that you keep the indexes in id sorted, with each element having a corresponding pointer in array, so you can look for the specific index using a binary search. To make that binary search as efficient as possible, you keep the pointers separate, to increase the cache locality during the binary search. (A binary search is already nasty cache-access wise, and chunking one to group levels to cacheline sizes leads to write-only code that is hard for compilers to optimize; in my opinion, keeping the arrays separate is optimization enough here.)

Doing at least one binary search for each triplet is now the efficiency-limiting bottleneck. Here, a hash table can be used to usually avoid the binary search.

Note: You use the hash table to address the question "Is it possible we've already seen this triplet?"

If you do not need to know the identities of the triplets for anything besides avoiding repeating them, and you're not too tight on memory, you can just use the hash table entry to list all the triplets that hashed to that entry. Then, you don't need the nested-array structure above at all!

We've now come a full circle, let's look at the properties of the hash needed.

If we don't know anything useful about the distribution of the indexes hashed, and especially if we don't know the N, we're best off with a hash function with a good avalanche effect -- any bit changing in the input would change half the bits in the hash --, and for uniform random inputs, have an uniform per-bit result distribution.

One quirk here is that sorting the three indexes is actually slow. (It requires either hardware min/max support, or conditional jumps. SSE on x86 and x86-64 does have integer min/max, but the setup to move the indexes to a vector register, then back, makes it probably not worth the effort.)
This means that a commutative hash function ( hash(a,b,c) == hash(a,c,b) == hash(b,a,c) == hash(b,c,a) == hash(c,a,b) == hash(c,b,a)) will let you postpone sorting the indexes. (You'll still have to sort them if the hash matches, to compare against the other hashes.)

In C, ^ | & + operators are all fast and commutative. * is commutative, but if both terms are variables, tends to be comparatively slow. Personally, I'd investigate the following type of hash function:

Code:

static inline unsigned int hash3(const unsigned int i1, const unsigned int i2, const unsigned int i3)
{
    return (i1 ^ i2 ^ i3) * FACTOR1 + (i1 | i2 | i3) * FACTOR2 + (i1 & i2 & i3) * FACTOR3;
}

where the constant factors are chosen to have about half the bits set, so that any change will change the resulting hash enough. You can also experiment with adding fourth term ((i1 + i2 + i3) * FACTOR4) and experiment changing the three additions in between the four multiplications to xor or or, too.

Another option is to take the four expressions -- i1^i2^i3, i1|i2|i3, i1&i2&3,and i1+i2+i3 --and look at doing a few rounds of a cryptographic hash function on them (bit shifts and adds and xors to mix up the term better than a simple multiplication).

The rate of collisions in this function is pretty critical; in particular, for uniformly random inputs, you'll want the collisions to be rather uniform, too, and not cluster on specific results.

The data structure you could use could be

Code:

typedef struct {
    unsigned int hash;
    int          i1;
    int          i2; /* i1 < i2 < i3 */
    int          i3;
} triplet;

typedef struct {
    unsigned int num_allocated;
    unsigned int num_triplets;
    triplet triplets[];
} hash_entry;

typedef struct {
    unsigned int  size;
    hash_entry **entry;
} hash_table;

The idea above is that you always store the hash for each triplet, but use hash % size to pick the hash table entry. You initialize all entries to NULL, and allocate memory for each entry as it grows. When the entry grows too large (remember cache architecture; having something like say up to 8 triplets in a single entry should not be appreciably slower than having just one entry), you can regenerate the hash table without recalculating any of the hashes (because you stored the original ones), just by creating a new hash_table with a suitable number of entries, adding all the old hashes to the new one, then discarding the old one.

Questions?

**MutantJohn** · 02-01-2015

Okay, I tried your advice Nominal and it seems like the algorithm is working much better now.

But I'm still getting a collision though it seems at a much lower rate right now.

Here's my current code :

Code:

// brute force sort (optimize later)
__device__
void sort(int *arr)
{
    for (int i = 0; i < 3; ++i)  
    {  
        for (int j = 0; j < 3 - i - 1; ++j)
        {
            if (arr[j] > arr[j + 1])  
            {  
                const int temp = arr[j];  
                arr[j] = arr[j + 1];  
                arr[j + 1] = temp;  
            }
        }
    } 
}

// this passes the assert
__device__
unsigned hash3(unsigned i1, 
               unsigned i2, 
               unsigned i3)
{
    assert(i1 <= i2 && i2 <= i3);


    i1 = __brev(i1); // reverse bits
    i2 = __brev(i2);
    i3 = __brev(i3);
 
    return (i1 ^ i2 ^ i3) * __popc((i1 ^ i2 ^ i3)) +  // here __popc returns the number of set bits
           (i1 | i2 | i3) * __popc((i1 | i2 | i3)) + 
           (i1 & i2 & i3) * __popc((i1 & i2 & i3)) +
           (i1 + i2 + i3) * __popc((i1 + i2 + i3)) +
           i1 % 99751 + i2 % 37 + i3 % 23;
}



       //load in point values
        int pts[4][3] = { { t.z, t.z, t.y },
                          { t.x, t.z, t.w },
                          { t.x, t.w, t.y },
                          { t.x, t.y, t.z } };


        sort(pts[0]);
        sort(pts[1]);
        sort(pts[2]);
        sort(pts[3]);
    
        hash[0] = hash3(pts[0][0], pts[0][1], pts[0][2]);
        hash[1] = hash3(pts[1][0], pts[1][1], pts[1][2]);
        hash[2] = hash3(pts[2][0], pts[2][1], pts[2][2]);
        hash[3] = hash3(pts[3][0], pts[3][1], pts[3][2]);

Basically, I'm trying to come up with a massively parallel mesh linking technique in CUDA. My idea was simple, hash each face of the tetrahedron and then sort the output and if two adjacent values are equal, we can link the tetrahedra through that face. This is also a fantastic error check because it will catch any instances of duplicated tetrahedra in the mesh.

As of now, it's these two triples that are giving me errors :

Code:

{ 0, 40, 45 }
/* and */
{0, 51, 67  }

Granted, the rate of collision is much lower now so I might just settle with dealing with these collisions manually.

**MutantJohn** · 02-01-2015

Okay, maybe someone can help me with this. Why is MurmurHash3 giving me different answers for the same data?

Code:

//-----------------------------------------------------------------------------
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.


// Note - The x86 and x64 versions do _not_ produce the same results, as the
// algorithms are optimized for their respective platforms. You can still
// compile and run any of them on any platform, but your performance with the
// non-native version will be less than optimal.


#include <stdint.h>
#include <stdlib.h>


#include <iostream>


inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
  return (x << r) | (x >> (32 - r));
}


inline uint64_t rotl64 ( uint64_t x, int8_t r )
{
  return (x << r) | (x >> (64 - r));
}


#define ROTL32(x,y)     rotl32(x,y)
#define ROTL64(x,y)     rotl64(x,y)


#define BIG_CONSTANT(x) (x##LLU)






//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here


uint32_t getblock ( const uint32_t * p, int i )
{
  return p[i];
}


uint64_t getblock ( const uint64_t * p, int i )
{
  return p[i];
}


//-----------------------------------------------------------------------------
// Finalization mix - force all bits of a hash block to avalanche


uint32_t fmix ( uint32_t h )
{
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;


  return h;
}


//----------


uint64_t fmix ( uint64_t k )
{
  k ^= k >> 33;
  k *= BIG_CONSTANT(0xff51afd7ed558ccd);
  k ^= k >> 33;
  k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
  k ^= k >> 33;


  return k;
}




void MurmurHash3_x64_128 ( const void * key, const int len,
                           const uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 16;


  uint64_t h1 = seed;
  uint64_t h2 = seed;


  uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
  uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);


  //----------
  // body


  const uint64_t * blocks = (const uint64_t *)(data);


  for(int i = 0; i < nblocks; i++)
  {
    uint64_t k1 = getblock(blocks,i*2+0);
    uint64_t k2 = getblock(blocks,i*2+1);


    k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;


    h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;


    k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;


    h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
  }


  //----------
  // tail


  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);


  uint64_t k1 = 0;
  uint64_t k2 = 0;


  switch(len & 15)
  {
  case 15: k2 ^= uint64_t(tail[14]) << 48;
  case 14: k2 ^= uint64_t(tail[13]) << 40;
  case 13: k2 ^= uint64_t(tail[12]) << 32;
  case 12: k2 ^= uint64_t(tail[11]) << 24;
  case 11: k2 ^= uint64_t(tail[10]) << 16;
  case 10: k2 ^= uint64_t(tail[ 9]) << 8;
  case  9: k2 ^= uint64_t(tail[ 8]) << 0;
           k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;


  case  8: k1 ^= uint64_t(tail[ 7]) << 56;
  case  7: k1 ^= uint64_t(tail[ 6]) << 48;
  case  6: k1 ^= uint64_t(tail[ 5]) << 40;
  case  5: k1 ^= uint64_t(tail[ 4]) << 32;
  case  4: k1 ^= uint64_t(tail[ 3]) << 24;
  case  3: k1 ^= uint64_t(tail[ 2]) << 16;
  case  2: k1 ^= uint64_t(tail[ 1]) << 8;
  case  1: k1 ^= uint64_t(tail[ 0]) << 0;
           k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
  };


  //----------
  // finalization


  h1 ^= len; h2 ^= len;


  h1 += h2;
  h2 += h1;


  h1 = fmix(h1);
  h2 = fmix(h2);


  h1 += h2;
  h2 += h1;


  ((uint64_t*)out)[0] = h1;
  ((uint64_t*)out)[1] = h2;
}




int main(void)
{
    unsigned key1[4] = { 24, 45, 1, 0 };
    uint64_t hash1[2] = { 0 };


    MurmurHash3_x64_128(key1, 128, 37, hash1);
    std::cout << hash1[0] << hash1[1] << std::endl;


    MurmurHash3_x64_128(key1, 128, 37, hash1);
    std::cout << hash1[0] << hash1[1] << std::endl;


    return 0;
}

Sample output :

Code:

// trial 1
118762093629132207883948936087444433227
157199471080621039609401770861441920942

// trial 2
58400437854147964114156455005320558265
364902726158230048418441744983965199973

// trial 3
1696412492942147158513939515530920709014
134167512960876324493927275459590902960

**Nominal Animal** · 02-01-2015

Originally Posted by MutantJohn

Code:

    unsigned key1[4] = { 24, 45, 1, 0 };
    MurmurHash3_x64_128(key1, 128, 37, hash1);

The len parameter is in bytes, not bits. You declare a key of 4*32 bits (assuming a typical architecture with 32-bit unsigned ints), i.e. 16 bytes, but tell your hash function it should hash 128 bytes. So, you have a classic buffer overrun. Because you use values from the stack, and those change after every call, you get different results. If you fix the length from 128 to 16, you'll get consistent results (by not using a semirandom stack-garbage key).

**MutantJohn** · 02-01-2015

Omg, you were so right. That's funny because as I was downloading the code I was thinking to myself what that parameter was. Was it bits or bytes? And I didn't see anything that said either way so I just assumed bits. Thank you for clearing that up!

Thread: Help with a hashing algorithm!

Thread Tools

Search Thread

Display

Help with a hashing algorithm!

Similar Threads

string hashing algorithm

What hashing algorithm outputs hash value as numbers only?

simple hashing algorithm??

hashing

Hashing Help