Thread: Very fast hashing of single integer

  1. #1
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,226

    Very fast hashing of single integer

    I am looking for a very fast (at most a few CPU cycles) algorithm to hash an 64-bit integer into another 64-bit integer.

    The main objective is that similar input values should produce very different output values.

    It does not need to be cryptographically secure.

    One obvious approach is to pre-compute a random number for each possible input value, and use that as the hash, but the table would be way too big for 64-bit.

    It can be done for each part (for example, 16-bit parts) instead, with the results xor-ed together, but that's quite a few instructions -

    Code:
    rand0[4][65536]; // all randomly generated
    
    uint16_t parts[4];
    
    parts[0] = x & 0xffff;
    parts[1] = (x >> 16) & 0xffff;
    parts[2] = (x >> 32) & 0xffff;
    parts[3] = (x >> 48) & 0xffff;
    
    hash = rand[0][parts[0]] ^ rand[1][parts[1] ^ rand[2][parts[2] ^ rand[3][part[3]]
    It also has the problem that the big rand table takes up 2MB, and will generate many cache misses.

    Does anyone know of a faster way?

    Any algorithm with L1 misses will probably be too slow, since an L1 miss is about 10 cycles, or about 40 simple arithmetic instructions (4 instructions per cycle on most modern Intel CPUs).

  2. #2
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,226
    Also, the input domain is numbers with up to 8 arbitrary bits set.

  3. #3
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    You aren't mixing any bits of the source that way.

    *shrug*

    I guess one would try using smaller primes to mutate bytes or words within the string if one required a hash in the fewest instructions possible.

    Code:
    unsigned long long Hash
    (
        unsigned long long fValue
    )
    {
        const unsigned long long kPrime = (~0 - 83 + 1);
        const unsigned long long kMultiplier = 0x00F100F100F100F1;
        fValue ^= fValue * kMultiplier;
        fValue ^= fValue * kPrime;
        return(fValue);
    }
    I don't really like such an idea without serious testing, but I like not using the actual value in the mix even less.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  4. #4
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Is something like MurmurHash too slow?

  5. #5
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,226
    Quote Originally Posted by phantomotap View Post
    O_o

    You aren't mixing any bits of the source that way.

    *shrug*

    I guess one would try using smaller primes to mutate bytes or words within the string if one required a hash in the fewest instructions possible.

    Code:
    unsigned long long Hash
    (
        unsigned long long fValue
    )
    {
        const unsigned long long kPrime = (~0 - 83 + 1);
        const unsigned long long kMultiplier = 0x00F100F100F100F1;
        fValue ^= fValue * kMultiplier;
        fValue ^= fValue * kPrime;
        return(fValue);
    }
    I don't really like such an idea without serious testing, but I like not using the actual value in the mix even less.

    Soma
    Thanks, will definitely give that a try.

    Is there anything fundamentally wrong with not using bits in the input, if it satisfies all the properties of a good hash (or is there a property that doesn't satisfy)?

  6. #6
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,226
    Quote Originally Posted by MutantJohn View Post
    Is something like MurmurHash too slow?
    I have never heard of it but just looked it up. Pretty cool!

    It does seem to be a little too slow, though.

  7. #7
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Is there anything fundamentally wrong with not using bits in the input, if it satisfies all the properties of a good hash (or is there a property that doesn't satisfy)?
    O_o

    I realize that the circumstances would be unlikely, but the generated values could conceivably cancel out in any of several ways.

    Let's pretend that `rand[1][$1] ^ rand[2][$2] ^ rand[3][$3]' evaluates to `$4' regardless of the corresponding values.

    If any other `rand[1][$4] ^ rand[2][$5] ^ rand[3][$6]' evaluates to `$4', the lowest bits are solely responsible for changing the hashed value.

    Yes. I again do realize the circumstances would be unlikely. I just don't like the idea than any bit could ever be excluded from changing the computed hash.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  8. #8
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,226
    Quote Originally Posted by phantomotap View Post
    O_o

    I realize that the circumstances would be unlikely, but the generated values could conceivably cancel out in any of several ways.

    Let's pretend that `rand[1][$1] ^ rand[2][$2] ^ rand[3][$3]' evaluates to `$4' regardless of the corresponding values.

    If any other `rand[1][$4] ^ rand[2][$5] ^ rand[3][$6]' evaluates to `$4', the lowest bits are solely responsible for changing the hashed value.

    Yes. I again do realize the circumstances would be unlikely. I just don't like the idea than any bit could ever be excluded from changing the computed hash.

    Soma
    Is there a theoretical basis for this being a bad thing?

    In this case, I wouldn't say the higher bits are excluded from changing the computed hash. If they were different, the computed hash would be different, so they had just as much of a role in determining the hash as the lower bits.

  9. #9
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Is there a theoretical basis for this being a bad thing?
    O_o

    Yes. Of course, the research is focused on cryptography.

    Yes. I know that you don't care about the strength of the function.

    I only said I don't like the idea.

    In this case, I wouldn't say the higher bits are excluded from changing the computed hash. If they were different, the computed hash would be different, so they had just as much of a role in determining the hash as the lower bits.
    No. The higher order bits would not play any role in determining the hash in situation I described.

    You should really just read the explanation again if you don't understand the situation I described, but I will offer an example.

    Code:
    // ...
    Table[1][0] = 641671;
    // ...
    Table[2][0] = 751345;
    // ...
    Table[3][0] = 179318;
    // ...
    Code:
    Value = Value & 0xffff;
    // ...
    Word[0] = Value & 0xffff; // Word[0] = Value;
    Word[1] = (Value >> 16) & 0xffff; // Word[1] = 0;
    Word[2] = (Value >> 32) & 0xffff; // Word[2] = 0;
    Word[3] = (Value >> 48) & 0xffff; // Word[3] = 0;
    // ...
    Fragment = Table[1][Word[1]] ^ Table[2][Word[2]] ^ Table[3][Word[3]]; // Fragment = 0;
    // ...
    Hash = Fragment ^ Table[Word[0]]; // Hash = Table[Word[0]];
    Code:
    // ...
    Table[3][(1 << 15)] = 179318;
    // ...
    Code:
    Value = Value & 0xffff;
    // ...
    Value |= 1 << 63;
    // ...
    Word[0] = Value & 0xffff; // Word[0] = Value;
    Word[1] = (Value >> 16) & 0xffff; // Word[1] = 0;
    Word[2] = (Value >> 32) & 0xffff; // Word[2] = 0;
    Word[3] = (Value >> 48) & 0xffff; // Word[3] = 1 << 15;
    // ...
    Fragment = Table[1][Word[1]] ^ Table[2][Word[2]] ^ Table[3][Word[3]]; // Fragment = 0;
    // ...
    Hash = Fragment ^ Table[Word[0]]; // Hash = Table[Word[0]];
    Code:
    Value1 = Value & 0xffff;
    Value2 = Value1 | (1 << 63);
    Hash(Value1) == Hash(Value2);
    As I said, the circumstances would be unlikely.

    I still don't like the idea that any bit could ever be excluded from changing the computed hash.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  10. #10
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,226
    I have read your example multiple times, and still don't think I get it.

    Are you essentially trying to say Table[3][1 << 15] can happen to be equal to Table[3][0]? Because that's the only case the 2 Fragments can be equal, assuming the other entries of the table are the same.

    Isn't that essentially the same as finding a collision in the 3-words hash function (that generates Fragment)?

    Does this property make collisions more likely? Since all the table entries are randomly generated, I would think that makes this function more uniformly distributed, not less?

  11. #11
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Out of curiosity, is this too many instructions for you?
    Code:
    int hash( int k ) 
    {
        k *= 357913941;
        k ^= k << 24;
        k += ~357913941;
        k ^= k >> 31;
        k ^= k << 31;
    
    
        return k;
    }

  12. #12
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,226
    Quote Originally Posted by MutantJohn View Post
    Out of curiosity, is this too many instructions for you?
    Code:
    int hash( int k ) 
    {
        k *= 357913941;
        k ^= k << 24;
        k += ~357913941;
        k ^= k >> 31;
        k ^= k << 31;
    
    
        return k;
    }
    That seems fast enough (extended to 64-bit). What's the story behind that?

  13. #13
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Oh, it's just some code I found from the authors of gFlip3D so I don't know much about it but it seems quite condensed. I also trust the authors' wisdom. Their other code is pretty good so I'm trusting that this as well.

    This is the code in pure source form :
    Code:
    // originally CUDA code
    __forceinline__ __device__ float hash( int k ) 
    {
        k *= 357913941;
        k ^= k << 24;
        k += ~357913941;
        k ^= k >> 31;
        k ^= k << 31;
    
    
        return int_as_float( k ); 
    }
    Source : 3D Delaunay Triangulation on the GPU

    The code is in KerPredicates.cu of gDel3D.

    Though I'm not sure about it being originally intended for floating point numbers. I don't know if that'd make a good integer hash. I guess the only way would be to test it.

  14. #14
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,226
    Quote Originally Posted by MutantJohn View Post
    Oh, it's just some code I found from the authors of gFlip3D so I don't know much about it but it seems quite condensed. I also trust the authors' wisdom. Their other code is pretty good so I'm trusting that this as well.

    This is the code in pure source form :
    Code:
    // originally CUDA code
    __forceinline__ __device__ float hash( int k ) 
    {
        k *= 357913941;
        k ^= k << 24;
        k += ~357913941;
        k ^= k >> 31;
        k ^= k << 31;
    
    
        return int_as_float( k ); 
    }
    Source : 3D Delaunay Triangulation on the GPU

    The code is in KerPredicates.cu of gDel3D.

    Though I'm not sure about it being originally intended for floating point numbers. I don't know if that'd make a good integer hash. I guess the only way would be to test it.
    Ah! Good old CUDA.

    The constants are 0x155555..., so it basically shifts the number by all even amounts, then xor them together. And to make sure the lower bits aren't too static, the upper bits are mixed into lower bits.

    I suppose that could work.

    Will just have to verify experimentally that this doesn't cause too many collisions.

  15. #15
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Does this property make collisions more likely?
    O_o

    Let me try a different approach to explaining the reason behind my comment: despite having a 64bit result, you don't really have a 64bit hash. The 64bit result of each component is just an arbitrarily lengthened 16bit hash. You have four 16bit hashes which you've combined with the `XOR' operation. I'm basically guaranteed to find a collision in 262144 attempts.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Some help developing a super-fast integer test?
    By Sebastiani in forum General Discussions
    Replies: 11
    Last Post: 11-22-2013, 06:03 PM
  2. Integer Hashing Function
    By TimmerCA in forum C Programming
    Replies: 2
    Last Post: 12-18-2010, 10:36 AM
  3. Changing a single bit of an integer
    By Tiago in forum C Programming
    Replies: 8
    Last Post: 04-18-2010, 05:11 AM
  4. Single Entry Single Exit
    By robwhit in forum C Programming
    Replies: 5
    Last Post: 11-11-2007, 01:34 PM
  5. What is hashing?
    By Nutshell in forum A Brief History of Cprogramming.com
    Replies: 8
    Last Post: 04-17-2002, 10:06 PM