Daniel Lemire's blog


Fast strongly universal 64-bit hashing everywhere!

10 thoughts on “Fast strongly universal 64-bit hashing everywhere!”

  1. icsa says:

    Is there a bug in lines 35-40 of HashFast.java?

    static long hash64(long x) {
      int low = (int)x;
      int high = (int)(x >>> 32);
      return ((a1 * low + b1 * high + c1) >>> 32)
           | ((a2 * low + b2 * high + c2) & 0xFFFFFFFFL);
    }

    Should 0xFFFFFFFFL be 0xFFFFFFFF00000000L ?

    1. Indeed. Fixed. Thank you.
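For reference, the corrected function can be sketched as a self-contained Java class. The constant values below are illustrative stand-ins for the random draws, not the actual defaults in HashFast.java:

```java
public class Hash64Sketch {
    // Example constants: in practice each (a, b, c) triple is drawn
    // uniformly at random when the hash function is created.
    static final long a1 = 0x65d200ce55b19ad8L, b1 = 0x4f2162926e40c299L, c1 = 0x162dd799029970f8L;
    static final long a2 = 0x68b665e6872bd1f4L, b2 = 0xb6cfcf9d79b51db2L, c2 = 0x7a2b92ae912898c2L;

    static long hash64(long x) {
        long low  = x & 0xFFFFFFFFL; // lower 32 bits, zero-extended
        long high = x >>> 32;        // upper 32 bits
        // Two strongly universal 32-bit multiply-shift hashes, concatenated:
        // the first lands in the low 32 bits, the second in the high 32 bits.
        return ((a1 * low + b1 * high + c1) >>> 32)
             | ((a2 * low + b2 * high + c2) & 0xFFFFFFFF00000000L); // corrected mask
    }
}
```

With the corrected mask, the second term contributes only the high 32 bits, so the two halves of the output no longer overlap.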

  2. Markus says:
  3. Peter McNeeley says:

    64 bit Hash collision
    x = 0xe105df2c79eef65cL;
    x’ = 0x31d1b0f8ede6981fL;

    h(x) = h(x’) == -9096877159865265073


    1. Peter: Strongly universal means that if the constants (a, b, c, …) are generated randomly, then even if I give you x and h(x), you know nothing about h(x’) for any x’ distinct from x. In a sense, this guarantees that collisions are possible: if you knew that h(x) and h(x’) could never collide, the function would not be strongly universal.

      1. Peter McNeeley says:

        Thanks for your response.
        “if you knew that h(x) and h(x’) could not collide”
        This is what I did notice: for x’ = x + dx, where dx is a small integer, there are ZERO (32-bit) collisions, whereas murmur and a random generator produce roughly the same number of collisions.

        1. Peter McNeeley says:

          The default values for a1, b1, c1, a2, b2, c2 produce the effect mentioned. Other specific choices of a1, b1, c1, a2, b2, c2 produce an order of magnitude more collisions for nearby values than would be expected from a random function.
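The kind of experiment Peter describes can be sketched as follows; the hash constants and the `countNearbyCollisions` harness are illustrative, not his actual code. A random function would be expected to produce roughly n / 2^32 collisions on the truncated 32-bit output:

```java
import java.util.SplittableRandom;

public class NearbyCollisions {
    // Illustrative constants standing in for random draws.
    static final long a1 = 0x65d200ce55b19ad8L, b1 = 0x4f2162926e40c299L, c1 = 0x162dd799029970f8L;
    static final long a2 = 0x68b665e6872bd1f4L, b2 = 0xb6cfcf9d79b51db2L, c2 = 0x7a2b92ae912898c2L;

    static long hash64(long x) {
        long low = x & 0xFFFFFFFFL, high = x >>> 32;
        return ((a1 * low + b1 * high + c1) >>> 32)
             | ((a2 * low + b2 * high + c2) & 0xFFFFFFFF00000000L);
    }

    // Count how often hash64(x) and hash64(x + dx) agree on their low 32 bits,
    // over n random inputs x.
    static long countNearbyCollisions(long n, long dx, long seed) {
        SplittableRandom r = new SplittableRandom(seed);
        long collisions = 0;
        for (long i = 0; i < n; i++) {
            long x = r.nextLong();
            if ((int) hash64(x) == (int) hash64(x + dx)) collisions++;
        }
        return collisions;
    }
}
```

Running the same loop with murmur (or `SplittableRandom` itself) in place of `hash64` gives the baseline to compare against.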

  4. Travis Downs says:

    Most languages offer a multiply of two N-bit values into an N-bit result, but actual hardware often offers the full 2N-bit result at little extra cost (for example, on x86, getting the 128-bit product of two 64-bit values has the same latency and throughput as the variants that only give a 64-bit result, although it does cost an extra uop).

    Given that, could you calculate your 64-bit strongly universal hash more directly, using only two multiplications if you had access to the full 128-bit result of a 64 * 64 multiplication?

    If I understand it correctly, you are using the multiply-shift scheme to calculate an N-bit hash of an N-bit input x using: (a*x + b) >> N, where a and b are 2N-bit values and x is N-bit.

    So given N = 64, on most hardware we can do a 64 * 64 -> 128 bit multiplication, but for the a * x part we actually need 128 * 64 -> 128, which could be implemented with two 64 * 64 -> 128 multiplies and a 64-bit add. Then you have two more adds for the + b part, although in principle it seems like adding the bottom half of b is mostly pointless because it only very weakly affects the actual result since the bottom bits are all thrown away [1]. The shift >> 64 is a no-op, since we’re splitting the 128-bit result into two 64-bit registers, so we just directly use the top half.

    So you end up with half the number of multiplies and fewer additions as well.

    The trick, of course, is convincing the language of your choice to generate sensible code under the covers that actually uses the hardware capabilities! From C or C++ this is somewhat easier due to the presence of 128-bit types for most compilers. I’m not sure about Java though.

    [1] Of course, you’d have to walk through the strong universality proof to see if dropping the low-bit addition breaks the guarantee in theory. Perhaps it ends up being strongly 1.0000001-universal or something.
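On the Java question: since Java 18, `Math.unsignedMultiplyHigh` exposes the high 64 bits of an unsigned 64 x 64 product, so the scheme Travis outlines can be sketched directly. The decomposition of the 128-bit constants a and b into (hi, lo) pairs of 64-bit words is an assumption of this sketch:

```java
public class MultiplyShift128 {
    // Computes ((a * x + b) >>> 64) mod 2^64, where a and b are 128-bit
    // constants supplied as (hi, lo) pairs of unsigned 64-bit words.
    static long hash(long x, long aHi, long aLo, long bHi, long bLo) {
        long prodLo = aLo * x;                           // low 64 bits of aLo * x
        long prodHi = Math.unsignedMultiplyHigh(aLo, x); // high 64 bits of aLo * x
        long hi = prodHi + aHi * x;     // aHi * x only affects the high word (mod 2^128)
        long sumLo = prodLo + bLo;      // add the low word of b ...
        long carry = Long.compareUnsigned(sumLo, bLo) < 0 ? 1 : 0; // ... with its carry
        return hi + bHi + carry;        // the >>> 64 shift is free: just keep the high word
    }
}
```

As Travis notes, `aLo * x` and `Math.unsignedMultiplyHigh(aLo, x)` are the two halves of a single widening multiply in hardware, so a good JIT should lower this to two multiplications and a few additions.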

    1. Travis Downs says:

      I guess this kind of shows that there is a parallel between the explicit decomposition approach described in Daniel's post and multi-precision arithmetic.

      I.e., one way to extend the multiply-shift scheme to larger words is to decompose it explicitly into smaller hashes and concatenate the hashes together to get your word-sized hash; another way is to perform the multiply-shift formula once with the full word size and rely on multi-precision arithmetic to calculate the answer when the needed arithmetic exceeds the machine’s word size.

      They end up scaling in about the same way: both are quadratic in the word size (at least as I understand Daniel’s approach) and in fact ultimately produce similar operations.

      The explicit decomposition approach has the advantage that you can perhaps rely on some knowledge of the required result to eliminate or combine operations. For example, Daniel’s approach only uses 4 multiplications while a naive 64 x 128 -> 128 multiplication would need 8 (I think). Essentially the decomposition was able to work around the lack of a way to get the high half of a 64 x 64-bit multiplication in Java.

      The multi-precision arithmetic approach has the advantage of being able to lean on existing multi-precision libraries[1], which may end up being faster (i.e., if they know how to get a 64*64->128-bit result, you cut the multiplications down to 2 as in my suggestion above), and being obviously correct without proof since they use the original multiply-shift-add formula directly.

      [1] Of course if you actually use some big heavy multi-precision library maybe that’s also a downside.
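In Java, the multi-precision route can be written with `BigInteger`. It is slow, but it applies the multiply-shift formula verbatim, which makes it useful as a correctness oracle for faster implementations (the method below is an illustrative sketch, not from the post):

```java
import java.math.BigInteger;

public class MultiplyShiftReference {
    static final BigInteger MASK64 = BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

    // h(x) = ((a * x + b) >> 64) mod 2^64, with a and b as 128-bit BigIntegers.
    static long hash(long x, BigInteger a, BigInteger b) {
        BigInteger ux = new BigInteger(Long.toUnsignedString(x)); // treat x as unsigned
        return a.multiply(ux).add(b).shiftRight(64).and(MASK64).longValue();
    }
}
```

Checking a fast implementation against this reference on random inputs sidesteps having to re-derive the carry handling by hand.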

  5. Philip Reames says:

    Thanks for an interesting article as always.

    This is the type of result which could be very influenced by your choice of compiler and VM. For the heck of it, I went and ran your tests on Azul Zing. Overall, the big picture results are roughly the same, but the percentage difference between the hash32 and murmur scores was a bit bigger (80% on Zing vs 65% on Zulu).

    (a random recent zulu8 snapshot, i.e., OpenJDK)

    Benchmark           Mode  Cnt    Score   Error  Units
    HashFast.fast2_32   avgt   10  235.174 ± 0.125  us/op
    HashFast.fast64     avgt   10  205.715 ± 0.651  us/op
    HashFast.murmur     avgt   10  143.495 ± 0.044  us/op
    HashFast.murmur_32  avgt   10  181.772 ± 0.441  us/op

    (a random recent Zing8 snapshot)

    Benchmark           Mode  Cnt    Score   Error  Units
    HashFast.fast2_32   avgt   10  134.482 ± 0.026  us/op
    HashFast.fast64     avgt   10  122.611 ± 0.832  us/op
    HashFast.murmur     avgt   10   74.235 ± 0.017  us/op
    HashFast.murmur_32  avgt   10   87.856 ± 0.024  us/op

    Results were collected on a skylake-client box I had sitting around. I tested on two other architectures (an ancient AMD, and haswell) and got roughly the same picture (though the absolute scores differed of course).

    Disclaimer: I work for Azul on our compilers, so I’m obviously biased.