15th August 2018, 2 min read

Fast strongly universal 64-bit hashing everywhere!

In software, hashing is the process of taking a value and mapping it to a random-looking value. Suppose you are given 64-bit integers (a long in Java). You might want to “hash” these integers to other 64-bit values.

There are many good ways to achieve this result, but let me add some constraints:

The hashing should be strongly universal, also called pairwise independent. This means that if h is a hash function picked at random, then you want the knowledge that h(x)=y for some x and some y, to give you absolutely no information about the value of h(x') for x' distinct from x. That’s not, as it appears, a security/cryptography issue, but more of a useful constraint for probabilistic algorithms. Indeed, there are many “probabilistic” algorithms that require different, and “independent”, hash functions. You want to be absolutely sure that your hash functions are unrelated. Strong universality is not perfect independence, but it is pretty good in practice.
You don’t want to have large look-up tables occupying your cache.
You want to code that works efficiently in most programming languages (including, say, Java).

My proposal is as follows. Instead of trying to hash the 64-bit values to other 64-bit values directly, we can hash them to 32-bit values. If you repeat twice (using two different hash functions), you get the 64-bit result you seek. All that is needed per hash function are three 64-bit values chosen at random, and then two multiplications, two additions and a single shift. The two multiplications are faster than they appear because they can be executed simultaneously as there is no data dependency. Two get a full 64-bit output, you thus need four multiplications.

// in Java, long is 64-bit, int 32-bit

long a,b,c; // randomly assigned 64-bit values

int hash32(long x) {
  int low = (int)x;
  int high = (int)(x >>> 32);
  return (int)((a * low + b * high + c) >>> 32);
}

How fast is it? We can try to compare it against a standard bit-mixing function:

long murmur64(long h) {
  h ^= h >>> 33;
  h *= 0xff51afd7ed558ccdL;
  h ^= h >>> 33;
  h *= 0xc4ceb9fe1a85ec53L;
  h ^= h >>> 33;
  return h;
}

This bit-mixing function is “obviously” faster. It has half the number of multiplications, and none of the additions. However, in my tests, the difference is less than you might expect (only about 50%). Moreover if you do need two 32-bit hash values, the 64-bit mixing function loses much of its edge and is only about 25% faster.

My code is available so you can run your own Java benchmark.

How do I know that this hashing function is, indeed, strongly universal? We wrote a paper about it: Strongly universal string hashing is fast. At the time, we were interested in hashing strings. The trick is to view a 64-bit word as a string of two 32-bit words. We could extend the same trick to 128-bit inputs or, indeed, inputs of any length.