Two 32-bit hash functions from a 64-bit hash function?
A few years ago, we worked on automatically removing boilerplate text from e-books taken from Project Gutenberg. In this type of problem, you want to quickly identify whether a line of text is likely part of the boilerplate. You could build a hash table, but hash tables tend to use a lot of memory because they must store all the keys. This is especially bad if the keys are long strings.
Can you implement a hash table without storing the keys? You cannot, of course. However, you can learn from Bloom filters: hash functions are sufficient to test quickly whether an element is not part of a set.
Let me consider an example. Suppose that I have three strings (“dad”, “mom” and “dog”) and they have the hash values 2, 4, and 5. Suppose that I want to check whether “cat” is part of the set. Maybe its hash value is 3. In this case, I can immediately see that 3 is not part of the set (because 3 is not in {2,4,5}). Of course, if “cat” hashes to 2, I could falsely conclude that “cat” is part of my set. We can increase the accuracy of this method by using several hash functions.
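To make this concrete, here is a minimal C sketch (my illustration, not taken from the original work) of a set that stores only 32-bit hash values instead of the keys; the function name toy_hash and its FNV-1a constants are arbitrary choices:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical string hash (32-bit FNV-1a); any reasonable hash would do.
static uint32_t toy_hash(const char *s) {
    uint32_t h = 2166136261u;      // FNV-1a offset basis
    for (; *s; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;            // FNV-1a prime
    }
    return h;
}

// Returns false when s is certainly not in the set, true when it might be
// (false positives are possible, exactly as in the "cat" example above).
static bool maybe_in_set(const char *s, const uint32_t *hashes, size_t n) {
    uint32_t h = toy_hash(s);
    for (size_t i = 0; i < n; i++)
        if (hashes[i] == h)
            return true;
    return false;
}
```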
In my e-book scenario, I can check quickly whether a line is not part of the boilerplate. This might be good enough.
There is a catch, however: computing many hash functions is expensive. Thankfully, Kirsch and Mitzenmacher showed that two hash functions are enough. Will Fitzgerald went a step further and hinted that a single hash function might suffice. Indeed, you can compute a 64-bit hash function and make two 32-bit hash functions out of it: use the first 32 bits for the first hash function and the remaining 32 bits for the other. On 64-bit processors, computing a 64-bit hash function might be just as fast as computing a 32-bit hash function. Will’s idea is perfectly sound. However, it might be trickier to implement than one might expect.
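For concreteness, here is what the split looks like in C (a sketch assuming you already have a 64-bit hash value in hand; which half you call the “first” hash is just a convention):

```c
#include <stdint.h>

// Derive two 32-bit hash values from a single 64-bit hash value.
static void split_hash(uint64_t h64, uint32_t *h1, uint32_t *h2) {
    *h1 = (uint32_t)(h64 >> 32);  // one hash: the most significant 32 bits
    *h2 = (uint32_t)h64;          // the other: the least significant 32 bits
}
```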
Let me consider a simple form of Fowler-Noll-Vo hashing: given a 32-bit integer c, the hash value is given by c p mod 2^64 for some prime p. This is a decent hash function: as remarked by Dietzfelbinger et al. (1997), it is almost universal if you pick the prime p at random. But the first 32 bits are not a good hash function. Indeed, simply pick c to be 2^31: all but the last of the first 32 bits will be zero irrespective of p. It also does not help if you apply a bitwise exclusive or (XOR) on c using a random seed before multiplying it by p, that is, if you compute (c ^ y) p: the first 31 bits will be determined by the product of the prime p and the random seed y. I can extend this analysis to string hashing as well.
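Here is a small C program (my own illustration) that exhibits the problem: with c = 2^31 and the 64-bit FNV prime standing in for the random prime p, the lower 32 bits of c p mod 2^64 come out as 0x80000000, and the same value appears for any odd p:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t p = UINT64_C(0x100000001B3);  // 64-bit FNV prime, standing in for a random prime
    uint64_t c = UINT64_C(1) << 31;        // the adversarial input c = 2^31
    uint64_t h = c * p;                    // unsigned multiplication is already mod 2^64
    uint32_t low32 = (uint32_t)h;          // the lower 32 bits of the hash value
    // Only bit 31 can be set: the low 31 bits are zero whatever (odd) p we pick.
    printf("low 32 bits = 0x%08x\n", (unsigned)low32);
    return 0;
}
```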
Hence, even if you have a good 64-bit hash function (say it is almost universal), it may not be true that the first 32 bits form a good hash function. How can you avoid this problem? If your hash function is strongly universal, then the first few bits will indeed form a strongly universal hash function in their own right. But this may not be practical if you require 32-bit hash functions: I do not know how to compute quickly a strongly universal 64-bit hash function using 64-bit arithmetic. You can, however, generate a strongly universal 32-bit hash function using 64-bit arithmetic. Just compute h(x) = (ax+b) div 2^32, that is, keep the most significant 32 bits of ax+b computed in 64-bit arithmetic, where a and b are random 64-bit integers and the value x is a 32-bit integer. Then, you can safely consider the first and the last 16 bits: they will be strongly universal.
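As a minimal sketch, here is that scheme in C, with fixed constants standing in for the random 64-bit integers a and b (in practice you would draw them from a random source):

```c
#include <stdint.h>

// Placeholder constants: a and b should be picked at random (64 bits each).
static const uint64_t a = UINT64_C(0x9E3779B97F4A7C15);
static const uint64_t b = UINT64_C(0xD1B54A32D192ED03);

// Strongly universal 32-bit hash of a 32-bit input: h(x) = (a*x + b) div 2^32.
static uint32_t strongly_universal32(uint32_t x) {
    return (uint32_t)((a * (uint64_t)x + b) >> 32);
}

// The two 16-bit halves of the result can each serve as a hash function.
static void two_hashes16(uint32_t x, uint16_t *h1, uint16_t *h2) {
    uint32_t h = strongly_universal32(x);
    *h1 = (uint16_t)(h >> 16);  // the first 16 bits
    *h2 = (uint16_t)h;          // the last 16 bits
}
```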
So it is simply not true that a 64-bit hash function can be treated safely as two 32-bit hash functions.
Further reading: Daniel Lemire and Owen Kaser, Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010.
Note: I am abusing the language since a single function cannot be universal or strongly universal, but I did not want to talk about families of hash functions.