2nd October 2009, 13 min read

Sensible hashing of variable-length strings is impossible

15 thoughts on “Sensible hashing of variable-length strings is impossible”

Yuriy says:

October 2, 2009 at 4:54 pm

Suppose A = {0, 2, 4, 6, …}
and A’ = {1, 3, 5, 7, …}
Their intersection is empty so the h and h’ that produced A and A’, respectively, do not collide on any values.
Daniel Lemire says:

October 2, 2009 at 5:03 pm

@Yuriy I am not saying that any two infinite sets have a non-empty intersection.

I am saying that if you distribute an infinite number of elements into a finite number of buckets, one of the buckets will contain infinitely many elements.

I construct A’ from A, so A’ is a subset of A.
Yuriy says:

October 2, 2009 at 5:26 pm

Yes, one bucket, for h, will contain infinitely many elements. Then one different bucket, for h’, will contain infinitely many elements. And so on. You have |H| infinite-sized buckets.

But when you say that an infinite number of keys collide over all H functions, you are saying that the intersection of the buckets is infinite.

(Or is the statement you are trying to prove “There is always an infinite number of keys that are certain to collide over each h in H”, as opposed to “over all H”?
Daniel Lemire says:

October 2, 2009 at 5:34 pm

Let us restart from the beginning. With one (predetermined) hash function h, you can find one hash value y such that infinitely many strings are mapped to y.

Ok. Now, consider all these strings and only these strings (discard any string that does not satisfy h(s)=y). Then use a *different* hash function f’ and repeat. Once more there will be a hash value y’ such that infinitely many strings will map to y’ (and they also map to y using h).

So far we have infinitely many strings s such that h(s)=y and h(s’)=y’.

Just keep going, H times. Each time your set will grow smaller, but it will remain infinite. Because you only have H hash functions in total, you’ll end up with a set of strings such that they all map to y using the hash function h, they all map to y’ using hash function h’, they all map to y” using hash function h” and so on.

So far so good?

The trick, of course, is that I assume you have a finite number of hash functions, while having infinitely many strings. That’s the secret sauce.
Preston L. Bannister says:

October 3, 2009 at 1:08 am

Clearly, we have very different definitions of the word “sensible”. Especially in combination with the “infinite” anything.

Given the odd usage of “sensible”, I suppose you could use the function:

h(s) = s

Otherwise you are going to have to clarify.
Daniel Lemire says:

October 3, 2009 at 7:52 am

@Preston

The hash function h(s)=s would not be acceptable because it does not hash to the set of integers {1,2,â€¦,b}..
Kevembuangga says:

October 3, 2009 at 9:57 am

the set A={s:h(s)=y} is infinite, that is, you have infinitely many strings colliding

Curious hypothesis about the keys.
Infinitely many distinct keys entails an infinite key size average.
Not terribly realistic…
Mark Reid says:

October 3, 2009 at 6:48 am

I’m not quite sure why you have several comments arguing against your relatively straightforward proof. It’s really not much more than a kind of pigeonhole principle for infinite sets.

It is also reminiscent of part of the proof of the Heine-Borel theorem (closed and bounded implies compact) that “chases” the infinite part of a partition.
lorg says:

October 3, 2009 at 7:33 am

If put into different words:
You showed that there is an infinite number of keys, that for each hash function, will hash to the same value under that hash function.
That is, there exists a set A of keys, such that
for each h in H and for each x1, x2 in A, h(x1) = h(x2).

Yuriy:
This still works, because he didn’t force them to hash to *the same value*.
So your h and h’ do not collide. Instead, they have a same infinite set of colliding keys.

Daniel:
I’m not sure that this makes hashing of infinite number of keys impractical. I will have to give it some more thought.
Thanks for the interesting blog post.
Preston L. Bannister says:

October 3, 2009 at 2:26 pm

@Daniel

So far as I could tell, you did not specify that the set of hashed-to integers was finite. 🙂

Sometimes address-based hashing (if “hashing” is the right word) is the right solution.

Please define “sensible”, in this context.
Julian Hyde says:

October 3, 2009 at 2:45 pm

I agree with Preston. It all depends on your definition of the word ‘sensible’.

There is a math theorem that between any two irrational numbers there are infinitely many rational numbers. It’s true, but it’s not, shall we say, a helpful fact, because there are loads more irrational numbers in the gap than rational numbers. This result is a similar to your theorem: true but misleading.

It’s always possible to be unlucky with your choice of hash function. But it’s fairly easy to choose a hash function where you are very, very unlikely to be unlucky. And for this pragmatist, that is good enough.

I say ‘fairly easy’. It took Sun three major versions of Java to do it. In JDK 1.0 and 1.1 which sampled every n’th character and XORed and this, naturally, had issues.

In JDK 1.2, String.hashCode() used to quit at about the 64th character, presumably to keep the cost of maintaining sets and maps of long strings in check. But I was using the hash map to keep unique names of members in mondrian. Many of these strings had a common prefix, and they would all end up in the same hash bucket. Performance was abyssmal.

The problem was fixed in later versions of the JDK. (I may be wrong about the precise JDK versions. This is from memory.) At the time I seriously considered using a Trie [ http://en.wikipedia.org/wiki/Trie ] , but a web search didn’t turn up any robust Java implementations. It always struck me as strange that this simple and elegant data structure seems to find so few applications in the real world.
Daniel Lemire says:

October 3, 2009 at 8:43 pm

I promise a follow-up blog post where I’ll try to answer all these valid points.
Rasmus Pagh says:

October 5, 2009 at 3:34 am

One point that has not been made clear, I think, is that there are classes of hash functions for which the length of keys needed to generate high-probability collisions is exponential in the number of bits needed to specify the hash function. Thus, you can rest assure if your strings are not too long, even if they are drawn from an infinite set.
Preston L. Bannister says:

October 5, 2009 at 11:41 am

As an aside: played a bit with a Trie implementation in Java over the weekend. After Julian’s comment, wondered how performance of a “fast” Trie might compare to a stock hash table, and this serves as an excuse to play with Tries.

Also – in the context of the prior discussion – Tries can be viewed as a form of hashing without collisions.

My result was not encouraging:
http://bannister.us/weblog/2009/10/05/example-general-purpose-trie-in-java/
Julian Hyde says:

October 5, 2009 at 7:08 pm

Preston,

Interesting. Thanks for doing the experiments. I would have thought that trie insert performance would suck (because of the effort of clearing an array of 16 or 256 pointers), but you found that access performance sucks too.

The case I had in mind was for long strings — recall that my original problem was with strings longer than 64 chars, and JDK 1.2 stopped hashing at the 64th character — so it’s possible that tries have their place if strings are very long, say 1k each.

But you’ve shown that tries are a long way behind hash tables. I think it’s down to poor use of the CPU cache. Resolving a string in a hash table should be only a couple of memory accesses, whereas a trie access might read 3 or 4 or 5 nodes. The memory wait time overwhelms the extra CPU cost of computing a hash code, while the string being scanned is all within the same block of CPU cache.

Looks like tries just aren’t cut out for modern architectures.

Julian