Yes. Some people report using 64-bit hash values as identifiers and they sometimes believe that it makes collision highly improbable. That’s true when dealing with tiny sets, but as I demonstrate, for many practical cases, collisions are actually quite likely.
So, yes, this post is motivated by my discussions with people building systems.
Randomly generated UUIDs, which are 122 bits, are less problematic: the probability of getting a duplicate is much lower than the probability of being hit by a meteorite (I calculated that some time ago and put it on Wikipedia, but somebody edited it away).
128-bit values are even better. However, there is a risk if you generate them with a “lower quality” hash function (not sure what low quality is… Murmur hash seems OK, SipHash is better I think). If an attacker can supply the data, then MD4, MD5, and even SHA-1 are risky: the attacker could use specially crafted data that will result in a hash collision. So, for this case, better use SHA-256.
John the Scottsays:
what about a 160 bit hash built by ripemd(sha256), i.e, bitcoin?
And that is why I always choose sequential id assignment instead of random 🙂
This is, however, an interesting idea. What would be the probability to collide twice in a row if whenever we find a collision we retry and generate a new random 64-bot id?
Interesting! I’d be curious to know why you wrote a post about that. Did you encounter such a problem in your own work?
Yes. Some people report using 64-bit hash values as identifiers and they sometimes believe that it makes collision highly improbable. That’s true when dealing with tiny sets, but as I demonstrate, for many practical cases, collisions are actually quite likely.
So, yes, this post is motivated by my discussions with people building systems.
And with 64-bits, they are easy to find! I did this exercise recently with FNV-1a: https://twitter.com/pstbrn/status/1201044024892829697
(This is hash collisions, not random identifiers. But finding them used the same principle – birthday attack.)
Randomly generated UUIDs, which are 122 bits, are less problematic: the probability of getting a duplicate is much lower than the probability of being hit by a meteorite (I calculated that some time ago and put it on Wikipedia, but somebody edited it away).
128-bit values are even better. However, there is a risk if you generate them with a “lower quality” hash function (not sure what low quality is… Murmur hash seems OK, SipHash is better I think). If an attacker can supply the data, then MD4, MD5, and even SHA-1 are risky: the attacker could use specially crafted data that will result in a hash collision. So, for this case, better use SHA-256.
what about a 160 bit hash built by ripemd(sha256), i.e, bitcoin?
160-bit is perfectly fine.
There is a huge difference between 64-bit and 160-bit.
And that is why I always choose sequential id assignment instead of random 🙂
This is, however, an interesting idea. What would be the probability to collide twice in a row if whenever we find a collision we retry and generate a new random 64-bot id?