> Why aren’t programming languages adopting random hashing?
There are many reasons to use deterministic hashing besides making a language easier to debug.
I recommend reading this extensive comment in the python source tree about its choice of hash function: http://hg.python.org/cpython/file/12de1ad1cee8/Objects/dictobject.c#l34
Basically, in a hash-based language like Python, speed is absolutely critical, so a simple deterministic function beats a random one. They also chose a hash function with good memory performance in common cases.
It seems to me that a better solution than “use randomized hashing”, which has many downsides for *everyone*, is this: if you are accepting possibly malicious input, check it before hashing.
It’s more work for the implementer of the program, but it doesn’t penalize absolutely everyone else. (Think of the poor scientist whose numpy program would slow down because Django wanted randomized hashing. 🙂)
Also, it’s notable that the implementers of the algorithm implicitly acknowledge that it’s vulnerable to an attack of this sort: “it’s extremely unlikely hash codes will follow a 5*j+1 recurrence by accident”.
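For readers wondering what that recurrence is, the quoted CPython comment is describing the open-addressing probe sequence its dictionaries use: roughly j = 5*j + 1, plus a perturbation drawn from the remaining bits of the hash, all taken modulo the power-of-two table size. Below is a minimal sketch of that probe loop; it is written in Java purely for illustration, and the names and simplified bookkeeping are mine, not CPython’s code.

```java
// Illustrative sketch of a CPython-style probe sequence (not actual CPython code).
final class ProbeSketch {
    // Assumes keys.length is a power of two and the table is never completely full.
    static int findSlot(long hash, Object[] keys, Object key) {
        int mask = keys.length - 1;
        long perturb = hash;
        int j = (int) (hash & mask);
        while (keys[j] != null && !keys[j].equals(key)) {
            // The recurrence quoted above: j = 5*j + 1, with extra bits of the
            // hash ("perturb") mixed in so the probe path eventually depends
            // on the full hash value, not just its low bits.
            j = (int) ((5L * j + 1 + perturb) & mask);
            perturb >>>= 5;
        }
        return j; // either an empty slot or the slot already holding the key
    }
}
```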
Great bit of information. Thanks for summarizing!
It seems to me that this should be handled at the application level, rather than at language design. Security isn’t really a concern for many users of the language (such as those in the research community who mostly use Java for building prototypes).
For a commercial application, programmers would also need to do a lot more in order to be truly resistant to all types of DoS attacks. So overriding the default hash function doesn’t seem like a lot to ask for.
@Marie
I agree that it is one of countless issues to watch out for.
As it was successfully implemented in Ruby, a mainstream language, it is unclear why it cannot be implemented in other languages like Java.
As far as I can tell, it would be fairly difficult for a user to override String.hashCode since the String class is final. You would need to create your own String class or your own collection framework. You can also modify the strings themselves, for example by prepending a seed, but you again pay a price. Most likely, software using Java will use other workarounds than random hashing.
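To make that workaround concrete, here is a minimal sketch of the seed-prepending idea in Java. The helper class, its name, and the choice of a per-run random prefix are my own for illustration, not part of any library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: prepend a per-run random prefix to every key so that
// an attacker cannot predict the hashCode of the strings actually stored.
final class SeededKeys {
    // Chosen once per application run.
    private static final String SEED =
            Long.toHexString(ThreadLocalRandom.current().nextLong());

    static String wrap(String key) {
        return SEED + key;
    }
}

class SeededKeysDemo {
    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        // Every insertion and lookup must go through the same wrapper, and the
        // extra allocation and copying is the price mentioned above.
        counts.put(SeededKeys.wrap("the"), 1);
        System.out.println(counts.get(SeededKeys.wrap("the"))); // prints 1
    }
}
```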
@Bill
An interesting link, thanks! I think it makes plenty of sense not to sacrifice too much performance for security.
I wonder, however, how much of a price Ruby is paying, if any, for its random hash function.
Is there a performance price at all? If I am reading it correctly, a “random hash function” would only replace that 31 with a new random value chosen every time the *application* is started. So for a large application the overhead would be negligible.
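As a rough sketch of what that would mean in Java terms, the recurrence is the same as String.hashCode (h = 31*h + c), except the constant is drawn once when the application starts. The class and method names below are mine, purely for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a "randomized multiplier" string hash: same per-character work as
// the deterministic version, but the multiplier is picked once per run.
final class RandomizedStringHash {
    // An odd multiplier chosen at application start.
    private static final int MULTIPLIER =
            ThreadLocalRandom.current().nextInt() | 1;

    static int hash(CharSequence s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = MULTIPLIER * h + s.charAt(i);
        }
        return h;
    }
}
```

The per-character cost is identical to the deterministic version, which is why the overhead should be negligible; whether a single random multiplier is enough to stop a determined attacker is a separate question.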
Of course, having 4 strings colliding will not disrupt hash tables. But because the hash function is iterated, we can multiply the number of collisions: the hash of a concatenation depends only on the hashes and lengths of its pieces, so any two same-length concatenations of these four colliding strings will also collide. This means that you can construct 16 strings of length 6 that all collide, and keep going to 64 strings of length 9, and so on.
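A quick way to see this multiplication effect is to use Java’s own String.hashCode with the well-known colliding pair “Aa” and “BB” (two base strings rather than the four above, so the counts grow as 2^k instead of 4^k):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CollisionMultiplication {
    public static void main(String[] args) {
        // "Aa" and "BB" have the same String.hashCode (both hash to 2112).
        List<String> blocks = List.of("Aa", "BB");

        // Build every concatenation of k blocks; all of them collide.
        List<String> current = new ArrayList<>(blocks);
        for (int k = 2; k <= 4; k++) {
            List<String> next = new ArrayList<>();
            for (String prefix : current) {
                for (String block : blocks) {
                    next.add(prefix + block);
                }
            }
            current = next;
            Set<Integer> hashes = new HashSet<>();
            for (String s : current) {
                hashes.add(s.hashCode());
            }
            // 2^k strings, only one distinct hash value among them.
            System.out.println(current.size() + " strings of length "
                    + current.get(0).length() + ", distinct hash values: " + hashes.size());
        }
    }
}
```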
Daniel,
I’m sorry but I did not grasp this part. Can you please explain it more?
<>
Well, and perl – perl’s hashes don’t randomly permute by default, but they do if a given hash chain gets too long. See “perldoc perlsec” for any perl newer than 5.8.1 – all hashes were randomized in 5.8.1, but for compatibility in 5.8.2 that was changed so that only hashes which are “behaving badly” get randomized.
(Whoops, I was trying to quote your “Alas, the only programming language to adopt random hashing was Ruby.” line, but it looks like it got swallowed at the top of comment 7)
@Amey I’ll edit my post to give an example.
@Mike Thanks. I updated the post and I give you credit for the observation.
Thanks a lot Daniel. Appreciate it.
The hash involving 31 is the Knuth hash, right? Or is that 33? In any case, it’s not a top-performance hash anyway, so that’s probably something to fix first.
@Federico picking values other than 31 or 33 will have a performance impact on the hashtable because those hash functions produce far more collisions.
What I’d rather see is two HashMap classes so that ordinary people can still have high performance when necessary. For security reasons, I’d say that HashMap should be randomized and DetHashMap could be a faster deterministic version.
After looking, I must’ve confused Knuth with Kernighan and Ritchie – they’re who I’ve seen credited with 31, and Bernstein with 33.
I believe the note I saw about 31 and 33 being much better was in Sedgewick’s book but I’ll have to check when I get home. Here’s one article that’s picky about using 33, but no data that I see:
http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
I seem to recall reading that 33 and 31 were partly important because they were primes greater than the number of lowercase English characters, which is a common test case for hash functions. There’s some inkling of that in the following link, and it also suggests that you don’t want to use those hashes with numeric data.
http://www.strchr.com/hash_functions
Part of my issue with multiplicative hashes is that short keys tend to clump up. In language processing, that can accidentally lead to many collisions when looking up common words like “a”, “the”, etc., and then kill performance.
@Keith
> The hash involving 31 is the Knuth hash, right?
This type of hash function is used in the Karp-Rabin algorithm, though maybe not with value 31. Could be that Knuth should get credit for it, I don’t know.
> Picking values other than 31 or 33 will have a performance impact on the hashtable because those hash functions produce far more collisions.
Do you have a reference supporting this statement? GCC uses 5.
I looked around but couldn’t find where I’d seen that 31 or 33 were special for multiplicative hashes. I still think I read that somewhere but can’t find it. It could’ve been something in one of my students’ projects, but I don’t have those around anymore.
If I get some free time, I’ll run a test. In the meantime, I’ll weaken my statement: the constant should be a prime, some of those primes may have worse collision behavior than others, and it’d be good to check before blindly using an arbitrary one. Also, for what it’s worth, I’m assuming that we’re only talking about power-of-2 table sizes.
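Such a test could be as simple as hashing a batch of short keys with each candidate multiplier and counting how many distinct buckets they land in for a power-of-two table. Here is a minimal sketch; the key set, the table size, and the multipliers tried are arbitrary choices of mine for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MultiplierTest {
    // Horner-style string hash with a configurable multiplier,
    // the same shape as Java's String.hashCode.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Arbitrary key set: all two-letter lowercase strings.
        List<String> keys = new ArrayList<>();
        for (char a = 'a'; a <= 'z'; a++) {
            for (char b = 'a'; b <= 'z'; b++) {
                keys.add("" + a + b);
            }
        }
        int mask = 1024 - 1; // power-of-two table size

        for (int multiplier : new int[] {5, 31, 33}) {
            Set<Integer> buckets = new HashSet<>();
            for (String key : keys) {
                buckets.add(hash(key, multiplier) & mask);
            }
            System.out.println("multiplier " + multiplier + ": " + keys.size()
                    + " keys fall into " + buckets.size() + " buckets");
        }
    }
}
```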
@Keith
You probably meant “odd” and not “prime”.
Whoops, that was an editing goof on my part – when I wrote the first part (saying that we want primes), I was assuming an API-like hashtable where we might not have control over the size. Then I wrote that I was assuming power-of-two sizes.