@Itman
Good point. I think that hash tables are more common, however.
@glenn
I think that Ruby uses some extra data structures (like a linked list) to achieve order preservation in its hash tables. So they are a bit more than a hash table… Good point though!
@Homer
From a practical point of view, I agree with you. However, exploiting the properties of your hash functions can be a very neat trick whenever it works.
Angelo Pesce says:
@Homer
But you should be excited about your work from time to time; otherwise, what is the point?
True. Most clever tricks are overrated. But they sometimes expand our minds.
Wes Freeman says:
You can also sort your LinkedHashMap/Set (in Java) by key. The sorting operation has a high cost (you basically have to rebuild it in the proper order), but the HashMap/Set would then already be presorted whenever you want to produce, say, an alphabetized list of results. If you have far fewer inserts than ordered listings, it might be worth it.
I’m still not sure I see the practical value of this. Is it for cases where tree-based maps are too slow, but you still need ordered result lists?
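For illustration, a minimal sketch of this rebuild-by-key idea, assuming a plain LinkedHashMap<String, Integer>; the class and helper names here are invented:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class PresortedMapSketch {
    // Rebuild a LinkedHashMap so that its insertion order matches key order.
    // After this one-time cost, iterating the map yields keys in sorted order.
    static <K extends Comparable<K>, V> LinkedHashMap<K, V> sortByKey(Map<K, V> input) {
        // TreeMap iterates entries by key; copying from it fixes the insertion order.
        Map<K, V> ordered = new TreeMap<>(input);
        return new LinkedHashMap<>(ordered);
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Integer> ages = new LinkedHashMap<>();
        ages.put("Smith", 40);
        ages.put("Jones", 35);
        ages.put("Adams", 28);
        // Cheap hash lookups, but iteration follows insertion order: Smith, Jones, Adams.
        LinkedHashMap<String, Integer> presorted = sortByKey(ages);
        // Iteration is now alphabetized: Adams, Jones, Smith.
        presorted.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```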
Angelo Pesce says:
@Wes
Consider the following scenario. You have a massive key-value database. You sometimes, but rarely, need to access the keys sequentially.
So, you basically want a hash table with the ability to scan the keys in order when you must. In the case of my example, you could scan the keys in just 26 buckets and get all the Smiths. Of course, you would need to filter those buckets for the Smiths (as other keys would be there as well).
Having to retrieve from disk only 26 buckets instead of, say, a thousand, could be a great boost.
My point being that you can use locality-preserving hash functions on your keys to make sure your “similar keys” end up clustered in a few buckets.
This is entirely general. If your key is a city, you could use a hash function that hashes nearby cities to nearby buckets. It can be entirely reasonable.
Of course, whether custom hash functions are worth it… well, I guess you have to try it out.
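As a rough sketch of what such a locality-preserving hash could look like, assume 26 buckets keyed on the first letter of the last name (class and method names invented for illustration; a real scheme would pick the key structure to match the queries):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalityPreservingBuckets {
    // A locality-preserving "hash": keys that start with the same letter land in
    // the same bucket, and nearby letters land in nearby buckets.
    static int bucketOf(String lastName) {
        char c = Character.toUpperCase(lastName.charAt(0));
        return (c < 'A' || c > 'Z') ? 26 : c - 'A'; // extra bucket for unusual keys
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String name : new String[] {"Smith", "Smithers", "Jones", "Adams", "Sergeyev"}) {
            buckets.computeIfAbsent(bucketOf(name), k -> new ArrayList<>()).add(name);
        }
        // To find all the Smiths, we only touch the 'S' bucket and then filter it,
        // instead of scanning every bucket in the table.
        List<String> sBucket = buckets.getOrDefault(bucketOf("Smith"), List.of());
        sBucket.stream().filter(n -> n.startsWith("Smith")).forEach(System.out::println);
    }
}
```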
Wes Freeman says:
@Glenn
From my cursory investigation, Ruby doesn’t maintain key order, it maintains insertion order, just like a LinkedHashMap in Java. From http://www.ruby-doc.org/core/classes/Hash.html :
“Hashes enumerate their values in the order that the corresponding keys were inserted.”
Wes Freeman says:
@Angelo
I see: optimizing for the rarer use case without [much] affecting the common one. In massive key-value databases, limiting your buckets, as in the name example, would probably have a negative impact on performance, depending on your collision resolution. I’m tempted to try it out with some tests.
How about nested hashes, where you have larger partial-key buckets first (like last name plus the first letter of the first name), and then full-key buckets within?
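Something like this minimal sketch, assuming the partial key is the last name plus the first initial (class, method, and data names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class NestedHashSketch {
    // Outer map: partial key ("Smith, J") -> inner map keyed on the full name.
    private final Map<String, Map<String, String>> outer = new HashMap<>();

    private static String partialKey(String lastName, String firstName) {
        return lastName + ", " + firstName.charAt(0);
    }

    void put(String lastName, String firstName, String value) {
        outer.computeIfAbsent(partialKey(lastName, firstName), k -> new HashMap<>())
             .put(lastName + ", " + firstName, value);
    }

    // A prefix query only touches one outer bucket instead of the whole table.
    Map<String, String> allWith(String lastName, String firstInitial) {
        return outer.getOrDefault(lastName + ", " + firstInitial, Map.of());
    }

    public static void main(String[] args) {
        NestedHashSketch phoneBook = new NestedHashSketch();
        phoneBook.put("Smith", "John", "555-0101");
        phoneBook.put("Smith", "Jane", "555-0102");
        phoneBook.put("Jones", "Amy", "555-0103");
        System.out.println(phoneBook.allWith("Smith", "J")); // both Smiths, one bucket
    }
}
```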
Itman says:
Tries have comparable (and constant) retrieval times (sometimes even better than hashes). They are also locality-preserving. Perhaps even better are tries whose lower levels are grouped into buckets.
glenn says:
In Ruby 1.9+, hash tables DO preserve key order…
I think that relying on an ordering property like this in code would be intrinsically dangerous. The two big things you have to consider are collisions (and whether the hash handles them internally or externally) and resizing, which would change the distribution of the keys.
If you need both a hash and an ordering, I think it is far better to keep them as separate (but intertwined) data structures. That is, you insert a node into an ordered list and you hash it too (or create a hash index for the list). Otherwise I suspect that you are trying to use a hammer on a screw…
Paul.
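A minimal sketch of that separate-but-intertwined arrangement, assuming a HashMap for point lookups and a TreeSet over the same keys for ordered scans (the class name and type choices are just one possibility):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class HashedAndOrdered<K extends Comparable<K>, V> {
    // The hash table answers point lookups; the tree keeps the keys ordered.
    private final Map<K, V> byKey = new HashMap<>();
    private final TreeSet<K> orderedKeys = new TreeSet<>();

    public void put(K key, V value) {
        byKey.put(key, value);
        orderedKeys.add(key); // both structures are kept in sync on every insert
    }

    public V get(K key) {
        return byKey.get(key); // O(1) expected, no ordering assumption needed
    }

    public Iterable<K> keysInOrder() {
        return orderedKeys; // ordered scan when (and only when) you need it
    }

    public static void main(String[] args) {
        HashedAndOrdered<String, Integer> m = new HashedAndOrdered<>();
        m.put("Smith", 1);
        m.put("Adams", 2);
        m.put("Jones", 3);
        System.out.println(m.get("Jones"));   // 3
        System.out.println(m.keysInOrder());  // [Adams, Jones, Smith]
    }
}
```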
When I was a younger programmer I came up with this really clever trick to solve a difficult programming problem. In my excitement I approached one of the older developers and energetically explained my cleverness.
He looked at me for a long quiet moment, then finally said, “Monkeys are clever.”
The problem with clever tricks is that they tend to come back to haunt you when you least expect it.
Paul.
Hi Angelo,
I don’t think the other programmer was trying to tell me not to be excited, but rather that clever tricks tend to be rather nasty “landmines” placed haphazardly in the code. What we’re looking for in a solution is to encapsulate the problem in a way that lets us move on to larger and more interesting problems. I think that’s also the rationale behind “premature optimization is the root of all evil” (you spend too much time on the small stuff, and make it too hard to reuse later).
I absolutely love programming, always have, and over the years I’ve learned that it’s important to solve problems the right way, not just any way that is quick (or possibly clever). There is a subtle difference there, but one worth noting.
I took a combinatorics class long, long ago and the prof used to always talk about “the obvious first try” in terms of algorithms. It’s essentially the first thing that most people think of when they are presented with the problem. His lesson to us was that we had to get beyond that to get to the really interesting and functional solutions. It’s a lesson that has always stuck with me.
Paul.
Itman says:
@Paul,
I think the trick is not to use clever tricks unless you really, really need them. Sometimes, however, clever tricks are necessary.
@Carr
Sorting in constant time is of course possible:
http://en.wikipedia.org/wiki/Pigeonhole_sort
A monotone hash function from [1..N] to [1..N]? Cool, then it’s trivial to sort N keys in O(N) time and space.
The locality-preserving hash functions aren’t that good; they don’t *perfectly* preserve the order of the keys. But I do wonder if one could use a locality-preserving hash to get a fast-on-average sorting function. For example, a single pass to (loc.-pres.) hash N keys into an array, followed by a quick pass of insertion sort to clean up.
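As a rough sketch of that idea, assume string keys, a crude locality-preserving hash on the first two characters, and an oversized array, followed by insertion sort to clean up (purely illustrative, not a tuned algorithm):

```java
import java.util.ArrayList;
import java.util.List;

public class HashThenInsertionSort {
    // Map a key to an approximate rank using its first two characters.
    // Nearby keys get nearby slot indices, but ties and small inversions remain.
    static int slotOf(String key, int slots) {
        int a = key.isEmpty() ? 0 : Character.toLowerCase(key.charAt(0)) - 'a';
        int b = key.length() < 2 ? 0 : Character.toLowerCase(key.charAt(1)) - 'a';
        int code = Math.max(0, a) * 26 + Math.max(0, b); // 0 .. 675
        return (int) ((long) code * (slots - 1) / 675);
    }

    static List<String> roughlySortedByHash(String[] keys) {
        int slots = 4 * keys.length; // oversize the table to limit collisions
        List<List<String>> table = new ArrayList<>();
        for (int i = 0; i < slots; i++) table.add(new ArrayList<>());
        for (String k : keys) table.get(slotOf(k, slots)).add(k);
        List<String> out = new ArrayList<>();
        for (List<String> slot : table) out.addAll(slot); // compact: nearly sorted
        return out;
    }

    static void insertionSort(List<String> a) {
        // Cheap when the input is already nearly sorted.
        for (int i = 1; i < a.size(); i++) {
            String x = a.get(i);
            int j = i - 1;
            while (j >= 0 && a.get(j).compareTo(x) > 0) {
                a.set(j + 1, a.get(j));
                j--;
            }
            a.set(j + 1, x);
        }
    }

    public static void main(String[] args) {
        String[] keys = {"smith", "adams", "jones", "smythe", "aaron", "zeta"};
        List<String> nearlySorted = roughlySortedByHash(keys);
        insertionSort(nearlySorted);
        System.out.println(nearlySorted); // [aaron, adams, jones, smith, smythe, zeta]
    }
}
```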
Itman says:
Angelo,
This is likely to be possible only if you have extra memory and/or knowledge about the distribution of the data.
F. Carr says:
@Angelo: Yes, I’m aware of that. The interesting thing is that locality-preserving hashes appear to ensure that the number of possible keys N will be the same order of magnitude as the number of actual keys n.
Per Westermark says:
Old article, but one thing to note.
In a hostile world, an order-preserving hash function can be quite dangerous. It is a very good attack vector: an attacker who can control the input data, and hence the bucket mapping, can create a very uneven hash structure, with a potential DoS as the outcome.
In a robust world, many hash implementations go the other route and use a random salt, just so an attacker will not be able to predict the hash mapping without significant runtime analysis of the performance.
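A toy illustration of that salting idea: mix a per-process random seed into the hash so the bucket mapping cannot be predicted from the key alone (the mixing function below is a stand-in, not a vetted design):

```java
import java.security.SecureRandom;

public class SaltedHashSketch {
    // Chosen once per process; an attacker who doesn't know it cannot
    // precompute keys that all collide into the same bucket.
    private static final long SEED = new SecureRandom().nextLong();

    static int bucketOf(String key, int numBuckets) {
        long h = SEED;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 0x100000001b3L; // FNV-style mixing with a secret seed
        }
        return (int) Math.floorMod(h, (long) numBuckets);
    }

    public static void main(String[] args) {
        // The bucket mapping changes from run to run, which is exactly the point.
        System.out.println(bucketOf("Smith", 64));
        System.out.println(bucketOf("Jones", 64));
    }
}
```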