Thanks for the article – indeed, interpolation search is not something I usually think of when searching in sorted lists. I should probably change that.
However, there is one major downside that you did not mention: if the data you search in is generated by a second party (say, a user), an adversary can easily construct worst-case input, and I believe such input can force the algorithm to visit every single element in the array. Of course, in that case the initial assumption that you know the distribution of the values no longer holds.
Jon Stewart says:
Howdy, I came up with a variant of interpolation search for when the data set is immutable: use the expected slot as the midpoint of a binary search, but choose the low and high guards based on the precomputed maximum error of any item in the set (i.e., its distance from its expected slot). In digital forensics/information security, we often have large sets of cryptographic hash values (MD5/SHA-1/SHA2-256) from previously examined files and simply want to know whether a provided hash value is in the set. Since these sets are largely immutable, we just write the hash values as sorted binary, with a simple file header that stores the determined error. The first hash value then begins at offset 4096 and the file can be memory-mapped for binary search.
This supports at least twice as many lookups per second as a naive binary search. With a real-world hash set that’s about a GB in size, the maximum error has only an ~8KB radius, so we only need to touch a couple of pages right next to each other. And… the implementation is pleasingly simple.
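A minimal sketch of that lookup, assuming 64-bit key prefixes drawn from a near-uniform distribution (as cryptographic hashes are) and a precomputed max_error; the function and variable names are mine, not the actual implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: binary search restricted to a window of +/- max_error
// around the slot predicted by interpolation. max_error is the largest
// distance between any key's actual slot and its expected slot, computed once
// when the (immutable) sorted file is written.
// (__uint128_t is a GCC/Clang extension, used here to avoid overflow.)
bool bounded_search(const std::vector<uint64_t>& keys, uint64_t target,
                    size_t max_error) {
    const size_t n = keys.size();
    if (n == 0) return false;
    // expected slot under a uniform distribution over the full 64-bit range
    const size_t expected =
        static_cast<size_t>((static_cast<__uint128_t>(target) * n) >> 64);
    // clamp the guards to [expected - max_error, expected + max_error]
    const size_t lo = expected > max_error ? expected - max_error : 0;
    const size_t hi = std::min(expected + max_error + 1, n);
    // ordinary binary search inside the narrowed window
    return std::binary_search(keys.begin() + lo, keys.begin() + hi, target);
}
```

With the ~8KB error radius mentioned above, the window spans only a page or two of the memory-mapped file, so the whole lookup touches a handful of adjacent cache lines.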
use the expected slot as the midpoint of a binary search, but choose the low and high guards based on the precomputed maximum error of any item in the set (i.e., its distance from its expected slot).
That’s an interesting proposal.
A few years ago, AppNexus used static interpolation search (https://dl.acm.org/doi/abs/10.5555/982792.982870) for range queries in large sorted arrays (one of the heaviest users at the time was a static variant of Chazelle’s filtered search for 1D stabbing queries). Ruchir Khaitan gave a talk about the static interpolation search itself at !!Con 2017: https://www.youtube.com/watch?v=RJU2cXvQ9po
I converted your implementation to Python.
github gist
Thank you.
I’d also recommend reading “Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search”
http://pages.cs.wisc.edu/~chronis/files/efficiently_searching_sorted_arrays.pdf
It summarizes a few practical tricks for running the search quickly, including a 3-point interpolation technique that fits non-linear distributions better.
3-point interpolation can be useful as a search acceleration method when you’re trying to approximate a complex function; my library, meshoptimizer, uses it in one of its algorithms instead of a binary search to find the optimal parameter, which gives faster convergence.
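The general idea behind a 3-point guess can be sketched as fitting the index through three (key, index) probes instead of drawing a straight line between two endpoints; the exact formulas used in the paper and in meshoptimizer may differ, so treat this only as an illustration:

```cpp
// Hedged sketch of 3-point (Lagrange quadratic) interpolation: given three
// probes (x, y) mapping a key x to its index y, estimate the index of key k.
// A curved fit tracks skewed distributions better than the linear guess.
double three_point_guess(double k,
                         double x0, double y0,   // left probe   (key, index)
                         double x1, double y1,   // middle probe (key, index)
                         double x2, double y2) { // right probe  (key, index)
    const double l0 = (k - x1) * (k - x2) / ((x0 - x1) * (x0 - x2));
    const double l1 = (k - x0) * (k - x2) / ((x1 - x0) * (x1 - x2));
    const double l2 = (k - x0) * (k - x1) / ((x2 - x0) * (x2 - x1));
    return y0 * l0 + y1 * l1 + y2 * l2;  // estimated index for key k
}
```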
Thank you for the reference.
Interpolation search, including automatic fallback to bisection, has been used for a long time in optimization!
For example, https://en.wikipedia.org/wiki/Brent%27s_method (available in scipy, Boost, etc.) has careful conditions for when to use interpolation or bisection, which improves worst-case behaviour.
Thanks, John.
I used to teach numerical analysis so I am familiar with such techniques, at least at a high level, but I was focusing on the search problem within arrays, as a data structure.
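Applied to the array-search setting, the same safeguard might look roughly like this: a minimal sketch that alternates an interpolation probe with a plain bisection step (these are not Brent’s exact acceptance conditions, and the names are mine):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged sketch: guess a position by linear interpolation, but take a plain
// bisection step every other iteration, so the interval is halved at least
// every two steps and the worst case stays logarithmic even on adversarial
// data. (__uint128_t is a GCC/Clang extension, used to avoid overflow.)
std::ptrdiff_t guarded_interpolation_search(const std::vector<uint64_t>& a,
                                            uint64_t key) {
    size_t lo = 0, hi = a.size();          // search the half-open range [lo, hi)
    bool interpolate = true;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;   // bisection default
        const uint64_t alo = a[lo], ahi = a[hi - 1];
        if (interpolate && alo < key && key < ahi) {
            // linear-interpolation guess of where key should sit
            mid = lo + static_cast<size_t>(
                      (static_cast<__uint128_t>(key - alo) * (hi - 1 - lo)) /
                      (ahi - alo));
        }
        if (a[mid] < key)      lo = mid + 1;
        else if (a[mid] > key) hi = mid;
        else                   return static_cast<std::ptrdiff_t>(mid);
        interpolate = !interpolate;        // alternate the two kinds of steps
    }
    return -1;  // not found
}
```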
Adrian Colyer recently featured a learned sorting algorithm that is fancy interpolation. It doesn’t actually learn the algorithm, just the distribution of the data.
The authors don’t seem to understand strings though — they use some kind of vector representation instead of the natural embedding of binary strings as numbers in [0,1).
On second reading, they do encode strings as numbers, but they also appear to do a sample sort to smooth out the distribution, so it’s not clear how important the numeric model is.
FWIW many data warehouse products use some form of block range index that amounts to a piecewise specification of the distribution. It seems more robust against clumpy distributions.
Michael says:
The comparison you make between a hash table and a sorted array seems flawed to me. I agree that the array is an easier way to traverse the data in order, and I would need to think about the merging distinction more before drawing a conclusion.
Regarding hash table implementations using more memory to gain performance, what is this in regard to? I can think of a few meanings:
If you are referring to storing the hash to quicken the comparison, that can be eliminated.
If you are referring to leaving slots open for new items, that implies that the dataset is mutable. I assume that mutable sorted-array implementations have much worse write performance if they don’t leave slots open: copying the entire array to add an item is linear. Linear-complexity writes, although most likely much slower, could also be achieved in an exact-size-array hash table. Given that the extra space is typically included to improve write performance (and to slightly improve scan performance), I suppose that an implementation could eliminate the empty slots if memory usage is more of a concern.
If you are referring to an immutable “perfect” hash table, given enough time, is it not possible to find a hash algorithm that uniquely associates each item with a slot in an exact-size array? The only memory overhead would be constant-sized inputs into the hash algorithm.
I’m interested in understanding which distinction you mean and, if it’s as substantial as you originally meant, how so.
Thanks!
The comparison you make between a hash table and a sorted array seems flawed to me. I agree that the array is an easier way to traverse the data in order, and I would need to think about the merging distinction more before drawing a conclusion.
Have you tried implementing code that merges two sorted arrays and benchmarking it against code that merges two distinct hash tables?
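For concreteness, the sorted-array side of that comparison is a single sequential pass (a minimal sketch; merging two hash tables instead means re-hashing and re-inserting every element of one table into the other):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Merge two sorted arrays of keys in one linear, cache-friendly pass.
std::vector<uint64_t> merge_sorted(const std::vector<uint64_t>& a,
                                   const std::vector<uint64_t>& b) {
    std::vector<uint64_t> out;
    out.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(),
               std::back_inserter(out));
    return out;
}
```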
Regarding hash table implementations using more memory to gain performance, what is this in regard to?
My statement is ‘you should be mindful that many hash table implementations gain performance at the expense of higher memory usage’. If you disagree, then you have to demonstrate that not many hash table implementations use a lot of memory. You can use, say, an unordered_set in C++ or a HashSet in Java. To store integers, you will need tens of bytes per entry.
Travis Downs says:
You can use, say, an unordered_set in C++ or a HashSet in Java. To store integers, you will need tens of bytes per entry.
It is true that specifically unordered_set and HashSet use a lot of memory per entry and their practical performance often suffers because of it, but I would argue that this is not “fundamental” to hash tables and is not really related to the memory/performance tradeoff that hash tables make.
Rather, in the case of Java it is related to a type-erased generics implementation which makes all collection types inefficient for primitives; only primitive arrays like int[] escape this penalty. If you care, you can use per-primitive-type specialized hash sets which use the underlying primitive array type. I would consider this a Java quirk (it also applies in some other languages) and nothing to do with hash sets per se.
Similarly, C++ made the unfortunate decision to effectively require a node-based design for its hash sets: a dynamically allocated node per element. Very unfortunate, but not an intrinsic property of hash tables. There are plenty of good C++ hash sets/maps out there without this problem.
Now, hash sets/maps do make a space/speed tradeoff: keeping the load factor below 1 wastes some space. However, this penalty is usually less than 2x, i.e., it should take less than 8 bytes to store a 4-byte integer in most implementations, which is much lower than the unfortunate implementations above would suggest.
Tobin Baker says:
If you squint at them the right way, open-addressed hash tables like Robin Hood or Bidirectional Linear Probing [1] are nothing but sorted arrays (sorted by exact hash value). That makes merging two hash tables trivial (and it also means that a lookup in such a hash table is essentially just an interpolation search).
[1] https://www.semanticscholar.org/paper/Ordered-Hash-Tables-Amble-Knuth/174468273bbb4b13b47bf5b0055845ac33254ab7
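To illustrate the point (ignoring empty slots and the details of any particular Robin Hood layout; the names here are mine), a lookup in an array kept sorted by hash value starts exactly where interpolation search would guess and then walks a few neighbouring slots:

```cpp
#include <cstdint>
#include <vector>

// In a table ordered by hash value, the "home" slot of a 64-bit hash h is the
// multiply-shift reduction (h * n) >> 64, which is precisely the interpolation
// guess for a uniformly distributed key. With uniform hashes the displacement
// from that slot is small, so the scan below touches only a handful of entries.
bool contains(const std::vector<uint64_t>& sorted_hashes, uint64_t h) {
    const size_t n = sorted_hashes.size();
    if (n == 0) return false;
    size_t i = static_cast<size_t>((static_cast<__uint128_t>(h) * n) >> 64);
    while (i + 1 < n && sorted_hashes[i] < h) ++i;  // walk forward toward h
    while (i > 0 && sorted_hashes[i] > h) --i;      // or back up if we overshot
    return sorted_hashes[i] == h;
}
```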
I think the typical stumbling point for the interpolation search approach is that the stored distribution isn’t known or easy to approximate (with sufficient precision). Hash-like keys might make sense in some cases, I guess. In theory one can do something like what is explained at the end of section 3.4 in http://v.cunat.cz/theses/master.pdf but perhaps getting the (space) constants low could be difficult, though full dynamicity might be worth that cost.
Maybe if the rate of changes were low, some dynamization of the static array would be worthwhile. Perhaps just keeping gaps (filled with copies of one of the two adjacent keys), as there’s a simple way [1] of keeping even worst-case amortized moves down to a log-squared factor, which might be OK-ish in practice. Moreover, if the distribution is not changing (much) over time, note that insertions are expected to be uniformly distributed within the array (say, the array index is roughly uniformly distributed, at least if I disregard any gaps or if those are uniform).
[1] Bender et al.: Two Simplified Algorithms for Maintaining Order in a List
My text in the previous link also contains lots of other theoretical stuff related to interpolation search (for anyone interested). My conclusion overall was that these techniques don’t pay off in practice. Perhaps some variation on that special case above, but otherwise… I expect it would have to be some veeery special use case.