Furthermore, it looks like I am already nearly maxing out the amount of memory-level parallelism (9 on Cannon Lake, 7 on Skylake, 5 on Skylark).
That’s a bold claim, since that would imply there isn’t really any room for further speedup! I thought Travis had concluded that Cannon Lake was actually able to sustain more than 20 outstanding L1 misses: https://www.realworldtech.com/forum/?threadid=181499&curpostid=182779.
Measuring the speedup ratio does seem like a good way of estimating the degree of memory parallelism, though. I wondered how much the ordering of the small array would affect the results, so I tried deleting the std::sort at L1141. To my surprise, it seemed to have no effect. I would have thought that searching for nearby values in succession would lead to better caching, and hence faster searches, than looking them up in random order. Is it expected that this would have no effect?
Actually, 20x might be an underestimate on Cannon Lake, so you are correct that I am way off from the maximum on this processor. However, on Skylake, I am not too far off.
nkurz:
L1141 looks like sorting the vector for bsearch. The targets are overwritten with rand() on L1199.
Re sorting the search keys: the top of the implicit search tree should remain in cache even with random ordering, so sorting the search keys might only be expected to help data cache hits in iterations ~11-12 and above. However, the search keys are sparse relative to the size of the sorted array, so there will rarely be any overlap between them at these later iterations. There might be an impact on TLB hits?
If we increase the number of searches by a factor of 100-1000, we should observe more sharing in cache hits. However, with a straight-up sort of the search keys I expect we’ll just end up underutilising the line fill buffers. It probably makes more sense, after sorting, to transpose the search keys into multiple sorted substreams (one per interleaved search).
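To make the transposition concrete, here is a minimal sketch (my own, not from the comment) of one way it might look: sort the keys, give each interleaved lane a contiguous chunk of the sorted order, and emit the keys batch by batch so that each batch holds one key per lane. At any instant the in-flight keys are spread across the array, while consecutive keys within a lane stay close together. The lane count and the divisibility assumption are arbitrary.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: reorder keys so that a batched searcher consuming
// `lanes` consecutive keys per batch sees one key from each of `lanes`
// sorted substreams. Lane i owns the i-th contiguous chunk of the sorted
// keys, so successive searches in a lane can reuse cache lines, while the
// keys in flight at any moment target distinct parts of the array.
std::vector<uint32_t> transpose_sorted_keys(std::vector<uint32_t> keys, size_t lanes) {
    std::sort(keys.begin(), keys.end());
    const size_t per_lane = keys.size() / lanes;  // assume divisible, for simplicity
    std::vector<uint32_t> out(per_lane * lanes);
    for (size_t step = 0; step < per_lane; ++step)
        for (size_t lane = 0; lane < lanes; ++lane)
            out[step * lanes + lane] = keys[lane * per_lane + step];
    return out;
}
```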
Nathan Kurz says:
Thank you, I was indeed lost with what was happening with that line. I’m still somewhat lost, but perhaps getting closer to understanding.
Jon Stewart says:
Neat! Two other improvements for large arrays:
Cache the midpoints (corresponding to the first few levels of a binary search tree) in a small block, thereby reducing the number of memory accesses.
If the values are evenly distributed (e.g., they are hash values), one can scan the array and record the maximum error from the expected position. For example, NIST’s NSRL hash set has ~34M unique SHA-1 values, requiring close to 700MB of memory to store as a sorted binary array. However, the maximum error is not that great: empirically, any hash value is within 16KB of its expected position. Although it takes a few more cycles to compute the expected position, the reduced window size allows roughly double the throughput of a naive binary search over the whole array.
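As an illustration of the second idea, here is a minimal sketch assuming 64-bit key prefixes (rather than full SHA-1 values) that are roughly uniform; the maximum error is measured with a one-time scan, and the function and variable names are mine, not from the comment. `__uint128_t` is a GCC/Clang extension used to avoid overflow.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One-time scan: how far can an element sit from the position predicted by
// assuming the keys are uniform over the full 64-bit range?
size_t measure_max_error(const std::vector<uint64_t>& sorted) {
    size_t worst = 0;
    for (size_t i = 0; i < sorted.size(); ++i) {
        size_t expected = (size_t)(((__uint128_t)sorted[i] * sorted.size()) >> 64);
        worst = std::max(worst, i > expected ? i - expected : expected - i);
    }
    return worst;
}

// Binary search restricted to the small window around the expected position.
bool bounded_search(const std::vector<uint64_t>& sorted, uint64_t key, size_t max_error) {
    size_t expected = (size_t)(((__uint128_t)key * sorted.size()) >> 64);
    size_t lo = expected > max_error ? expected - max_error : 0;
    size_t hi = std::min(expected + max_error + 1, sorted.size());
    while (lo < hi) {                       // plain binary search, tiny range
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] < key)      lo = mid + 1;
        else if (sorted[mid] > key) hi = mid;
        else                        return true;
    }
    return false;
}
```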
Greg Maxwell says:
“It is hard to drastically improve on the binary search if you only need to do one.”
Not so – if the data is well distributed (as is usually the case if you’re seeking on a hash, for example), then linearly interpolating will be dramatically faster than bisection: if the point you seek has a value 1/3 of the way between your current bounds, you probe 1/3 of the way in. Care should be taken to switch back to bisection if interpolation isn’t converging, to avoid a linear worst-case cost. There are simple procedures that guarantee a worst case only a small factor more probes than bisection, but that on well-conditioned data converge MUCH faster.
Similarly, when the bounds are close enough, switching to a linear scan can be faster than continuing to bisect.
The downside, of course, is that batching like you suggest works better with bisection, because it’s easy to arrange lookups to share probe locations.
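For concreteness, here is a minimal sketch of the scheme described above, under my own choices: alternate interpolation probes with bisection probes (so the worst case costs at most roughly twice the probes of pure bisection), and finish with a linear scan once the interval is small. The 16-element cutoff and the use of `__uint128_t` (a GCC/Clang extension) are assumptions, not anything from the comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Interpolation search with a bisection guard and a linear-scan finish.
// Assumes `a` is sorted in ascending order.
bool interpolation_search(const std::vector<uint64_t>& a, uint64_t key) {
    size_t lo = 0, hi = a.size();            // candidate range is [lo, hi)
    bool interpolate = true;
    while (hi - lo > 16) {
        uint64_t vlo = a[lo], vhi = a[hi - 1];
        if (key < vlo || key > vhi) return false;
        if (vlo == vhi) return true;         // all values equal, and key is in range
        size_t mid;
        if (interpolate) {
            // Probe where the key "should" be if values were evenly spread.
            __uint128_t num = (__uint128_t)(key - vlo) * (hi - lo - 1);
            mid = lo + (size_t)(num / (vhi - vlo));
        } else {
            mid = lo + (hi - lo) / 2;        // guard step: plain bisection
        }
        interpolate = !interpolate;
        if (a[mid] < key)      lo = mid + 1;
        else if (a[mid] > key) hi = mid;
        else                   return true;
    }
    for (size_t i = lo; i < hi; ++i)         // small interval: linear scan
        if (a[i] == key) return true;
    return false;
}
```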
All that neat and clean CS theory is worth less and less with modern architectures and real-world datasets. Even with the simplest of algorithms you have to think about questions such as: are your searches mostly hits or misses? Are your keys evenly distributed or not? Which parts of your data structures might be hot and which are cold? Can you increase the benefits of locality without much overhead? All possible combinations of these and many other parameters affect how an algorithm should be adapted to the architecture deep below the simple-looking ISA model, and probably even create feedback loops with the approaches you want to take.
Sometimes all this makes one feel that there has to be a good, but currently unknown, approach (instead of plain black magic) to creating efficient solutions: solid engineering backed up by good theory. Yet I’m not aware of one, and a question like “how can we make binary search efficient?” is a bit like construction engineers treating the proper use of a nail as a mystery.
If I have a single binary search to perform, am I better off interleaving it with 31 other, unimportant binary searches? I’m a bit confused about the implications.
Recently this was posted on HN: https://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/ , which also has some interesting observations on binary search.
This (and other forms of interleaving for algorithms dominated by memory-access costs) can be implemented easily using coroutines: http://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf.
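To illustrate, here is a minimal C++20 sketch of the idea (my own, loosely following the paper’s approach rather than its code): each binary search is a coroutine that prefetches its next probe and suspends, and a trivial round-robin driver keeps a handful of searches in flight so their cache misses overlap. The Task type, the 16-lane batch size, and the branchless search loop are all assumptions; `__builtin_prefetch` is a GCC/Clang builtin, and the code needs `-std=c++20`.

```cpp
#include <algorithm>
#include <coroutine>
#include <cstddef>
#include <cstdint>
#include <vector>

// A bare-bones manually resumable coroutine type (no scheduler, no exceptions).
struct Task {
    struct promise_type {
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
    std::coroutine_handle<promise_type> handle;
    explicit Task(std::coroutine_handle<promise_type> h) : handle(h) {}
    Task(Task&& other) noexcept : handle(other.handle) { other.handle = {}; }
    ~Task() { if (handle) handle.destroy(); }
    bool done() const { return handle.done(); }
    void resume() { handle.resume(); }
};

// One branchless-style binary search as a coroutine: prefetch the next probe,
// then suspend so other in-flight searches can issue their own probes while
// this one's cache line is being fetched.
Task search_coro(const std::vector<uint32_t>& data, uint32_t key, size_t& out) {
    size_t lo = 0, len = data.size();
    while (len > 1) {
        size_t half = len / 2;
        __builtin_prefetch(&data[lo + half]);   // start the memory access early
        co_await std::suspend_always{};         // yield to the other lanes
        if (data[lo + half] <= key) lo += half;
        len -= half;
    }
    out = lo;   // index of the last element <= key (assuming data[0] <= key)
}

// Round-robin driver: keep a handful of searches in flight so their misses overlap.
void batched_search(const std::vector<uint32_t>& data,
                    const std::vector<uint32_t>& keys,
                    std::vector<size_t>& results) {
    constexpr size_t kLanes = 16;   // number of concurrent searches (a guess)
    results.resize(keys.size());
    for (size_t base = 0; base < keys.size(); base += kLanes) {
        const size_t n = std::min(kLanes, keys.size() - base);
        std::vector<Task> lanes;
        lanes.reserve(n);
        for (size_t i = 0; i < n; ++i)
            lanes.push_back(search_coro(data, keys[base + i], results[base + i]));
        bool active = true;
        while (active) {            // resume every unfinished search once per round
            active = false;
            for (auto& t : lanes)
                if (!t.done()) { t.resume(); active = true; }
        }
    }
}
```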
Here is the second paragraph from the post: