, 15 min read
Bounding the cost of the intersection between a small array and a large array
18 thoughts on “Bounding the cost of the intersection between a small array and a large array”
, 15 min read
18 thoughts on “Bounding the cost of the intersection between a small array and a large array”
In our roaring implementation, we have an
intersectionCountArrayArray, which does exactly what it sounds like, and during some unrelated benchmarking, I noticed that the performance varied significantly depending on whether the first or second array was larger. Looking at the current code, it appears that we got noticably better performance when the larger array is the inner loop. I think I did look at trying to binary-search instead, but ended up leaving it alone for now. Total amounts of data are smallish (<8KB for each array), and having the accesses be sequential seems to improve cache performance enough to make up for reading more bits than are probably necessary. It’s been too long for me to remember the impact of the simple change, but I think that particular op ended up about 20% faster across “random” cases. (It matters less in practice, since N is nearly always under 5.)For two sorted arrays we can use the SVS algorithm see : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.415.2636&rep=rep1&type=pdf
For a two levels structure like Roaring-Bitmaps it might be possible to use intersections with skip intervals like in the TurboPFor “inverted index” demo.
Maybe this is a typo:
“The log of the factorial is almost equal to n log k”
Should have been “k log k”
Ignoring memory hierarchy considerations for the time being, you can approach (log n – (log k!) / k) comparisons by performing a suitably imbalanced binary search. I.e. when searching the large array for the smallest element in the small array with k = 1000, you first compare against the element at position ~0.693 * (n/1000) in the large array, since that comes the closest to dividing the state space evenly. The principle of dividing the state space evenly can be used to estimate all the other optimal binary search indices (this isn’t guaranteed to yield the minimal worst-case number of comparisons, but it should come close enough); and then you can come up with a simple approximation to use to select comparison indices in practice.
More precisely, the approach I just described should do a good job of reducing the average-case # of comparisons. To optimize more for the worst case, you’d want to round the first comparison index to a power of 2, and switch to plain binary search once you’ve found a larger element in the large array.
Thanks. I have now sketched something like what you describe in the blog post.
Ignoring memory hierarchy won’t work in practice. A regular binary search will hit L1 most of the time, as the indices are the same as from the previous binary search. By doing something fancy like binary search of an array subset, we reduce the number of memory lookups, but frequently go to L3 or DRAM instead. L3 is 10x more expensive than L1, DRAM is 50x more expensive.
But we could do something like evenly divide the large array into 1024 fixed subsets. Make a guess which subset contains what you are searching for. If you guessed right, you saved yourself the first 10 steps of binary search. If you guessed wrong, you can merge subsets following buddy-allocator rules. That way you keep reusing the same indices, so they are likely in L1.
Now I wonder how far that approach could be extended. One way to deal with latency is to replace bisection with octasection. I use that approach for git bisection already – if tests take a long time, but you can run them in parallel, octasection is 3x faster while doing 2.3x more work. But bisection is doing a single search. For intersection, you can do multiple searches in parallel and get a speedup without doing 2.3x more work, so probably not a good idea.
But you could just do a streaming lookup of the edges between the 1024 subsections. That removes the guessing. With k=1000, you expect to continue the search in most of those subsections, so effectively you just removed the 10 (fast) lookups at the beginning. In Daniels table, that means 22-10=12 lookups, which seems impressive, but remember that the remaining 12 lookups are likely cache-cold and 10x more expensive. So overall more like a 5% improvement, not 50%.
Ignoring memory hierarchy won’t work in practice.
I agree.
There is another optimization we can copy from various sort-implementations. Things like quicksort and binary search work great when your n is large. But after enough divisions, the remaining set is small and things like bubble sort or linear search can outperform the fancy algorithms. Most sort-implementations eventually fall back to something like bubble-sort when dealing with leaves.
Here the obvious choice would be to stop bisections once you reached 1-2 cachelines and do linear search within the remainder. Or to vectorize the compare, movemask and tzcnt to find the precise index without branches.
If you have 4-byte integers in 64-byte cachelines, you replace the last 4-5 compares with this. Again, benefit is relatively low because we optimize the cheap L1 accesses, not the expensive ones.
If the goal is to replace the expensive accesses, I’d try to do better than binary search. If you have a search window of 100 values between 0 and 1000 and are searching for 20, the assumption would be that you’re looking for the second entry or one very close to that. If the values are typically spread very evenly, you can guess a tight search window and skip a lot of bisections. If the value are clumped with random patterns, you need a wide search window and might revert back to binary search.
You could try to hard-code whether you assume random clumps are even distribution. Or you could do a feedback loop, where future searches are influenced by past searches. If guessing worked well in the past, do more guessing. If not, fall back to binary search.
Hwang and Lin have a similar bound in their paper, although they use a slightly different choice formulation (it appears you allow duplicates). Their lower bound is basically log2((n+k) choose k).
The UB complexity of their algorithm g is similar: log(n/k) and change. It comes from probing the n/k-th element first and binary-searching within that range only.
Interesting reference. Though I have not read it, I have come away with a similar idea.
Right. And it is a bit like what Christopher (another commenter) described.
And N. Kurz also privately alluded to the same idea, months ago.
It’s in Knuth 😉
Of course, it would be in Knuth.
I think you can do this in the same time bound (big O, not exact) by iterating over the smaller array in order and doing exponential search in the larger array from the front until you exceed the sought-for key, then doing binary search on the induced prefix of the larger array. On the next key from the smaller array, start your exponential search in the larger array at the index you stopped at in the previous search.
If the ith search covers b_i range in the indexes of the larger array, the total cost is 2 * sum(lg b_i), where sum(b_i) < 2n. Since lg is convex, the cost is maximized when all the b_i’s are equal, so lg (2n/k).
I suspect the constant factors could be eliminated by reusing more of the exponential overshoot when searching subsequent keys.
Thanks. This algorithm is described in the reference I offer… SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016. In this instance, I am concerned with constants and caching… so whether this is the best algorithm is in question.
(I will come back on this topic later, on this blog.)
As far as caching goes, if the large array is arranged as a level-linked B-tree with parent pointers, I think the exponential search algorithm turns into a finger search algorithm with cost O(k log_B (n/k)).