Why not do the galloping algorithm but alternating from the small array head and tail (1, N, 2, N-1, 3, …) and cache the upper and lower indices of the last binary search on the large array? Not a big saving, approximately M*log 2 fewer ops but possibly comparable.
I’d love to see a benchmark of your idea. Did you try to code it up?
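For concreteness, here is a minimal sketch of the suggested two-ended variant (the names and structure are illustrative, not the article’s code, and std::lower_bound stands in for a true galloping probe). It walks the small array from both ends and confines every probe of the large array to the window left over by earlier probes; it assumes both arrays are sorted and duplicate-free.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch only: intersect a small sorted, duplicate-free array with a large one,
// alternating between the head and the tail of the small array and caching the
// bounds established by previous searches to shrink each binary search.
std::vector<uint32_t> intersect_two_ended(const std::vector<uint32_t>& small_arr,
                                          const std::vector<uint32_t>& large_arr) {
    std::vector<uint32_t> out;
    size_t head = 0, tail = small_arr.size();   // next elements of the small array to probe
    size_t lo = 0, hi = large_arr.size();       // cached search window inside the large array
    bool from_head = true;
    while (head < tail && lo < hi) {
        if (from_head) {
            // probe the smallest unprocessed element; later elements cannot match below `lo`
            auto it = std::lower_bound(large_arr.begin() + lo, large_arr.begin() + hi, small_arr[head]);
            lo = static_cast<size_t>(it - large_arr.begin());
            if (lo < hi && *it == small_arr[head]) out.push_back(small_arr[head]);
            ++head;
        } else {
            // probe the largest unprocessed element; earlier elements cannot match at or above `hi`
            auto it = std::lower_bound(large_arr.begin() + lo, large_arr.begin() + hi, small_arr[tail - 1]);
            size_t pos = static_cast<size_t>(it - large_arr.begin());
            if (pos < hi && *it == small_arr[tail - 1]) out.push_back(small_arr[tail - 1]);
            hi = pos;
            --tail;
        }
        from_head = !from_head;
    }
    std::sort(out.begin(), out.end());          // matches were appended from both ends
    return out;
}
```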
I tried something along the suggested lines recently. For each element in the small array I kept an upper and a lower limit, which I updated depending on the values I encountered while probing. The small array was in memory, the large array on disk. (Since disk probes are very expensive, I figured I could afford to spend some time updating the limits.) It worked quite nicely in the sense that it saved a lot of disk probes in typical cases.
However, the running time was less than ideal. It took me a while to figure out what was happening. Because the elements of the small array now all had different lower and upper limits, the midpoints would differ ever so slightly all the time, thereby frustrating the disk cache.
This led me to discard the idea of adjusting the limits and instead add an LRU cache to the binary search. Since a large part of the “search tree” is repeated for each lookup, this resulted in a handsome speedup!
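A minimal sketch of that final design, assuming an expensive probe (the names, the toy LRU cache, and the in-memory stand-in for the on-disk array are all illustrative): because every lookup starts from the same midpoints, the top levels of the implicit search tree stay in the cache.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for the large sorted array on disk; a real implementation would read from storage.
std::vector<uint32_t> on_disk;
uint32_t expensive_probe(size_t pos) { return on_disk[pos]; }

// Tiny LRU cache keyed by probe position.
class ProbeCache {
    size_t capacity_;
    std::list<std::pair<size_t, uint32_t>> order_;   // most recently used at the front
    std::unordered_map<size_t, std::list<std::pair<size_t, uint32_t>>::iterator> index_;
public:
    explicit ProbeCache(size_t capacity) : capacity_(capacity) {}
    uint32_t get(size_t pos) {
        auto hit = index_.find(pos);
        if (hit != index_.end()) {                   // hit: refresh recency, skip the probe
            order_.splice(order_.begin(), order_, hit->second);
            return hit->second->second;
        }
        uint32_t value = expensive_probe(pos);       // miss: pay for the probe
        order_.emplace_front(pos, value);
        index_[pos] = order_.begin();
        if (order_.size() > capacity_) {             // evict the least recently used entry
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        return value;
    }
};

// Ordinary binary search with fixed midpoints; every probe goes through the cache.
bool cached_contains(ProbeCache& cache, size_t n, uint32_t target) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t v = cache.get(mid);
        if (v < target)      lo = mid + 1;
        else if (v > target) hi = mid;
        else return true;
    }
    return false;
}
```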
Leonid Boytsov says:
It has been fascinating to observe the trend from “let’s design the fastest algorithm” to “let’s design an algorithm that ||-es better”.
Dan says:
If the two arrays are sorted, can’t the same be accomplished using a ‘merge’ operation with complexity O(M) faster? (no comparison with N-array needs to be made twice)
Dan says:
To clear up my last comment, it seems your benchmark code is doing a ‘merge’ in the sense that line 115 says idx_l = index1; and line 84 says int base1 = idx_l;. My proposed optimization would be to have line 115 use index4 instead, to really get ahead in the long array.
In any case, the complexity remains O(M log N + M).
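A sketch of the proposed tweak (illustrative code, not the benchmark itself), assuming both arrays are sorted: each lookup in the long array resumes from the position the previous lookup reached, so no region of the long array is searched twice.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> intersect_with_carry(const std::vector<uint32_t>& small_arr,
                                           const std::vector<uint32_t>& large_arr) {
    std::vector<uint32_t> out;
    size_t carry = 0;                                          // plays the role proposed for idx_l
    for (uint32_t target : small_arr) {
        auto it = std::lower_bound(large_arr.begin() + carry, large_arr.end(), target);
        carry = static_cast<size_t>(it - large_arr.begin());   // the next target is >= this one
        if (it != large_arr.end() && *it == target) out.push_back(target);
    }
    return out;
}
```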
It depends on whether you expect the arrays to be strictly sorted or merely sorted (possibly with repeated values).
Dan says:
I still can’t see how repeats could make idx_l = index4 fail, since index4 points to target4 if it is found, and the next target value will be at or after index4. If target4 is not found, index4 will be just one cell past the values lower than target4 in the long list, and the next target in the short list must be at least as large by non-strict monotonicity (I probably should be coding and running this myself).
You might be right.
I think you are right. I have updated the code.
It might be a bit faster. New timing?
Deliberately, I am giving rough numbers with one or two significant digits. My timings are not going to be affected by micro-optimizations.
Leonid Boytsov says:
I think the old complexity definition largely doesn’t work nowadays. For postings processing we have a bunch of “same-complexity” methods to do unions and intersections, but their run-times can easily differ by orders of magnitude.
I agree. Even on the JVM, the same or worse complexity can win when some exposed feature of the underlying hardware can be utilised.
Travis Downs says:
Indeed.
Sometimes the problem is simply that the wrong thing is being counted. E.g., most searching algorithms give the complexity in the number of “comparisons”.
However, for searching in large arrays or many small arrays (such that the aggregate is large), what dominates is the memory access time, so you’d be better off counting something like “unique memory locations accessed”, with the idea that once you’ve accessed a location, it’s basically free to access it again (it’s cached). Of course, even that’s not exactly right: caching isn’t forever, and adjacent locations (within a cache line) come along for the ride, so maybe you should count “unique cache lines accessed” – but if you go far enough down that rabbit hole you end up building a complete machine model.
Leonid Boytsov says:
This, but also the effect of parallelization, which is often hard to take into account. Things that run in parallel often end up using little or no extra resources.
Frank Tetzel says:
Looks like interleaved execution but done by hand. I don’t think that interleaving all operations one by one gives any benefit. Only the memory access in parallel results in a speedup.
There are recent publications using C++ coroutines for interleaved execution which provide a nicer abstraction and leave the algorithm mostly unchanged:
http://www.vldb.org/pvldb/vol11/p230-psaropoulos.pdf
There was also a recent CppCon talk about the topic:
https://www.youtube.com/watch?v=j9tlJAqMV7U
Thanks, the CppCon talk is well worth watching.
I understand your comment better now. Yes, this is an even lower level of interleaving than the approaches presented there, which break the work into resumable “tasks” – either by hand using an array of state objects, or using the Coroutines TS – and then “schedule” these tasks round-robin using a more-or-less generic scheduler.
Here, the algorithm is simply modified to process more than one element at a time within the main loop. It’s a less generic approach (in particular, you cannot easily vary the number of interleaved elements), but in principle it can be faster, since the barrier between the scheduler and the tasks is removed. For example, either task-based approach will probably never keep state values in registers, even if there are few enough tasks that it would be feasible, but in this kind of hard-coded batched loop it is possible.
Now, the relative results here are much worse: a speedup of ~1.4x vs more than 3x for coroutines, but this probably has more to do with Java vs C++ than with any inherent deficiency.
Frank Tetzel says:
There is still a fundamental difference: the number of computations between memory accesses.
In your shotgun approach you have all the memory accesses right after each other, and the next operations require the loaded values. There is not enough computation to hide loading the first value from main memory, resulting in memory stalls. By grouping all operations before the memory access together and issuing only a prefetch, we increase the amount of computation done before we use the loaded value. If we have N searches, the other (N-1) searches execute all the operations leading up to their next prefetch while a value is being fetched. We hide the memory latency with useful computation, resulting in fewer CPU stalls.
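A rough sketch of that grouping, assuming gcc/clang for __builtin_prefetch (the group size and names are made up): each round first issues a prefetch for every active search in the group, then consumes the values, so the memory requests of the whole group can be in flight at the same time.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t GROUP = 8;   // number of interleaved searches per group (illustrative)

void group_prefetch_contains(const std::vector<uint32_t>& haystack,
                             const std::vector<uint32_t>& needles,
                             std::vector<uint8_t>& found) {
    found.assign(needles.size(), 0);
    for (size_t base = 0; base < needles.size(); base += GROUP) {
        const size_t n = std::min(GROUP, needles.size() - base);
        size_t lo[GROUP] = {0};
        size_t hi[GROUP];
        for (size_t i = 0; i < n; ++i) hi[i] = haystack.size();
        for (bool active = true; active; ) {
            active = false;
            for (size_t i = 0; i < n; ++i)                     // phase 1: prefetch all midpoints
                if (lo[i] < hi[i])
                    __builtin_prefetch(&haystack[lo[i] + (hi[i] - lo[i]) / 2]);
            for (size_t i = 0; i < n; ++i) {                   // phase 2: consume and narrow
                if (lo[i] >= hi[i]) continue;
                active = true;
                const size_t mid = lo[i] + (hi[i] - lo[i]) / 2;
                const uint32_t v = haystack[mid];
                if (v < needles[base + i])      lo[i] = mid + 1;
                else if (v > needles[base + i]) hi[i] = mid;
                else { found[base + i] = 1; lo[i] = hi[i]; }   // found: retire this search
            }
        }
    }
}
```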
Travis Downs says:
In this example of binary search (it’s the same example as in the linked CppCon talk), there is essentially no computation: maybe 1 or 2 cycles of work (a shift, a compare and a branch), compared to the hundreds of cycles needed for the memory access. So compared to the memory access time, the time needed for computation rounds down to 0.
So the goal here is not to overlap memory access with computation, but memory access with more memory accesses. That’s how the coroutine method gets its speedup.
In either Daniel’s code or the coroutine case, ignoring the complication of prefetching vs loading, the pattern is the same: a bunch of loads, then a bunch of “computation” (not actually a bunch at all). So having “all memory accesses right after each other” is fine (the coroutine version does the same thing, at least if the arrays are the same size) – the only requirement is that the adjacent loads are independent, so they can all execute without waiting for one another.
Within a single search the loads are all dependent. Considering the whole group of searches, there are many independent loads (each new search involves loads independent of all the other ones), but the CPU can’t “see” them in the middle of a single search, because all it can see is a stream of dependent loads belonging to the current search. Its internal resources will often fill up before it speculates into the next search (and even then you are only at 2 parallel searches: you want many more).
So interleaving is all about reorganizing the stream of accesses so any random window of say 50-200 instructions contains a lot of independent accesses. The code Daniel presents here and coroutines both accomplish that.
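For contrast with the group-prefetch sketch above, here is a corresponding plain-interleaved loop in C++ (the article’s benchmark is in Java; this is only an illustrative sketch): four searches advance in lockstep, so every iteration of the inner loop contains four independent loads the out-of-order core can overlap.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain interleaving, no prefetch: the benefit comes purely from the four
// independent haystack loads visible in each iteration of the inner loop.
// Assumes a sorted haystack; the tail (needles.size() % 4) is ignored for brevity.
void interleaved4_contains(const std::vector<uint32_t>& haystack,
                           const std::vector<uint32_t>& needles,
                           std::vector<uint8_t>& found) {
    found.assign(needles.size(), 0);
    for (size_t j = 0; j + 4 <= needles.size(); j += 4) {
        size_t lo[4] = {0, 0, 0, 0};
        size_t hi[4] = {haystack.size(), haystack.size(), haystack.size(), haystack.size()};
        while (lo[0] < hi[0] || lo[1] < hi[1] || lo[2] < hi[2] || lo[3] < hi[3]) {
            for (int k = 0; k < 4; ++k) {
                if (lo[k] >= hi[k]) continue;
                const size_t mid = lo[k] + (hi[k] - lo[k]) / 2;
                const uint32_t v = haystack[mid];              // independent of the other three loads
                if (v < needles[j + k])      lo[k] = mid + 1;
                else if (v > needles[j + k]) hi[k] = mid;
                else { found[j + k] = 1; lo[k] = hi[k]; }
            }
        }
    }
}
```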
Frank Tetzel says:
Looks like you are right about the computation. There is just not enough of it in a binary search. I did some quick & dirty measurements with the following code:
https://github.com/tetzank/interleaved-algorithms
Group prefetching (simpler version of what coroutines do) and plain interleaving (called shotgun in the article) have pretty much the same performance on my machine. I have not thoroughly profiled it and only briefly checked the assembly (gcc creates branchless code (cmov) for both). I also threw in an AVX2 version. On Haswell it’s slower as the gather instruction is not performant enough. Might be better on Skylake.
Just a testbed to get some numbers into the discussion. And, eventually, to try it on other memory latency bound algorithms.
Travis Downs says:
I tested your SIMD code and it is good on Skylake, the 2-way version being at least twice as fast as the fastest “plain4” interleaved variant.
The 2-way SIMD version was nearly twice as fast as the 1-way SIMD version which was itself slightly faster than the fastest plain variants.
Travis Downs says:
“Looks like interleaved execution but done by hand. I don’t think that interleaving all operations one by one gives any benefit. Only the memory access in parallel results in a speedup.”
Well I think Daniel’s result here shows that it does give a benefit, right?
In particular, the parallel memory access is made possible (or at least “more possible”) by interleaving since there are more independent memory operations visible in the out-of-order window of the processor. Typical binary search is kind of a worst case for memory parallelism since each access in the search depends on the prior one, either via a control dependence (with a probability of a wrong prediction and hence a pointless probe) or a data dependency (even worse since now all accesses are serialized).
So yeah, “only the memory access in parallel results in a speedup”, but that’s like saying “only the sugar and chocolate in this brownie results in deliciousness” or something to that effect.
I will check out the work you linked – but my initial question is always “how much does the abstraction cost?” – even a few instructions limit how fine-grained you can make the switching. So I am very interested in this stuff, but still in “see it when I believe it” mode.
Frank Tetzel says:
The problem is that with interleaving each operation you carry a lot of state around. If every value still fits in the registers, it’s probably faster than other approaches. But if it doesn’t, one loads and stores too often from and to the stack. Sure stack is in L1, but still. I would assume that it limits the number of searches you can interleave too much. On Intel CPUs, we can have up to 10 memory loads in flight. Getting close to 10 interleaved searches would be ideal.
Travis Downs says:
Yes, there is definitely a limit to the interleaving before code quality drops and you start spilling registers, and this limit depends on the number of registers available (obviously) and also on the code in question. Sometimes you can come up with a clever transformation that reduces the number of registers used.
In this example, I think you only need one additional register per interleaved search, plus a few other registers (whose number doesn’t depend on the number of searches) as temporaries and to track the current search span – so you could in principle get to your desired limit of 10 on a 16-register architecture.
As you point out, spilling isn’t necessarily all that bad if you are fighting main memory latency as your primary limiting factor, since spills to L1 take only a small fraction of the time of a miss to main memory (but spills also tend to clog up the out-of-order window with more instructions, which may also limit the achieved memory parallelism).
Coroutines don’t have any magic way around spills though, do they? They also have to deal with saving and restoring registers, and I don’t see them having access to any tricks that the register allocator in a compiler would not (on the contrary, the “single moment in time” switch implies additional restrictions which result in more spills).
Maynard Handley says:
But that’s always the case. Optimal code has to target the particular properties of the machine, and there are limits to how much you can hope for the compiler to intuit your goal and match it to the hardware!
For example back in the day (early 90s) I wrote a DFT to handle AAC audio. Most practical books presented radix-2 code, and for x86 that was probably the optimal solution given the paucity of x87 registers. But for PPC I could just fit a radix-8 DFT into the 32 FP registers.
This seems to me analogous — optimal structure (ie optimal “interleaving”) for x86 may differ from optimal structure for “generic” ARM which will in turn differ from optimal structure for Apple ARM.