Daniel is probably already viewing it this way, but readers might benefit from making the example more concrete.
Assume that the lists of sorted integers are “posting lists” from an inverted index, with each integer representing a document in which a word appears. Assume that we have a “query” consisting of 100 words, so that each of the 100 lists of integers represents all the documents that contain a given word. We’d like to return, as the result of the query, all documents that contain at least 3 of the words. How can we do this efficiently? And while we’re at it, wouldn’t it be nice if we could sort the results (the list of matching documents) based on how many of the terms they contain?
Operations like this are essential to the functioning of search engines, and thus making them more efficient is a big deal.
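To make this fully concrete, here is a minimal C++ sketch of the simplest approach, ScanCount, which the comments below discuss; the function name, the counter type, and the assumption of dense document IDs are illustrative choices, not taken from the post.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// ScanCount: for every document ID, count how many posting lists contain
// it, keep the IDs that reach the threshold, and sort them by how many of
// the query terms they contain. A uint8_t counter is enough for up to
// 255 lists; the query here has only 100.
std::vector<std::pair<uint32_t, uint8_t>>
t_overlap(const std::vector<std::vector<uint32_t>>& lists,
          uint8_t threshold, uint32_t universe_size) {
    std::vector<uint8_t> counters(universe_size, 0);
    for (const auto& list : lists)
        for (uint32_t doc : list)
            ++counters[doc];
    std::vector<std::pair<uint32_t, uint8_t>> hits;
    for (uint32_t doc = 0; doc < universe_size; ++doc)
        if (counters[doc] >= threshold)
            hits.emplace_back(doc, counters[doc]);
    // Sort results by how many of the query terms each document contains.
    std::sort(hits.begin(), hits.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return hits;
}
```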
Thank you Nathan.

Nice article! I have implemented some “T-overlap” algorithms here, such as “ScanCount”, “MergeSkip”, “DivideSkip”, and “CPMerge”; the “CPMerge” algorithm described here performs really well.
Thanks for that reference; I was wondering if anyone had any use for this type of algorithm — I came up with MergeSkip in the early 00’s as part of a frequent itemset finder. Any iterator that can “skip” in sublinear time can benefit; I believe even graph cliques are possible.
My version had some tweaks that might help performance — I always kept T elements below the heap in a ring buffer instead of dynamically expanding each run. I started each iteration by swapping the tail of the FIFO with the top of the heap and rotating the ring head pointer. The new heap top would then skip and sift down as in MergeSkip.
One of my to-dos is to look into how this method might compare for simpler types like arrays — thanks for helping to answer that question.
Interesting design! I will definitely try to implement it. My benchmarks show that the bottleneck of this algorithm is the heap operations: Pop and Push. Thank you for the idea!
KWillets says:
It looks like Go’s heap has a Fix method that allows you to update element 0 and sift down, so that’s good (many heap implementations only allow sift-up). The skip operation seems like it would produce a new value near the minimum.
The ring buffer, to be clear, had only one pointer: since it was always full, head and tail were adjacent. A standard two-pointer version might have more overhead.
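For readers following along in C++, where std::priority_queue exposes neither operation, a hand-rolled sketch of the “update element 0 and sift down” primitive might look like this; it is the moral equivalent of Go’s heap.Fix(h, 0), and cheaper than a Pop followed by a Push, which traverses the heap twice.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Replace the minimum of a binary min-heap with a new value, then restore
// the heap property with a single O(log n) sift-down pass.
void replace_top(std::vector<int>& heap, int value) {
    heap[0] = value;
    size_t i = 0, n = heap.size();
    for (;;) {
        size_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && heap[l] < heap[smallest]) smallest = l;
        if (r < n && heap[r] < heap[smallest]) smallest = r;
        if (smallest == i) break;  // heap property restored
        std::swap(heap[i], heap[smallest]);
        i = smallest;
    }
}
```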
Travis Downs says:
I was able to get it down to around 4 cycles/element with a few changes.
Also, on my machine (Skylake i7-6700HQ) the benefit of the cache-blocked technique is even better than you show: I still get ~16 cycles for the blocked approach, but usually around 50 cycles for the other one. It may depend on activity on my system.
“I was able to get it down to around 4 cycles/element with a few changes.”
Can you tell us more?
Nathan Kurz says:
Let’s flip the question and ask: When it took 16 cycles to do what should be a load, a compare, and a write per element, where did you think the extra time was going?
Having not heard back from Travis, I played with it a little. The first thing was to use “-Ofast” instead of “-O2”, which got me down from 16 cycles to 12. Beyond that I had to use “perf record”. The next thing I noticed was that the second loop, which checks the threshold and writes the hits, takes a surprising amount of time. This can easily be combined with the other loop, so that a value gets written as soon as it equals the threshold. This gets it down from 12 to 8.
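A sketch of that fusion, reusing the illustrative ScanCount setup from earlier in the thread (the names are mine, not the benchmark’s): testing for equality rather than greater-or-equal means each document is emitted exactly once, the first time its counter reaches the threshold.

```cpp
#include <cstdint>
#include <vector>

// Fused version: no second pass over the counters; a document is emitted
// the moment its counter reaches the threshold.
void count_and_collect(const std::vector<std::vector<uint32_t>>& lists,
                       uint8_t threshold, std::vector<uint8_t>& counters,
                       std::vector<uint32_t>& hits) {
    for (const auto& list : lists)
        for (uint32_t doc : list)
            if (++counters[doc] == threshold)
                hits.push_back(doc);
}
```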
The next step was to blame C++. Well, maybe that’s just me. But I think there is a big abstraction issue that’s causing it to do a lot more reads and writes than should be necessary. Every time you call “it++” it’s writing the new value to memory (well, cache) and then rereading it on the next iteration.
Worse, it seems like almost everything else is working off the stack too. The IO for the inner loop should be basically a single read and write, but instead it’s got two writes and a half-dozen reads. By the time I figured out what the assembly was doing, my C++ allergy was making it hard for me to breathe normally, so I didn’t actually try to fix this. But if you were to write something straightforward in C, I think you’d easily get down to 4 cycles, possibly fewer. Travis is stronger than I am, and probably managed to avoid the excessive IO without switching languages.
Travis Downs says:
Yeah, the constant reloading from memory was interesting enough that I wrote a post about it: https://travisdowns.github.io/blog/2019/08/26/vector-inc.html
The motivation was exactly this problem. Basically once you have writes to char (or uint8_t) arrays in some loop, you better make sure everything else is an unescaped local variable.
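A distilled illustration of the effect (my own reduction, not the benchmark code): because uint8_t is a character type, the compiler must assume that a store through the counters array can modify any object whose address has escaped, so loop state kept behind a pointer is reloaded from memory after every store.

```cpp
#include <cstddef>
#include <cstdint>

// Slow pattern: *pos lives behind a pointer, and a store through a
// char-type pointer may alias it, so the compiler reloads *pos (and
// re-stores it) on every iteration.
void count_escaped(uint8_t* counters, const uint32_t* docs, size_t n,
                   size_t* pos) {
    while (*pos < n) {
        counters[docs[*pos]]++;  // uint8_t store: may alias anything
        (*pos)++;
    }
}

// Fast pattern: copy to an unescaped local, which can live in a register
// across the uint8_t stores, and write it back once at the end.
void count_local(uint8_t* counters, const uint32_t* docs, size_t n,
                 size_t* pos) {
    size_t i = *pos;
    for (; i < n; ++i)
        counters[docs[i]]++;
    *pos = i;
}
```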
Sorry I didn’t get back to this: I got to 4 cycles quickly (basically std::fill -> memset, fixing the iterator problems Nathan mentions, and vectorizing the final scan of the counters array), but of course I wanted more. I tried a few things that got close to 3 cycles, but then also tried a bunch of unproductive things and got the code into an ugly state without being faster, so I didn’t want to send a PR with that mess.
The speed of light here is 1 cycle since the only absolutely compulsory thing that is O(N) in the size of the input elements seems to be the scattered writes, and I can’t see an easy way to vectorize that (it’s basically similar to radix sort, which is also limited by the scattered writes). It seems hard to get to 1, but 2 could be possible.
“This can easily be combined with the other loop, so that a value gets written as soon as it equals the threshold.”
That’s a good idea. I took a different approach of vectorizing the scan of the counter array: it’s almost all zeros so this goes fast, but I feel your approach may be faster if it can slip in for free among all the other work. I’m going to try it.
Travis Downs says:
“it’s almost all zeros so this goes fast”
Sorry, that should say “it’s almost all values below the threshold”, and that still goes fast. Only about 2,500 hits over the entire input, so the scanning has to be fast but the hit handling doesn’t.
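For concreteness, one way such a vectorized scan could look with AVX2 intrinsics; this is a sketch of the general technique rather than Travis’s actual code, and it assumes the number of counters is a multiple of 32 and that counts fit in a signed byte (they do here, with at most 100 lists).

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>
#include <vector>

// Scan 32 counters at a time. Since nearly all values are below the
// threshold, the compare mask is almost always zero and the hit-handling
// loop almost never runs.
void scan_hits(const uint8_t* counters, size_t n, int threshold,
               std::vector<uint32_t>& hits) {
    const __m256i limit = _mm256_set1_epi8((char)(threshold - 1));
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(counters + i));
        // Signed byte compare is safe: counts never exceed 100.
        uint32_t mask = static_cast<uint32_t>(
            _mm256_movemask_epi8(_mm256_cmpgt_epi8(v, limit)));
        while (mask) {
            // __builtin_ctz is a GCC/Clang builtin (index of lowest set bit).
            hits.push_back(static_cast<uint32_t>(i) + __builtin_ctz(mask));
            mask &= mask - 1;  // clear the lowest set bit
        }
    }
}
```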
Nathan Kurz says:
Great writeup! Comparing it to my semi-coherent blog post reply, I feel slightly outdone by your complete and coherent explanation. I hadn’t figured out that the char-type aliasing was the base issue.
I’m still confused about why the generated assembly rereads all the constants from memory in the inner loop. I tried changing all the uint8_t’s to uint32_t’s (which solves the aliasing issues at the cost of some cache), but GCC still rereads things like ‘threshold’ from the stack on every iteration. I’m not sure that this is a performance issue (clang doesn’t seem to do it and ends up slower), but it seems like a silly choice. Is there a good way to convince GCC to behave more sensibly?
Travis Downs says:
Yes, what you noticed about those reads comes from another effect: the function is complicated enough that it runs out of registers, and GCC makes some less-than-ideal choices about which registers to spill, which results in it reading spilled registers from the stack in the inner loop.
I work around this by using “noinline” to force the loop into a standalone function, where it gets a full set of registers and so doesn’t need to spill. An interesting case where forcing code out of line speeds it up, for reasons that have nothing to do with code-size effects (indeed, the inlined version may be smaller).
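Schematically, that workaround is just this (a sketch; the attribute spelling is GCC/Clang-specific):

```cpp
#include <cstddef>
#include <cstdint>

// Forcing the hot loop out of line gives it a fresh, full set of
// registers, so its loop-carried values are never spilled to the stack.
__attribute__((noinline))
static void hot_loop(uint8_t* counters, const uint32_t* docs, size_t n) {
    for (size_t i = 0; i < n; ++i)
        counters[docs[i]]++;
}
```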