I think that if your integer lists were (sorted and) dense, you could theoretically go significantly below two reads and one write per element on the update operation by partially transposing the update loop, especially if you can benefit from wide reads and writes (AVX2/AVX-512). In the optimal case the integer lists would be read only once with the widest possible loads, and the output array written only once at maximum width, that is, very close to zero memory operations per element. (Actually, you could collect the results right away at this phase.)
The problem here is all the juggling required in the middle. Maybe some of those very un-RISCy vector instructions could save the day? (I don’t believe current architectures would be that good, or at least this would require significant register spilling, which would undo a large portion of the potential benefits.)
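A minimal scalar sketch of that transposed update, assuming the lists really are dense and sorted (the function and parameter names are mine, purely for illustration): each list is expanded into a 0/1 block, and the accumulation loop is the kind of thing a compiler can turn into wide vector adds.

#include <cstdint>
#include <cstddef>
#include <vector>

// Accumulate one sorted list into per-value counters covering the range [lo, hi).
// The expansion step is the part that is hard to vectorize; the accumulation
// loop is trivially auto-vectorizable into wide adds.
void add_list_block(const std::vector<uint32_t>& list, uint32_t lo, uint32_t hi,
                    std::vector<uint8_t>& counters /* size hi - lo */) {
    std::vector<uint8_t> block(hi - lo, 0);
    for (uint32_t v : list) {
        if (v >= lo && v < hi) block[v - lo] = 1;   // 0/1 expansion of the sorted list
    }
    for (size_t i = 0; i < block.size(); i++) {
        counters[i] += block[i];                    // wide reads and writes over many counters at once
    }
}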
Jörn Engel says:
The most common trick in this direction is to use the AH/AL register pair for byte reads or writes. It requires minimal loop unrolling.
read(AL, foo);
read(AH, bar);
write(AX, baz); // writes two bytes in one instruction
It doesn’t always help and compilers tend to forget about this optimization once you get into register pressure. But it allows you to double the number of byte reads/writes without writing vector code.
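For readers who prefer to stay out of assembly, here is a rough C++ equivalent of the same pairing idea (my sketch, not Jörn's code): two byte values are packed into a 16-bit value and written with a single store.

#include <cstdint>
#include <cstddef>
#include <cstring>

// Copy interleaved byte pairs using one 16-bit store per pair instead of two
// byte stores, mimicking the AH/AL pairing. Assumes little-endian byte order,
// as on x86: foo lands in the low byte (AL), bar in the high byte (AH).
void copy_pairs(const uint8_t* foo, const uint8_t* bar, uint8_t* baz, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint16_t ax = uint16_t(foo[i] | (bar[i] << 8));   // pack AL and AH
        std::memcpy(baz + 2 * i, &ax, sizeof ax);         // single two-byte write
    }
}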
foobar says:
That is not the hard part in this case; the hard part is expanding an array of sorted integers into a corresponding bit-mappy vector of 1/0 integers (probably of the same size) in an efficient manner. 🙂
I think this could be accomplished with AVX-512, with core-loop magic consisting basically of a broadcast, a compare-to-mask, and a VPERM instruction. Reasoning about AVX-512 instructions sincerely makes my head hurt, but it seems plausible that this kind of loop could process 8 32-bit input integers per loop iteration for a 50%-dense integer list. What would probably be the performance killer is conditional branching…
foobar says:
Apparently my idea was naive, but you can still perform the following operation relatively efficiently on AVX-512:
A list of sorted unique integers (32 32-bit integers in a pair of AVX-512 registers):
0, 3, 4, 6, 9, 10, 11, 20, 25, 26, 30, 35, …
can be converted to a list of 0/1 integers (or maybe their two’s complements) in similarly sized vector entries:
1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0 (35 didn’t fit in the first 32 entries)
These results could be added to the counters with a vectorized operation, 16 entries at a time.
My back-of-the-envelope calculation says this operation requires around 4 instructions for masking outlying values, 2 instructions for adjusting offsets, 2 variable vector shifts, 11 operations for a 32-way reduction (OR or ADD operations and shuffles), and two TEST-like compares against constant vectors. That is in the ballpark of 23 uops per 32 resulting output values. (No, I didn’t write the actual code…)
Density of the input lists is a crucial point, but with densities above 20% this approach might be very competitive with less vectorized approaches.
foobar says:
The benchmarks on the scancount problem are sparse, but I believe there is a chance for another toy problem here (unless it’s already solved!):
How fast can you convert a sorted array of integers into the corresponding bitmap?
That is essentially the reverse of this problem: https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/
My bet is that a naive implementation is fastest for sparse bitmaps (roughly one clock cycle per set bit, but not that fast for dense bitmaps!), but an AVX-512 implementation might be faster for dense bitmaps (as a wild-assed guess, possibly something like four bits of bitmap produced per clock cycle).
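For reference, the naive scalar version of that conversion might look like the following sketch (names are mine, roughly one shift-and-OR per set bit):

#include <cstdint>
#include <vector>

// Convert a sorted array of integers in [0, universe) into a bitmap.
std::vector<uint64_t> to_bitmap(const std::vector<uint32_t>& sorted, uint32_t universe) {
    std::vector<uint64_t> bits((universe + 63) / 64, 0);
    for (uint32_t v : sorted) {
        bits[v >> 6] |= uint64_t(1) << (v & 63);   // set bit v
    }
    return bits;
}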
Nathan Kurz says:
Assuming that our hardware can do either two loads per cycle or one
store per cycle, then we get a limit of at least two cycles per input
element.
For modern Intel processors you can actually do two loads and a write within the same cycle. The trickiness is that the addressing mode for the write would need to be “simple”, which is to say, based on just one register plus a fixed offset (rather than a base and index register). That’s difficult to achieve in this case, since we want to index into the counter array by the value we just read. So in practice, I think you are right that this algorithm can’t achieve 1 cycle timings. But this is because of the demands of the algorithm, rather than a simple hardware limitation.
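For concreteness, the inner loop under discussion looks roughly like this (a sketch, not the benchmark’s exact code); the store address depends on the freshly loaded value, which is what rules out the “simple” addressing mode:

#include <cstdint>
#include <cstddef>

void scancount_update(const uint32_t* data, size_t n, uint8_t* counters) {
    for (size_t i = 0; i < n; i++) {
        counters[data[i]]++;   // one load for data[i], one load + one store for the counter
    }
}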
And it’s not necessarily the case that just because we can’t achieve 1.0 cycles, that the minimum jumps to 2.0 — it might be fractional. For example, a pattern of R+R, R+W, R+W gives you 4 reads and 2 writes in 3 cycles, which allows 1.5 cycles per element. Anyway, by fiddling with the range and array sizes, I’ve gotten results that are getting pretty close to 2 cycles per iteration. For example, demo(100000000, 1500000, 50, 5) gets down to 2.2 cycles/element for scalar:
nate@skylake:~/git/fastscancount$ perf record counter
Got 9924 hits
optimized cache-sensitive scancount
2.19437 cycles/element
3.66452 instructions/cycles
0.00227307 miss/element
AVX2-based scancount
2.48756 cycles/element
2.5749 instructions/cycles
0.00199687 miss/element
I still don’t quite understand why some sizes are faster than others. I think the issue is that both the data and counter arrays are competing for the same space in L3, such that increasing the number or size of the data arrays causes significantly slower results. It’s possible that a streaming read for the data (reading in a way that avoids polluting L3) might help get things a little bit faster.
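One way to experiment with that streaming-read idea (again only a sketch; whether a non-temporal prefetch actually limits L3 pollution is microarchitecture-dependent):

#include <xmmintrin.h>
#include <cstdint>
#include <cstddef>

void scancount_update_nta(const uint32_t* data, size_t n, uint8_t* counters) {
    for (size_t i = 0; i < n; i++) {
        // Non-temporal prefetch a few cache lines ahead of the data stream.
        _mm_prefetch(reinterpret_cast<const char*>(data + i + 64), _MM_HINT_NTA);
        counters[data[i]]++;
    }
}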
Travis Downs says:
The trickiness is that the addressing mode for the write would need to
be “simple”, which is to say, based on just one register plus a fixed
offset (rather than a base and index register). That’s difficult to
achieve in this case, since we want to index into the counter array by
the value we just read.
One way to get simple addressing is to use a static (global) array. Then the only register needed in the addressing expression is the one holding the just loaded value: the array base is a constant offset. Here’s a godbolt.
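The shape of the code being pointed at is something like this (my reconstruction, not the linked godbolt): with a global array, the store address is just a constant base plus the loaded value.

#include <cstdint>
#include <cstddef>

static uint8_t counters[1 << 20];   // global array (size illustrative): its base address is a constant

void scancount_update_static(const uint32_t* data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        counters[data[i]]++;        // store address = constant base + value, i.e. “simple” addressing
    }
}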
Of course, this code is not thread-safe, so if you really wanted to do this type of really, really low-level optimization and keep things thread safe, you’d have to use a mutex to protect this code, perhaps falling back to the non-static array variant if the function is called while another thread holds the lock.
One downside of this approach is that you don’t really have any control over the allocation of the array, in particular you probably won’t be able to ask for huge pages (perhaps you can if the array lives in the .bss and you do a madvise before the page is accessed – but there are a lot of issues to work out).
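The .bss-plus-madvise idea could look something like this (a Linux-specific, untested sketch; as noted, there are a lot of issues to work out):

#include <sys/mman.h>
#include <cstdint>

alignas(2 * 1024 * 1024) static uint8_t counters[1 << 24];  // lives in .bss, aligned to a 2 MiB boundary

void request_huge_pages() {
    // Ask the kernel to back the (not yet touched) array with transparent huge pages.
    madvise(counters, sizeof(counters), MADV_HUGEPAGE);
}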
Another approach is to keep using a dynamically allocated counters array, but JIT-compile the core loop embedding the array offset as a static offset in the same way as above. Yes, a crazy hack.
I think the issue is that both the data and counter arrays are
competing for the same space in L3, such that increasing the number or
size of the data arrays causes significantly slower results.
How big was the counter array? I thought that most of the performance effects are probably related to L1 cache misses on the counter array: stores to L2 (i.e., stores that miss in L1 but hit in L2) have a speed limit of 3 cycles per store, so any result less than 3 cycles indicates that you are getting at least some hits in L1 for the counter increments, which I think requires a small counter array (or dense values, i.e., more than one hit in the same cache line).
Travis Downs says:
I was using textbook C++ with iterators and STL algorithms. Converting
the problem into lower-level code (using pointers or indexes) seems
better for performance.
I would caution readers against using this as any kind of general rule.
In fact, for the aliasing issue, using iterators or even more idiomatic C++ such as for-each loops would also have helped, versus, say, indexing with vector[i] (though it may not be as simple if you want the value of i for another purpose in the loop).
Some C++ abstractions are zero cost, some are not. As we’ve seen here, replacing plain array indexing with vector indexing is not always zero cost, especially in the presence of char writes and the consequent aliasing problems.
You can still make it fast using C++ idioms, but indeed it might require a bit more care.
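As an illustration of the kind of care involved (a hypothetical example, not the code from the post): because byte stores through uint8_t may alias the vector object itself, the first version below can force the compiler to re-read the vector’s data pointer on every iteration, while the second reads it once.

#include <cstdint>
#include <cstddef>
#include <vector>

void bump_idiomatic(std::vector<uint8_t>& counters, const std::vector<uint32_t>& idx) {
    for (size_t i = 0; i < idx.size(); i++)
        counters[idx[i]]++;            // each byte store may alias counters' internals
}

void bump_hoisted(std::vector<uint8_t>& counters, const std::vector<uint32_t>& idx) {
    uint8_t* p = counters.data();      // data pointer loaded once
    for (size_t i = 0; i < idx.size(); i++)
        p[idx[i]]++;                   // plain pointer indexing
}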
Of course, we also need to load data: say one load to get the element,
and one load to get the corresponding counter.
Strictly speaking you don’t need 1 load to get the element: these elements are adjacent, so in principle you can get say 2 elements with a 64-bit load, or 8 elements with a 256-bit vector load. In practice this doesn’t help too much because your “budget” of stuff you can do per cycle (4 ops) is too small for this to save much – but it’s something to keep in mind when one input or output is contiguous.
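A sketch of what that amortization could look like with AVX2 (illustrative only, names are mine; the counter updates themselves still cost one load and one store each):

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void scancount_update_wide(const uint32_t* data, size_t n, uint8_t* counters) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        // One 256-bit load fetches 8 adjacent 32-bit elements.
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        alignas(32) uint32_t buf[8];
        _mm256_store_si256(reinterpret_cast<__m256i*>(buf), v);
        for (int j = 0; j < 8; j++)
            counters[buf[j]]++;        // still one load + one store per counter
    }
    for (; i < n; i++)
        counters[data[i]]++;           // scalar tail
}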
I think a carefully tuned implementation could achieve ~1.5 cycles per element for your parameters (20M, 50K, 100, 3), which are kind of a worst case (admittedly the “worst case” region is very large). At some other points you could do better, e.g., when the data arrays are dense, since vectorized approaches become feasible, or when the threshold is much larger (since you can eliminate many elements without performing the corresponding store).
I agree with you. My statement was not meant as a prescription.