Really fast bitset decoding for “average” densities
35 thoughts on “Really fast bitset decoding for “average” densities”
Yes definitely. It seems Cortex cores can predict this better, for example Cortex-A72 has fewer than 13K mispredictions for all tests, but your results show over 20K mispredicts on x86 and Ampere.
@Wilco
Took me some time, but I set up an ARM box of my own.
I confirm that the Cortex-A72 is close to Intel Skylake as far as cycles, and mispredicted branches. Sometimes Intel does better, sometimes the A72 does better. The differences are not large from what I see (that’s what you’d expect from competitive technologies).
The Cortex-A72 is definitively superior to Ampere’s Skylark on this benchmark.
It is still inferior to Intel in the end because it does not appear to be able to cram as many instructions per cycle…
So you can reproduce the slowdown of basic_decoder. Note you’re quoting the non-inline version of basic_decoder rather than the main loop where it is inlined, which is where GCC9 adds the redundant popcount.
@Wilco
I have run the test on Apple’s A12 using the current version of Xcode…
https://github.com/lemire/iosbitmapdecoding
The numbers do not make much sense: they would imply that Apple’s A12 is highly inefficient.
The opposite is true: the A12 is highly effective on a per cycle basis: https://lemire.me/blog/2019/05/15/bitset-decoding-on-apples-a12/
Reported on GCC bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90594
Excellent. We need good code generation.
A similar problem comes up when looking for certain bytes or patterns. The best approach is to use vector comparisons and movemask, which yields bitmaps and, essentially, your problem.
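For instance, a minimal sketch of the compare-and-movemask step (my own illustration, assuming SSE2 and a search for a single byte value; the helper name is mine). The resulting bitmap is exactly the kind of input the decoders in the post consume:

#include <immintrin.h>
#include <stdint.h>

// Build a 16-bit bitmap marking which of the 16 input bytes equal 'target'.
static inline uint32_t byte_match_bitmap(const uint8_t *p, uint8_t target) {
  __m128i chunk = _mm_loadu_si128((const __m128i *)p);
  __m128i eq = _mm_cmpeq_epi8(chunk, _mm_set1_epi8((char)target));
  return (uint32_t)_mm_movemask_epi8(eq); // one bit per byte position
}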
One idea I have yet to try is to sort bitmap-words based on popcount. Processing the sorted bitmap-words can completely eliminate the mispredicted branches. You have to do more memory operations, which will partially offset the misprediction savings. But more importantly the algorithm gets rather complicated, so you are trading your own time and sanity for runtime performance. So far I couldn’t bring myself to make that trade.
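As a rough illustration of that idea (my own sketch, not something either commenter benchmarked): bucket the words by popcount with a counting sort, and use a prefix sum of popcounts so the output still lands at the usual offsets.

#include <cstddef>
#include <cstdint>
#include <vector>

void decode_sorted_by_popcount(const uint64_t *words, std::size_t n, uint32_t *out) {
  // Prefix sum of popcounts gives each word's output offset, so results land
  // in the usual order even though the words are processed out of order.
  std::vector<uint32_t> offset(n);
  std::vector<uint32_t> bucket[65];
  uint32_t total = 0;
  for (std::size_t i = 0; i < n; i++) {
    int c = __builtin_popcountll(words[i]);
    offset[i] = total;
    total += (uint32_t)c;
    bucket[c].push_back((uint32_t)i);
  }
  // Every word in a bucket has exactly c set bits, so the inner loop has a
  // fixed trip count and its branch becomes easy to predict.
  for (int c = 1; c <= 64; c++) {
    for (uint32_t i : bucket[c]) {
      uint64_t bits = words[i];
      uint32_t *dst = out + offset[i];
      for (int k = 0; k < c; k++) {
        dst[k] = i * 64u + (uint32_t)__builtin_ctzll(bits);
        bits &= bits - 1;
      }
    }
  }
}

The extra pass and the offset array are the “more memory operations” mentioned above.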
I have tried it and at least for my use cases it didn’t turn out faster. The sorting ate up my savings (I used a type of radix sort which I think was basically ideal for this scenario).
However, as I recall my baseline was quite a bit faster than the ~4 cycles shown here: I think it was closer to 1.5 cycles.
@Travis @Jörn
I did implement what you describe. It is in the code but not covered by the blog post…
This can be done branchlessly with AVX2+BMI2 if you use the technique described in https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask/36951611#36951611, using an array of sequential numbers as the src to the “compress” function.
With AVX512 you can do an entire 64-bit integer at once with a single instruction, vpcompressb (_mm512_mask_compressstoreu_epi8(dest, bitset, rangeOf0Through64)). (With AVX512_VBMI2, not AVX512F.)
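A minimal sketch of that byte-compress idea (my own illustration, assuming AVX512_VBMI2 is available; the helper name and the identity table are mine). It emits the byte positions 0..63 of the set bits; they would still need to be widened and offset by the word’s base index to produce the 32-bit indices used elsewhere in this thread:

#include <immintrin.h>
#include <cstdint>

static inline int compress_positions_epi8(uint8_t *out, uint64_t bits) {
  alignas(64) static const uint8_t iota[64] = {
       0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
      16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
      32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
      48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63};
  __m512i identity = _mm512_load_si512(iota);
  // Keep only the byte lanes whose bit is set, packed to the front of 'out'.
  _mm512_mask_compressstoreu_epi8(out, (__mmask64)bits, identity);
  return (int)_mm_popcnt_u64(bits); // how many positions were written
}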
And now I see something like this mentioned in one of the further reading links :).
I think that I allude to the vectorization option in my post, with the important caveat:
If the words you are receiving have few 1-bits (say, less than one 1-bit per 64-bit word), then you have the sparse regime, and it becomes important to quickly detect zero inputs, for example. If half of your bits are set, you have the dense regime and it is best handled using vectorization and lookup tables.
So I think vector instructions will have trouble coping with the kind of data I am considering in this post.
This is not to say that you cannot do it “branchlessly”: you can. You can bring down cache misses to almost zero. The trick is to do that while also not increasing instruction count too much.
How well does a switch statement work? e.g.
int count = popcnt(word);
int* resultEnd = result + count;
switch(count) {
case 64: case 63: // ...etc
  for(int i = -count + 4; i; i++) {
    resultEnd[i-4] = trailingzeroes(word);
    word &= word-1;
  }
case 4: // add more cases if desired
  resultEnd[-4] = trailingzeroes(word);
  word &= word-1;
case 3:
  resultEnd[-3] = trailingzeroes(word);
  word &= word-1;
case 2:
  resultEnd[-2] = trailingzeroes(word);
  word &= word-1;
case 1:
  resultEnd[-1] = trailingzeroes(word);
case 0:
  break;
}
This could reduce “wasted” actions. If the exact ordering doesn’t matter, you could also skip the resultEnd calculation.
Also, the initial branch may be expensive in this case if the number of bits set isn’t always the same. You could probably do some hybrid approach where you do a more coarse grained switch if the number of bits set happens to often fall within a particular range, e.g.:
int count4 = popcnt(word) & -4;
// example for granularity = 4, change values if more appropriate
int* resultEnd = result + count4;
switch(count4) {
case 60: case 56: // ...etc
  // do some loop
case 8:
  resultEnd[-8] = trailingzeroes(word);
  word &= word-1;
  resultEnd[-7] = trailingzeroes(word);
  word &= word-1;
  resultEnd[-6] = trailingzeroes(word);
  word &= word-1;
  resultEnd[-5] = trailingzeroes(word);
  word &= word-1;
case 4:
  resultEnd[-4] = trailingzeroes(word);
  word &= word-1;
  resultEnd[-3] = trailingzeroes(word);
  word &= word-1;
  resultEnd[-2] = trailingzeroes(word);
  word &= word-1;
  resultEnd[-1] = trailingzeroes(word);
  word &= word-1;
case 0:
  resultEnd[0] = trailingzeroes(word);
  word &= word-1;
  resultEnd[1] = trailingzeroes(word);
  word &= word-1;
  resultEnd[2] = trailingzeroes(word);
}
Although this approach does make it much closer to what you already have.
Depending on the bits set, a fast SIMD+lookup approach may still be reasonable even if there’s a smallish number of bits typically set, e.g.
int wordPart16 = (word & 0xffff) << 4;
_mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
result += _mm_popcnt_u32(wordPart16);
wordPart16 = (word>>12) & 0xffff0;
_mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
result += _mm_popcnt_u32(wordPart16);
wordPart16 = (word>>28) & 0xffff0;
_mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
result += _mm_popcnt_u32(wordPart16);
wordPart16 = (word>>44) & 0xffff0;
_mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
//result += _mm_popcnt_u32(wordPart16); // if needed
This may still be competitive if there’s only 6 bits per 64-bit word.
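For completeness, a sketch (mine; the layout is assumed, 16 bytes per 16-bit pattern) of the lookup table the fragment above relies on. Note that the fragment would still need to add the per-chunk base (16, 32, 48) to the looked-up positions, e.g. with a vector add or separate pre-offset tables per chunk:

#include <cstdint>

static uint8_t table[65536][16]; // 1 MiB: bit positions for every 16-bit pattern

static void build_table() {
  for (uint32_t pattern = 0; pattern < 65536; pattern++) {
    int n = 0;
    for (uint32_t bit = 0; bit < 16; bit++)
      if (pattern & (1u << bit))
        table[pattern][n++] = (uint8_t)bit;
    while (n < 16)
      table[pattern][n++] = 0; // padding; harmless since the output pointer
                               // only advances by the popcount
  }
}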
I have tried both of these techniques. The vectorized decoding is already part of CRoaring https://github.com/RoaringBitmap/CRoaring/blob/master/src/bitset_util.c#L553
It works and this library is in production in some large corporations.
However, I have not managed to make it competitive in the average-density scenario.
I tried the switch case approach but I could not make it run faster though I must admit that I did not do much more than just code some C and record timings.
My benchmark is rather simple: have you tried modifying it to test your ideas? Get in touch if you get promising results.
Ah, you’re using 32-bit indices – that will seriously hurt a SIMD approach.
On the flipside, if AVX512 is usable, you can use the VCOMPRESSD technique instead, which should perform very well.
I suspect the naive switch statement will perform better than the naive loop approach. The performance of the jump could be problematic if there’s any unpredictable variation in the number of bits set.
The coarse grained switch works better with variation of bits set, but is more similar to the simdjson decoder. Performance ultimately depends on the density characteristics I think; simdjson is probably slightly better if <8 bits are usually set, as it can avoid a jump.
If it’s rare that there will be more than 1 bit set per 4, one could break it into nibbles and use PSHUFB to obtain indices, which could perform okay. Looking at your sample data though, this isn’t the case: in fact, there seems to be a mix of densities, though they repeat regularly.
It looks like your benchmark requires Linux perf counters, so must be run on a Linux baremetal install (doesn’t work in a Linux VM), which makes things a little tricky for me…
Well, wasn’t too far off what I had expected (I haven’t checked these for accuracy, so these are just rough indications). Figures are cycles per word:
simdjson: 22.867
basic: 33.117
switch: 31.389
switch4: 30.199
switch8: 24.309
switch16: 23.028
vcompressd: 15.322
Regarding the switch, I’d want to have a close look at how the compiler implements it. The compiler is free to turn a switch case into branching. I am surprised by the vcompressd results.
Would you share your code? Maybe as a GitHub gist?
Yes, I confirmed that the switch is compiling to a jump, not a series of branches.
The vcompressd method is fairly straightforward:
#ifdef __AVX512F__
static inline void vcompressd(uint32_t *base_ptr, uint32_t &base,
                              uint32_t idx, uint64_t bits) {
  base_ptr += base;
  base += hamming(bits);
  __m512i indicies = _mm512_add_epi32(_mm512_set1_epi32(idx),
      _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15));
  uint32_t* upper_ptr = base_ptr + _mm_popcnt_u32(bits);
  _mm512_mask_compressstoreu_epi32(base_ptr, bits, indicies);
  indicies = _mm512_add_epi32(indicies, _mm512_set1_epi32(16));
  // _mm_popcnt_u32 causes GCC to insert unnecessary MOVSX instructions (Clang not affected), so _mm_popcnt_u64 may be faster there
  base_ptr += _mm_popcnt_u32(bits & 0xffff);
  bits >>= 16;
  _mm512_mask_compressstoreu_epi32(base_ptr, bits, indicies);
  indicies = _mm512_add_epi32(indicies, _mm512_set1_epi32(16));
  bits >>= 16;
  _mm512_mask_compressstoreu_epi32(upper_ptr, bits, indicies);
  indicies = _mm512_add_epi32(indicies, _mm512_set1_epi32(16));
  upper_ptr += _mm_popcnt_u32(bits & 0xffff);
  bits >>= 16;
  _mm512_mask_compressstoreu_epi32(upper_ptr, bits, indicies);
  //_mm256_zeroupper(); // automatically inserted by compiler; otherwise problematic if inlined and unrolled
}
#endif
Thank you.
Just stumbled across this – I’ll try and do some experimenting with your code because it is an intriguing problem.
One small observation for targeting AArch64: given that there is neither a count-trailing-zeros instruction nor an instruction to clear the lowest set bit, repeating the construction:
base_ptr[X] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
bits = bits & (bits - 1);
requires at least 5 data-processing instructions: rbit, clz, add (idx + leading_zero_count), sub (subtract 1), and and (bits & (bits - 1)).
We probably should instead reverse bits once upfront.
Then we can explicitly count leading zeroes, and clear the most significant bit with 4 instructions:
clz, add (idx + leading_zero_count), asr (arithmetic shift right of INT64_MIN by leading_zero_count), bic (clear bits up to and including the most significant set bit)
corresponding to C code something like:
lz = leadingzeroes(rev_bits);
base_ptr[X] = static_cast<uint32_t>(idx) + lz;
rev_bits = rev_bits & ~(int64_t(0x8000000000000000) >> lz);
Not tested this myself yet.
Thank you.
Implemented as follows:
void basic_arm_decoder(uint32_t *base_ptr, uint32_t &base, uint32_t idx,
                       uint64_t bits) {
  uint64_t rev_bits;
  __asm volatile ("rbit %0, %1" : "=r" (rev_bits) : "r" (bits) );
  while (rev_bits != 0) {
    int lz = __builtin_clzll(rev_bits);
    base_ptr[base] = static_cast<uint32_t>(idx) + lz;
    rev_bits = rev_bits & ~(int64_t(0x8000000000000000) >> lz);
    base++;
  }
}
It does save quite a few instructions. Whether it is faster is less clear. On at least one ARM server I have, it is actually slower despite the lower instruction count.
See my code and benchmark at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/05/03/bitmapdecoding.cpp
You can make the code a bit faster on GCC 8 by using an alternative C construction. If you introduce an auxiliary “uint32_t* val = base_ptr + base” and then write each update as “*val++ = static_cast<uint32_t>(idx) + trailingzeroes(bits);”, the compiler emits slightly better machine code.
Before: instructions per cycle 2.57, cycles per value set: 3.797
After: instructions per cycle 2.45, cycles per value set: 3.428
10% for free, not that bad 🙂
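For reference, a minimal sketch of that construction (my reconstruction; I update base with a popcount up front, and use the compiler builtins that the benchmark’s trailingzeroes helper wraps):

#include <cstdint>

void basic_decoder_ptr(uint32_t *base_ptr, uint32_t &base, uint32_t idx, uint64_t bits) {
  uint32_t *val = base_ptr + base;                  // auxiliary running pointer
  base += (uint32_t)__builtin_popcountll(bits);     // final count known up front
  while (bits != 0) {
    *val++ = idx + (uint32_t)__builtin_ctzll(bits); // i.e., trailingzeroes(bits)
    bits = bits & (bits - 1);
  }
}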
You could do better by matching the number of bits processed to typical input data. Eg. unconditionally do 6 bits, do 4 more if there are still bits left, and then loop 2 bits at a time. This reduces branch misses as well as unnecessary work when there are no more set bits.
However the most obvious way to speed this up further is to delay the decoding until you actually need it. This avoids the overhead of expanding the bitmap into a much larger data structure (max expansion is 64 times) and all associated cache overheads.
Also the latency of computing the next bit will be completely hidden, unlike in these examples where the 2-cycle latency of x = x & (x - 1) is going to dominate (processing 2 64-bit masks in parallel may avoid this latency chain, but that makes things even more complex).
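A rough sketch of that last remark (mine, not Wilco’s code): decoding two words with independent dependency chains lets the core overlap the 2-cycle x & (x - 1) latency.

#include <cstdint>

void decode_two(uint32_t *out0, uint64_t bits0, uint32_t idx0,
                uint32_t *out1, uint64_t bits1, uint32_t idx1) {
  // Two independent chains: their latencies overlap instead of serializing.
  while (bits0 != 0 && bits1 != 0) {
    *out0++ = idx0 + (uint32_t)__builtin_ctzll(bits0);
    *out1++ = idx1 + (uint32_t)__builtin_ctzll(bits1);
    bits0 &= bits0 - 1;
    bits1 &= bits1 - 1;
  }
  while (bits0 != 0) { *out0++ = idx0 + (uint32_t)__builtin_ctzll(bits0); bits0 &= bits0 - 1; }
  while (bits1 != 0) { *out1++ = idx1 + (uint32_t)__builtin_ctzll(bits1); bits1 &= bits1 - 1; }
}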
Note on a modern AArch64 core the basic decoder is fastest by default since it has the minimum number of instructions per bit set. Doing extra work to reduce branch misses doesn’t pay off (branch misses are cheap on Arm cores due to their shallow pipelines). Changing faster_decoder to do 6 bits first, then 4 bits, and finally loop on 1 bit is ~10% faster than fast_decoder and 20% faster than simdjson_decoder.
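A rough sketch of that 6-then-4-then-loop shape (my own illustration, not Wilco’s actual change). It assumes the output buffer has the slack the simdjson-style decoders already rely on, so the unconditional stores past the true count are harmless, and it uses a zero-safe trailing-zero helper:

#include <cstdint>

static inline uint32_t tzcnt64(uint64_t x) {
  return x ? (uint32_t)__builtin_ctzll(x) : 64; // defined result for 0
}

void hybrid_decoder(uint32_t *base_ptr, uint32_t &base, uint32_t idx, uint64_t bits) {
  if (bits == 0) return;
  uint32_t *out = base_ptr + base;
  base += (uint32_t)__builtin_popcountll(bits); // true count, set up front
  for (int i = 0; i < 6; i++) {                 // unconditionally decode 6 bits
    out[i] = idx + tzcnt64(bits);
    bits &= bits - 1;
  }
  if (bits != 0) {
    for (int i = 6; i < 10; i++) {              // 4 more if anything is left
      out[i] = idx + tzcnt64(bits);
      bits &= bits - 1;
    }
    for (int i = 10; bits != 0; i++) {          // rare tail, one bit at a time
      out[i] = idx + tzcnt64(bits);
      bits &= bits - 1;
    }
  }
}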
Btw would it be reasonable to have a #if around #include <x86intrin.h> and evts.push_back(PERF_COUNT_HW_REF_CPU_CYCLES) so the code becomes more portable? That event fails on AArch64 kernels but this causes all other events to fail too…
Thanks… I did not know that PERF_COUNT_HW_REF_CPU_CYCLES would fail on AArch64, thanks for pointing this out.
Yes, evidently, the x86intrin header needs to be guarded with appropriate checks.
Note on a modern AArch64 core the basic decoder is fastest by default since it has the minimum number of instructions per bit set.
Sorry if I took some time to get back to you on this. I verified this result initially, on an older compiler, but after reading that the latest GNU GCC had improved code generation, I tried GNU GCC 9… The results appear to be equivalent to what I get on x64, thus apparently contradicting your statement…
I post my results as a MarkDown file…
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/05/03/RESULTS_AARCH64_AMPERE.md
Furthermore, you will find a dockerfile to ease reproduction.
Well you found a GCC9 bug! It incorrectly adds a useless popcount after the basic_decoder loop:
        mov     x0, x4
.L2758:
        rbit    x1, x0
        clz     x1, x1
        add     w1, w3, w1
        str     w1, [x19, w2, uxtw 2]
        sub     x1, x0, #1
        add     w2, w2, 1
        ands    x0, x0, x1
        bne     .L2758
        fmov    d0, x4
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        umov    w0, v0.b[0]
        add     w20, w20, w0
That adds extra latency and 6 instructions, slowing basic_decoder.
Interesting.
So I did additional testing.
I have updated the results in a markdown file in my repo (with the AMPERE string in the name). To ensure reproducibility, I have posted my scripts and Docker files.
It seems that LLVM 8 compiles basic_decoder to the following…
As for the basic_decoder function under GCC-9, it compiles to…
It is hard for me to understand your concern with the 2-cycle dependency. We are using, in the best case, 8.5 cycles per set bit… far more than 2 cycles.
Note that even in the best scenario, we are a full 2x the number of cycles that an Intel processor needs. That’s not good.
What is troubling is that the basic_decoder runs at 1 instruction per cycle or less. So there is contention for something and it is not strictly just data dependency…
Is this the best aarch64 can do for a population count?
https://godbolt.org/z/7KtXbC
This looks expensive… recent AMD x64 processors do the same with one instruction that can be executed 4 times per cycle.
Yes that’s the right sequence given there is no integer instruction. So it’s important not to use popcount on AArch64 as if it is really cheap.
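For reference, that sequence expressed with NEON intrinsics (a sketch of what __builtin_popcountll lowers to on AArch64; the helper name is mine):

#include <arm_neon.h>
#include <stdint.h>

static inline uint32_t popcount64_neon(uint64_t x) {
  uint8x8_t v = vcnt_u8(vcreate_u8(x)); // fmov + cnt: per-byte popcount in a vector register
  return vaddv_u8(v);                   // addv + umov: horizontal sum back to a GPR
}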
I get less than 30 cycles per word on basic_decoder, which is faster than the x86 core you used. It may well be that branch prediction is the main performance limiter for sparse bitsets like this. So I still believe it’s fastest to decode on demand rather than like this.
On a different processor, I presume? Because I’d be pretty happy with 30 cycles per word … but I clearly do not get close to this (all my results are posted in my repo).