Daniel Lemire's blog


Really fast bitset decoding for “average” densities

35 thoughts on “Really fast bitset decoding for “average” densities”

  1. wmu says:

    You can make the code a bit faster on GCC 8 by using an alternative C++ construction. If you introduce an auxiliary “uint32_t* val = base_ptr + base”, then each update becomes “*val++ = static_cast<uint32_t>(idx) + trailingzeroes(bits);” and the compiler emits slightly better machine code.

    Before: instructions per cycle 2.57, cycles per value set: 3.797
    After: instructions per cycle 2.45, cycles per value set: 3.428

    10% for free, not that bad 🙂
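
    For concreteness, here is a minimal sketch of the suggested rewrite (the decoder signature and the trailingzeroes helper follow the blog's code; the write-back of base at the end is an assumption):

    void basic_decoder_ptr(uint32_t *base_ptr, uint32_t &base, uint32_t idx,
                           uint64_t bits) {
      uint32_t *val = base_ptr + base;  // auxiliary pointer suggested above
      while (bits != 0) {
        *val++ = static_cast<uint32_t>(idx) + trailingzeroes(bits);
        bits = bits & (bits - 1);
      }
      base = static_cast<uint32_t>(val - base_ptr);  // write the new count back
    }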

  2. Wilco says:

    You could do better by matching the number of bits processed to typical input data. E.g., unconditionally do 6 bits, do 4 more if there are still bits left, and then loop 2 bits at a time (sketched below). This reduces branch misses as well as unnecessary work when there are no more set bits.

    However the most obvious way to speed this up further is to delay the decoding until you actually need it. This avoids the overhead of expanding the bitmap into a much larger data structure (max expansion is 64 times) and all associated cache overheads.

    Also the latency of computing the next bit will be completely hidden, unlike in these examples where the 2-cycle latency of x = x & (x - 1) is going to dominate (processing 2 64-bit masks in parallel may avoid this latency chain, but that makes things even more complex).
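
    A minimal sketch of the staged idea, reusing the blog's helper names (hamming, trailingzeroes) and decoder signature; the 6/4/2 split is the one suggested above, and, as in the simdjson decoder, writes past the true count are scratch and are overwritten when later words are decoded (the output buffer is assumed to have enough slack):

    void staged_decoder(uint32_t *base_ptr, uint32_t &base, uint32_t idx,
                        uint64_t bits) {
      uint32_t cnt = hamming(bits);            // popcount of the word
      uint32_t next_base = base + cnt;
      for (int i = 0; i < 6; i++) {            // unconditionally decode 6 positions
        base_ptr[base + i] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
        bits = bits & (bits - 1);
      }
      if (bits != 0) {
        for (int i = 6; i < 10; i++) {         // 4 more if any bits remain
          base_ptr[base + i] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
          bits = bits & (bits - 1);
        }
        for (int i = 10; bits != 0; i += 2) {  // then 2 at a time
          base_ptr[base + i] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
          bits = bits & (bits - 1);
          base_ptr[base + i + 1] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
          bits = bits & (bits - 1);
        }
      }
      base = next_base;                        // only the true count is kept
    }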

    1. Wilco says:

      Note on a modern AArch64 core the basic decoder is fastest by default since it has the minimum number of instructions per bit set. The reduction in branch misses (which are cheap on Arm cores due to shallow pipelines) does not offset the extra work. Changing faster_decoder to do 6 bits first, then 4 bits and finally loop on 1 bit is ~10% faster than fast_decoder and 20% faster than simdjson_decoder.

      Btw would it be reasonable to have a #if around #include <x86intrin.h> and evts.push_back(PERF_COUNT_HW_REF_CPU_CYCLES) so the code becomes more portable? That event fails on AArch64 kernels but this causes all other events to fail too…

      1. Thanks… I did not know that PERF_COUNT_HW_REF_CPU_CYCLES would fail on AArch64, thanks for pointing this out.

        Yes, evidently, the x86intrin header needs to be guarded with appropriate checks.
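
        A minimal sketch of the guards in question (the architecture macros are the usual GCC/Clang ones; evts and the event constants come from the benchmark's Linux-perf harness, assumed here to be a std::vector<int>):

        #if defined(__x86_64__) || defined(__i386__)
        #include <x86intrin.h>
        #endif

        // ... when setting up the performance counters ...
        std::vector<int> evts;
        evts.push_back(PERF_COUNT_HW_CPU_CYCLES);
        evts.push_back(PERF_COUNT_HW_INSTRUCTIONS);
        #if defined(__x86_64__) || defined(__i386__)
        // this event fails on AArch64 kernels and takes the whole group down with it
        evts.push_back(PERF_COUNT_HW_REF_CPU_CYCLES);
        #endif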

      2. Note on a modern AArch64 core the basic decoder is fastest by default since it has the minimum number of instructions per bit set.

        Sorry it took me some time to get back to you on this. I verified this result initially, on an older compiler, but after reading that the latest GNU GCC had improved code generation, I tried GNU GCC 9… The results appear to be equivalent to what I get on x64, thus apparently contradicting your statement…

        I posted my results as a Markdown file…

        https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/05/03/RESULTS_AARCH64_AMPERE.md

        Furthermore, you will find a Dockerfile to ease reproduction.

        1. Wilco says:

          Well you found a GCC9 bug! It incorrectly adds a useless popcount after the basic_decoder loop:

          mov x0, x4
          .L2758:
          rbit x1, x0
          clz x1, x1
          add w1, w3, w1
          str w1, [x19, w2, uxtw 2]
          sub x1, x0, #1
          add w2, w2, 1
          ands x0, x0, x1
          bne .L2758
          fmov d0, x4
          cnt v0.8b, v0.8b
          addv b0, v0.8b
          umov w0, v0.b[0]
          add w20, w20, w0

          That adds extra latency and 6 instructions, slowing basic_decoder.

          1. Interesting.

            1. So I did additional testing.

              fast_decoder:
              GCC8 => 8.8 cycles
              CLANG8 => 8.7 cycles
              GCC9 => 8.7 cycles

              simdjson_decoder:
              GCC8 => 12.6 cycles
              CLANG8 => 9.6 cycles
              GCC9 => 9.7 cycles

              basic_decoder:
              GCC8 => 8.5 cycles
              CLANG8 => 11.8 cycles
              GCC9 => 11.8 cycles

              I have updated the results in a Markdown file in my repo (with the AMPERE string in the name). To ensure reproducibility, I have posted my scripts and Docker files.

              It seems that LLVM 8 compiles basic_decoder to the following…

              .globl _Z13basic_decoderPjRjjm // — Begin function _Z13basic_decoderPjRjjm
              .p2align 2
              .type _Z13basic_decoderPjRjjm,@function
              _Z13basic_decoderPjRjjm: // @_Z13basic_decoderPjRjjm
              // %bb.0:
              cbz x3, .LBB4_3
              // %bb.1:
              ldr w8, [x1]
              .LBB4_2: // =>This Inner Loop Header: Depth=1
              rbit x9, x3
              clz x9, x9
              add w9, w9, w2
              str w9, [x0, w8, uxtw #2]
              ldr w8, [x1]
              sub x9, x3, #1 // =1
              ands x3, x9, x3
              add w8, w8, #1 // =1
              str w8, [x1]
              b.ne .LBB4_2
              .LBB4_3:
              ret
              .Lfunc_end4:
              .size _Z13basic_decoderPjRjjm, .Lfunc_end4-_Z13basic_decoderPjRjjm
              // — End function

              As for the basic_decoder function under GCC-9, it compiles to…

              .global _Z13basic_decoderPjRjjm
              .type _Z13basic_decoderPjRjjm, %function
              _Z13basic_decoderPjRjjm:
              .LFB1985:
              .cfi_startproc
              // bitmapdecoding.cpp:86: while (bits != 0) {
              cbz x3, .L2222 // bits,
              ldr w4, [x1] //, *base_15(D)
              .p2align 3,,7
              .L2225:
              // bitmapdecoding.cpp:26: return __builtin_ctzll(input_num);
              rbit x5, x3 // tmp102, bits
              clz x5, x5 // tmp102, tmp102
              // bitmapdecoding.cpp:87: base_ptr[base] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
              add w5, w2, w5 // tmp105, idx, tmp102
              // bitmapdecoding.cpp:87: base_ptr[base] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
              str w5, [x0, w4, uxtw 2] // tmp105, *_5
              // bitmapdecoding.cpp:88: bits = bits & (bits - 1);
              sub x4, x3, #1 // _7, bits,
              // bitmapdecoding.cpp:86: while (bits != 0) {
              ands x3, x3, x4 // bits, bits, _7
              // bitmapdecoding.cpp:89: base++;
              ldr w4, [x1] //, *base_15(D)
              add w4, w4, 1 // _9, *base_15(D),
              str w4, [x1] // _9, *base_15(D)
              // bitmapdecoding.cpp:86: while (bits != 0) {
              bne .L2225 //,
              .L2222:
              // bitmapdecoding.cpp:91: }
              ret
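
              For reference, the loop both compilers are translating above (reconstructed from the -fverbose-asm source annotations; see the linked repository for the exact code) is essentially:

              void basic_decoder(uint32_t *base_ptr, uint32_t &base, uint32_t idx,
                                 uint64_t bits) {
                while (bits != 0) {
                  base_ptr[base] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
                  bits = bits & (bits - 1);
                  base++;
                }
              }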

              1. It is hard for me to understand your concern with the 2-cycle dependency. We are using, in the best case, 8.5 cycles per set bit… far more than 2 cycles.

                Note that even in the best scenario, we need a full 2x the number of cycles that an Intel processor needs. That’s not good.

                What is troubling is that the basic_decoder runs at 1 instruction per cycle or less. So there is contention for something and it is not strictly just a data dependency…

                1. Is this the best aarch64 can do for a population count?

                  https://godbolt.org/z/7KtXbC

                  This looks expensive… recent AMD x64 processors do the same with one instruction that can be executed 4 times per cycle.

                  1. Wilco says:

                    Yes, that’s the right sequence given there is no integer popcount instruction. So it’s important not to use popcount on AArch64 as if it were really cheap.

                    I get less than 30 cycles per word on basic_decoder, which is faster than the x86 core you used. It may well be that branch prediction is the main performance limiter for sparse bitsets like this. So I still believe it’s fastest to decode on demand rather than like this.

                    1. I get less than 30 cycles per word on basic_decoder

                      On a different processor, I presume? Because I’d be pretty happy with 30 cycles per word … but I clearly do not get close to this (all my results are posted in my repo).

  • Wilco says:

    Yes, definitely. It seems Cortex cores can predict this better; for example, the Cortex-A72 has fewer than 13K mispredictions for all tests, while your results show over 20K mispredictions on x86 and Ampere.

  • @Wilco

    It took me some time, but I built an ARM box of my own.

    I confirm that the Cortex-A72 is close to Intel Skylake as far as cycles and mispredicted branches are concerned. Sometimes Intel does better, sometimes the A72 does better. The differences are not large from what I see (that’s what you’d expect from competitive technologies).

    The Cortex-A72 is clearly superior to Ampere’s Skylark on this benchmark.

    It is still inferior to Intel in the end because it does not appear to be able to sustain as many instructions per cycle…

  • Wilco says:

    So you can reproduce the slowdown of basic_decoder. Note you’re quoting the non-inline version of basic_decoder rather than the main loop where it is inlined, which is where GCC9 adds the redundant popcount.

    1. @Wilco

      I have run the test on Apple’s A12 using the current version of Xcode…

      https://github.com/lemire/iosbitmapdecoding

      The numbers do not make much sense: they would imply that Apple’s A12 is highly inefficient.

      1. The opposite is true: the A12 is highly efficient on a per-cycle basis: https://lemire.me/blog/2019/05/15/bitset-decoding-on-apples-a12/

  • Wilco says:
    1. Excellent. We need good code generation.

  • Jörn Engel says:

    A similar problem comes up when looking for certain bytes or patterns. The best approach is to use vector comparisons and movemask, resulting in bitmaps and, essentially, your problem.

    One idea I have yet to try is to sort bitmap-words based on popcount. Processing the sorted bitmap-words can completely eliminate the mispredicted branches. You have to do more memory operations, which will partially offset the misprediction savings. But more importantly the algorithm gets rather complicated, so you are trading your own time and sanity for runtime performance. So far I couldn’t bring myself to make that trade.
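
    A rough sketch of the bucketing idea (hypothetical function, assuming out-of-order output is acceptable; the popcount/ctz builtins stand in for the blog's hamming/trailingzeroes helpers):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Pass 1: bucket word positions by popcount.
    // Pass 2: decode each bucket with a fixed trip count, so the
    // "any bits left?" branch of the inner loop never mispredicts.
    void decode_grouped_by_popcount(const uint64_t *words, size_t nwords,
                                    std::vector<uint32_t> &out) {
      std::vector<uint32_t> buckets[65];
      for (size_t i = 0; i < nwords; i++) {
        int c = __builtin_popcountll(words[i]);
        if (c != 0) buckets[c].push_back(static_cast<uint32_t>(i));
      }
      for (int c = 1; c <= 64; c++) {
        for (uint32_t w : buckets[c]) {
          uint64_t bits = words[w];
          for (int k = 0; k < c; k++) {  // fixed trip count within a bucket
            out.push_back(w * 64 + __builtin_ctzll(bits));
            bits &= bits - 1;
          }
        }
      }
    }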

    1. Travis Downs says:

      I have tried it and at least for my use cases it didn’t turn out faster. The sorting ate up my savings (I used a type of radix sort which I think was basically ideal for this scenario).

      However, as I recall my baseline was quite a bit faster than the ~4 cycles shown here: I think it was closer to 1.5 cycles.

      1. @Travis @Jörn

        I did implement what you describe. It is in the code but not covered by the blog post…

        https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/05/03/bitmapdecoding.cpp#L304

  • Zach Bjornson says:

    This can be done branchlessly with AVX2+BMI2 if you use the technique described in https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask/36951611#36951611, using an array of sequential numbers as the src to the “compress” function.

    With AVX512 you can do an entire 64-bit integer at once with a single instruction,
    vpcompressb (_mm512_mask_compressstoreu_epi8(dest, bitset, rangeOf0Through64)).

    1. Zach Bjornson says:

      (With AVX512_VBMI2*, not AVX512F.)

      And now I see something like this mentioned in one of the further reading links :).

    2. I think that I allude to the vectorization option in my post, with the important caveat:

      If the words you are receiving have few 1-bits (say less than one 1-bit per 64-bit word), then you have the sparse regime, and it becomes important to detect zero inputs quickly, for example. If half of your bits are set, you have the dense regime and it is best handled using vectorization and lookup tables.

      So I think vector instructions will have trouble coping with the kind of data I am considering in this post.

      This is not to say that you cannot do it “branchlessly”: you can. You can bring down branch misses to almost zero. The trick is to do that while also not increasing the instruction count too much.

  • -.- says:

    How well does a switch statement work? e.g.

    int count = popcnt(word);
    int* resultEnd = result + count;
    switch(count) {
    case 64: case 63: // ...etc
      for(int i = -count + 4; i; i++) {
        resultEnd[i-4] = trailingzeroes(word);
        word &= word-1;
      }
    case 4: // add more cases if desired
      resultEnd[-4] = trailingzeroes(word);
      word &= word-1;
    case 3:
      resultEnd[-3] = trailingzeroes(word);
      word &= word-1;
    case 2:
      resultEnd[-2] = trailingzeroes(word);
      word &= word-1;
    case 1:
      resultEnd[-1] = trailingzeroes(word);
    case 0:
      break;
    }

    This could reduce “wasted” actions. If the exact ordering doesn’t matter, you could also skip the resultEnd calculation.

    Also, the initial branch may be expensive in this case if the number of bits set isn’t always the same. You could probably do some hybrid approach where you do a more coarse grained switch if the number of bits set happens to often fall within a particular range, e.g.:

    int count4 = popcnt(word) & -4;
    // example for granularity = 4, change values if more appropriate
    int* resultEnd = result + count4;
    switch(count4) {
    case 60: case 56: // ...etc
      // do some loop
    case 8:
      resultEnd[-8] = trailingzeroes(word);
      word &= word-1;
      resultEnd[-7] = trailingzeroes(word);
      word &= word-1;
      resultEnd[-6] = trailingzeroes(word);
      word &= word-1;
      resultEnd[-5] = trailingzeroes(word);
      word &= word-1;
    case 4:
      resultEnd[-4] = trailingzeroes(word);
      word &= word-1;
      resultEnd[-3] = trailingzeroes(word);
      word &= word-1;
      resultEnd[-2] = trailingzeroes(word);
      word &= word-1;
      resultEnd[-1] = trailingzeroes(word);
      word &= word-1;
    case 0:
      resultEnd[0] = trailingzeroes(word);
      word &= word-1;
      resultEnd[1] = trailingzeroes(word);
      word &= word-1;
      resultEnd[2] = trailingzeroes(word);
    }

    Although this approach does make it much closer to what you already have.

    Depending on the bits set, a fast SIMD+lookup approach may still be reasonable even if there’s a smallish number of bits typically set, e.g.

    int wordPart16 = (word & 0xffff) << 4;
    _mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
    result += _mm_popcnt_u32(wordPart16);
    wordPart16 = (word >> 12) & 0xffff0;
    _mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
    result += _mm_popcnt_u32(wordPart16);
    wordPart16 = (word >> 28) & 0xffff0;
    _mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
    result += _mm_popcnt_u32(wordPart16);
    wordPart16 = (word >> 44) & 0xffff0;
    _mm_storeu_si128((__m128i*)result, _mm_load_si128((const __m128i*)((char*)table + wordPart16)));
    //result += _mm_popcnt_u32(wordPart16); // if needed

    This may still be competitive if there’s only 6 bits per 64-bit word.

    1. I have tried both of these techniques. The vectorized decoding is already part of CRoaring https://github.com/RoaringBitmap/CRoaring/blob/master/src/bitset_util.c#L553

      It works and this library is in production in some large corporations.

      However, I have not managed to make it competitive in the average-density scenario.

      I tried the switch-case approach, but I could not make it run faster, though I must admit that I did not do much more than code some C and record timings.

      My benchmark is rather simple: have you tried modifying it to test your ideas? Get in touch if you get promising results.

      1. -.- says:

        Ah, you’re using 32-bit indices; that will seriously hurt a SIMD approach.
        On the flip side, if AVX512 is usable, you can use the VCOMPRESSD technique instead, which should perform very well.

        I suspect the naive switch statement will perform better than the naive loop approach. The performance of the indirect jump could be problematic if there’s any unpredictable variation in the number of bits set.
        The coarse-grained switch copes better with variation in the number of bits set, but is more similar to the simdjson decoder. Performance ultimately depends on the density characteristics, I think; simdjson is probably slightly better if <8 bits are usually set, as it can avoid a jump.

        If it’s rare that there will be more than 1 bit set per 4, one could break the word into nibbles and use PSHUFB to obtain indices, which could perform okay. Looking at your sample data though, this isn’t the case; in fact, there seems to be a mix of densities, though they repeat regularly.

        It looks like your benchmark requires Linux perf counters, so it must be run on a bare-metal Linux install (it doesn’t work in a Linux VM), which makes things a little tricky for me…

        1. -.- says:

          Well, it wasn’t too far off what I had expected (I haven’t checked these for accuracy, so these are just rough indications). Figures are cycles per word:

          simdjson: 22.867
          basic: 33.117
          switch: 31.389
          switch4: 30.199
          switch8: 24.309
          switch16: 23.028
          vcompressd: 15.322

          1. Regarding the switch, I’d want to have a close look at how the compiler implements it. The compiler is free to turn a switch case into branching. I am surprised by the vcompressd results.

            Would you share your code? Maybe as a GitHub gist?

            1. -.- says:

              Yes, I confirmed that the switch is compiling to a jump, not a series of branches.

              The vcompressd method is fairly straightforward:

              #ifdef __AVX512F__
              static inline void vcompressd(uint32_t *base_ptr, uint32_t &base,
                                            uint32_t idx, uint64_t bits) {
                base_ptr += base;
                base += hamming(bits);

                __m512i indices = _mm512_add_epi32(_mm512_set1_epi32(idx),
                    _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15));
                uint32_t* upper_ptr = base_ptr + _mm_popcnt_u32(bits);

                _mm512_mask_compressstoreu_epi32(base_ptr, bits, indices);
                indices = _mm512_add_epi32(indices, _mm512_set1_epi32(16));
                // _mm_popcnt_u32 causes GCC to insert unnecessary MOVSX instructions (Clang not affected), so _mm_popcnt_u64 may be faster there
                base_ptr += _mm_popcnt_u32(bits & 0xffff);
                bits >>= 16;
                _mm512_mask_compressstoreu_epi32(base_ptr, bits, indices);
                indices = _mm512_add_epi32(indices, _mm512_set1_epi32(16));

                bits >>= 16;
                _mm512_mask_compressstoreu_epi32(upper_ptr, bits, indices);
                indices = _mm512_add_epi32(indices, _mm512_set1_epi32(16));
                upper_ptr += _mm_popcnt_u32(bits & 0xffff);
                bits >>= 16;
                _mm512_mask_compressstoreu_epi32(upper_ptr, bits, indices);

                //_mm256_zeroupper(); // automatically inserted by compiler; otherwise problematic if inlined and unrolled
              }
              #endif

              1. Thank you.

  • Samuel Lee says:

    Just stumbled across this – I’ll try and do some experimenting with your code because it is an intriguing problem.

    One small observation for targeting AArch64 would be that, given there is neither a count-trailing-zeros instruction nor an instruction to clear the lowest set bit, repeating the construction:

    base_ptr[X] = static_cast<uint32_t>(idx) + trailingzeroes(bits);
    bits = bits & (bits - 1);

    requires at least 5 data processing instructions:

    rbit, clz, add (idx + leading_zero_count), sub (to compute bits - 1), and and (to apply the mask).

    We probably should instead reverse bits once upfront.
    Then we can explicitly count leading zeroes, and clear the most significant bit with 4 instructions:

    clz, add (idx + leading_zero_count), asr (arithmetic shift right of INT64_MIN by leading_zero_count), bic (clear bits up to and including the most significant set bit)

    corresponding to C code something like:

    lz = leadingzeroes(rev_bits);
    base_ptr[X] = static_cast<uint32_t>(idx) + lz;
    rev_bits = rev_bits & ~(int64_t(0x8000000000000000) >> lz);

    Not tested this myself yet.

    1. Thank you.

      1. Implemented as follows:

        void basic_arm_decoder(uint32_t *base_ptr, uint32_t &base, uint32_t idx,
                               uint64_t bits) {
          uint64_t rev_bits;
          __asm volatile ("rbit %0, %1" : "=r" (rev_bits) : "r" (bits) );
          while (rev_bits != 0) {
            int lz = __builtin_clzll(rev_bits);
            base_ptr[base] = static_cast<uint32_t>(idx) + lz;
            rev_bits = rev_bits & ~(int64_t(0x8000000000000000) >> lz);
            base++;
          }
        }
        

        It does save quite a few instructions. Whether it is faster is less clear. On at least one ARM server I have, it is actually slower despite the lower instruction count.

        See my code and benchmark at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/05/03/bitmapdecoding.cpp