Daniel Lemire's blog

, 6 min read

Fast unary decoding

9 thoughts on “Fast unary decoding”

  1. JS says:

    See also __builtin_ctzll, _mm_tzcnt_64 and friends.

  2. KWillets says:

    This looks like some fun bit-twiddling.

    For the table-based method, would it work if for each byte you just store the lsb position and the byte without the lsb?

    Then you just look up the lsb position, replace the looked-up operand with its lsb-less version, and iterate until 0.

  3. @KWillets

    Part of my motivation for posting this here is to encourage others to try their own designs. There are many other strategies.

    I think that your approach would use less memory, but I think it would be slower.

  4. @JS

    I think that the instruction you want is popcnt with the corresponding _mm_popcnt_u64 intrinsic. I believe that other instructions are likely to give you slower code.

    Update: This conjecture was false.

  5. Anonymous says:

    @Daniel,

    actually, you may be wrong. I tried the suggested function and it was always faster. On my laptop the difference is small, but it is more dramatic on fastpfor for some reason.

    Try this implementation:

    int bitscanunary_ctzl(long *bitmap, int bitmapsize, int *out) {
    int pos = 0;
    int val = 0, newval = 0;
    for(int k = 0; k < bitmapsize; ++k) {
    unsigned long bitset = bitmap[k];
    while (bitset != 0) {
    long t = bitset & -bitset;
    int r = __builtin_ctzl(bitset);
    newval = k * 64 + r;
    out[pos++] = newval – val;
    val = newval;
    bitset ^= t;
    }
    }
    return pos;
    }

  6. @Anonymous

    Indeed, the approach __builtin_ctzl can be faster. I was wrong. I have updated my blog post.

  7. Bulat Ziganshin says:

    I think that any possible implementation would stall a lot due to instruction dependencies. by running multiple decoders in parallel (in the same OS thread) you can make it much faster. the same may apply to your turbopfor library

  8. Stan says:

    The graph seems strange – I expect that the lookup table approach will take a constant time regardless of the number of set bits in a byte.

    1. Decoding speed does depend on the density. Why would it not?

      This being said, the code is available. I invite you to review the code and run it for yourself if you’d like.