
# Iterating over set bits quickly

A common problem in my line of work is to iterate over the set bits (bits having value 1) in a large array.

My standard approach involves a “count trailing zeroes” function. Given an integer, this function counts how many consecutive zero bits it has, starting from the least significant bit. Any odd integer has no “trailing zero”. Any even integer has at least one “trailing zero”, and so forth. Many compilers, such as LLVM’s clang and GNU GCC, have an intrinsic called `__builtin_ctzl` for this purpose. There are equivalent standard functions in Java (`numberOfTrailingZeros`), Go and so forth. There is a whole Wikipedia page dedicated to these functions, but recent x64 processors have a fast dedicated instruction (`tzcnt`), so the implementation is taken care of at the processor level.

The following function will call the function “callback” with the index of each set bit:

```c
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    while (bitset != 0) {
        uint64_t t = bitset & -bitset;  // isolate the lowest set bit
        int r = __builtin_ctzl(bitset); // index of that bit within the word
        callback(k * 64 + r);
        bitset ^= t;                    // clear the lowest set bit
    }
}
```

The trick is that `bitset & -bitset` returns an integer having just the least significant set bit of `bitset` turned on; all other bits are off. With this observation, you should be able to figure out why the routine works.

Note that your compiler can probably compile `bitset & -bitset` down to a single instruction (`blsi`) on recent x64 processors. Java has an equivalent function called `lowestOneBit`.

If you are in a rush, that’s probably not how you’d program it. You would probably iterate through all bits, in this manner:

```c
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    size_t p = k * 64;
    while (bitset != 0) {
        if (bitset & 0x1) {
            callback(p); // bit at index p is set
        }
        bitset >>= 1;    // move on to the next bit
        p += 1;
    }
}
```

Which is faster?

Obviously, you have to make sure that your code compiles down to the fast x64 trailing-zero instruction. If you do, then the trailing-zero approach is much faster.

I designed a benchmark where the callback function just adds the indexes together. The speed per decoded index depends on the density (the fraction of set bits). I ran my benchmark on a Skylake processor:

density | trailing-zero | naive
---|---|---
0.125 | ~5 cycles per int | 40 cycles per int
0.25 | ~3.5 cycles per int | 30 cycles per int
0.5 | ~2.6 cycles per int | 23 cycles per int

Thus using a fast trailing-zero function is about ten times faster.

**Credit**: This post was inspired by Wojciech Mula.