Daniel Lemire's blog

, 2 min read

Even faster bitmap decoding

Bitmaps are a simple data structure used to represent sets of integers. For example, you can represent all sets of integers in [0,64) using a single 64-bit integer. When they are applicable, bitmaps are very efficient compared to the alternatives (e.g., a hash set).

Unfortunately, extracting the bit sets in a bitmap can be expensive. Suppose I give you the integer 27 (written as 0b011011 in binary notation), you would want to recover the integers 0, 1, 3 and 4. Of course, you can check the value of each bit (by computing v & (1<<bit)) but this can be atrociously slow. To do it faster, you can use the fact that you can quickly find the least significant bit set:

int pos = 0;
for(int k = 0; k < bitmaps.length; ++k) {  
  long data = bitmaps[k];
  while (data != 0) {
    int ntz = Long.numberOfTrailingZeros(data);
    output[pos++] = k * 64 + ntz;
    data ^= (1l << ntz);
  }
}

In C or C++, the call to numberOfTrailingZeros can be mapped directly some assembly instructions (e.g., bit scan forward or BSF). Though it looks like numberOfTrailingZeros is implemented using several branches in Java, the compiler is smart enough to compile it down similar machine instructions. (Note: thanks to Erich Schubert for pointing out that Java is so smart.)

Up until recently, I did not think we could do better than relying on Long.numberOfTrailingZeros, but I stumbled on a blog post by Erling Ellingsen where he remarked that the function Long.bitCount (reporting the number of bit sets in a word) was essentially branch-free. (As it turns out, Java also converts bitCount to efficient machine instructions like it does for numberOfTrailingZeros.) This suggests an alternative to decode the set bits from a bitmap:

int pos = 0;
for(int k = 0; k < bitmaps.length; ++k) {
   long bitset = bitmaps[k];
   while (bitset != 0) {
     long t = bitset & -bitset;
     output[pos++] = k * 64 +  Long.bitCount(t-1);
     bitset ^= t;
   }
}

Experimentally, I find that the approach based on Long.bitCount can be 10% slower when there are few bit sets. However, it can also be significantly faster (e.g., 30% or even 70% faster) in some instances. Here are the decoding speeds in millions of integers per second on a recent Intel core i7 processor (in Java);

 Density   Naive   Long.numberOfTrailingZeros   Long.bitCount 
1/64 15 143 131
2/64 27 196 202
4/64 45 213 252
8/64 68 261 350
16/64 90 281 431
32/64 115 285 480

My code is available.

On the whole, the approach based on Long.bitCount is probably better unless you expect very sparse bitmaps.

Update: I have created a C++ version and in this case, both techniques have the same speed. So Java particularly loves bitCount for this problem.