8th March 2018, 3 min read

Iterating over set bits quickly (SIMD edition)

Suppose that you have a long sequence of bits 10101011100000… you want to visit all the bits set to 1. That is, given 10101011100000, you would like to get the indexes of all the bits set to one: 0,2,4,6,7,8,9.

In a recent blog post, I reviewed fast techniques to iterate over the position of the bits set to 1 in such a bit stream. A fast function in C to solve the problem makes use of the trailing-zero instruction found in recent x64 processors and generated by the __builtin_ctzll intrinsic in several compilers such as LLVM’s clang and GNU gcc. The trailing-zero instruction gives us the number of consecutive zeroes present in a word starting from the least significant bit (so that odd integers have a number of trailing zeros equal to 0). The code is relatively elegant:

size_t bitmap_decode_ctz(uint64_t *bitmap, 
  size_t bitmapsize, 
  uint32_t *out) {
  size_t pos = 0;
  uint64_t bitset;
  for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    while (bitset != 0) {
      uint64_t t = bitset & -bitset;
      int r = __builtin_ctzll(bitset);
      out[pos++] = k * 64 + r;
      bitset ^= t;
    }
  }
  return pos;
}

In the comments, powturbo (it is a pseudonym) remarked that you could solve the same problem using SIMD instructions. SIMD instructions operate on vectors of words and are often faster. I have not looked at powturbo’s code, but I happen to have an implementation of my own, inherited from my friend Nathan Kurz. It basically processes bytes one by one, looking up the indexes in a table (8kB). Omitting the table, the code is just a bit longer than the trailing-zero version:

int bitmap_decode_avx2(uint64_t * array, 
  size_t sizeinwords, uint32_t *out) {
 uint32_t *initout = out;
 __m256i baseVec = _mm256_set1_epi32(-1);
 __m256i incVec = _mm256_set1_epi32(64);
 __m256i add8 = _mm256_set1_epi32(8);

 for (int i = 0; i < sizeinwords; ++i) {
  uint64_t w = array[i];
  if (w == 0) {
   baseVec = _mm256_add_epi32(baseVec, incVec);
  } else {
   for (int k = 0; k < 4; ++k) {
    uint8_t byteA = (uint8_t) w;
    uint8_t byteB = (uint8_t)(w >> 8);
    w >>= 16;
    __m256i vecA = _mm256_load_si256(vt[byteA]);
    __m256i vecB = _mm256_load_si256(vt[byteB]);
    uint8_t advanceA = lengthTable[byteA];
    uint8_t advanceB = lengthTable[byteB];
    vecA = _mm256_add_epi32(baseVec, vecA);
    baseVec = _mm256_add_epi32(baseVec, add8);
    vecB = _mm256_add_epi32(baseVec, vecB);
    baseVec = _mm256_add_epi32(baseVec, add8);
    _mm256_storeu_si256((__m256i *) out, vecA);
    out += advanceA;
    _mm256_storeu_si256((__m256i *) out, vecB);
    out += advanceB;
   }
  }
 }
 return out - initout;
}

(At the expense of some performance, you can use a smaller 2kB table.)

As usual, it looks like gibberish if you do not know about Intel intrinsics, but trust me: this code does nothing complicated. It is basically just memoization.

So how does the vectorized version (using SIMD instructions) does against the trailing-zero version? Here are approximate timings on a Skylake processor:

density	trailing-zero	SIMD
0.0625	~6.5 cycles per int	~6 cycles per int
0.125	~5 cycles per int	~3 cycles per int
0.25	~4 cycles per int	~2 cycles per int
0.5	~3 cycles per int	~1.25 cycles per int
0.9	~3 cycles per int	~0.4 cycles per int

For dense bitstreams, my timings are accurate, but they grow more and more inaccurate for sparser bitstreams. Nevertheless, the pattern is clear: if you have dense bitstreams, the SIMD code is about twice as fast. The gains are higher when the bitmaps are very dense. However, the gains disappear when the bit stream is sparse… I should point out that these timings are not directly comparable with those from my earlier blog post because I write the indexes to an array instead of calling a function for each index. Thus I am currently solving a slightly easier problem.

My code is available.

Note: The “compress” instructions in AVX-512 (e.g., vpcompressd) could make this problem easier. Sadly, AVX-512 instructions are not yet widely available on commodity processors.