KWillets says:

For the lab, why not just use a table for the rightmost 8 bits or so? Something like

while (val) {
    lsb = table[val & 255];
    if (lsb < 9) {
        output lsb - 1;
        val >>= lsb;
    } else {
        val >>= 8;
    }
}

Basically, the table stores 1 + the position of the lowest set bit in the byte, with 9 for the 0x00 edge case (maybe some other test would work better for that). You only waste a loop iteration 1/256th of the time.
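In Java, for consistency with the rest of the thread, the byte-table idea might be sketched as follows. This is my own sketch, not KWillets's exact code; the class and method names are made up, and I track the shift offset explicitly:

```java
public class ByteTable {
    // TABLE[b] = 1 + index of the lowest set bit of byte b; 9 for b == 0
    static final int[] TABLE = new int[256];
    static {
        TABLE[0] = 9;
        for (int b = 1; b < 256; b++) {
            TABLE[b] = 1 + Integer.numberOfTrailingZeros(b);
        }
    }

    // Index of the lowest set bit of v, or -1 if v == 0.
    // Only one byte value in 256 is all zero, so the "wasted" branch is rare.
    static int lowestSetBit(long v) {
        int shift = 0;
        while (v != 0) {
            int lsb = TABLE[(int) (v & 255)];
            if (lsb < 9) {
                return shift + lsb - 1;
            }
            v >>>= 8;   // wasted iteration: the whole byte was zero
            shift += 8;
        }
        return -1;
    }
}
```

To enumerate all set bits rather than just the first, one would shift past the reported bit and repeat, as in the C sketch above.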
@KWillets
I understand that you are trying to improve the identification of the least significant bit. This is equivalent to improving the implementation of Java’s Long.numberOfTrailingZeros. I think if you can do this, you could become semi-famous.
I am suspicious about your memoization approach because a table made of 256 integers is quite large. Accessing such a table is likely slow compared to basic integer operations.
KWillets says:

Most library implementations stay away from table-based approaches for obvious reasons: the tables are large, and the first call will stall while the cache lines are read in. However, if you’ve determined that this function is going to be called heavily, then some judicious use of tables or enumerative switch statements is called for.
The actual table size is tiny since the values are 0-7 or so (on second thought I would handle 0x00 in code and not as an extra table value). A 16-way table would fit into 32 bits.
Another (long known) method I’ve used in C is simply to convert to float and mask/shift the exponent to get the high bit. I’m not sure how far you’re willing to go with this (SSE?), but hacky solutions are many.
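That float trick can be sketched as follows (my adaptation in Java rather than C). The isolation step v & -v matters: an isolated bit is a power of two and converts to float exactly, whereas a general 32-bit value may round and push the exponent up by one:

```java
public class FloatExponent {
    // Index of the lowest set bit of v, for v != 0, read off the
    // float exponent field (bits 23-30 of the IEEE 754 encoding).
    static int lowestSetBit(int v) {
        float f = (float) (v & -v);           // isolate the lowest set bit
        int bits = Float.floatToRawIntBits(f);
        return ((bits >>> 23) & 0xff) - 127;  // mask off the sign, unbias the exponent
    }
}
```

The exponent mask is needed because v & -v is negative for v = Integer.MIN_VALUE, which sets the sign bit of the float encoding.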
@KWillets
Can you spell out your approach again for me? I just don’t understand what you have in mind. Note that we work with 64-bit words. I suppose you can take your 64-bit words and iterate over them as 8 x 8-bit words but I fear this will introduce quite a bit of expensive branching.
If you have a look at my code (see https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2013/12/23), you will find a table-based approach that was reported to be faster than the bitCount approach but, at least on my Core i7, is slower. It is a tiny table, and the only catch is a 64-bit multiplication. Even so, it is slower than calling bitCount.
The smarter our processors get, the less memoization is useful. You are often better off computing cheap things from scratch rather than look them up… at least as long as there is no expensive branching involved.
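For reference, a tiny-table-plus-64-bit-multiplication scheme of the kind alluded to above is typically a De Bruijn lookup. This is my own sketch, not the code from the linked repository:

```java
public class DeBruijn {
    // A 64-bit De Bruijn constant: its top 6 bits are zero, and the top
    // 6 bits of (DEBRUIJN << i) differ for every i in 0..63, so they
    // index a 64-entry table that maps back to i.
    static final long DEBRUIJN = 0x022fdd63cc95386dL;
    static final int[] TABLE = new int[64];
    static {
        for (int i = 0; i < 64; i++) {
            TABLE[(int) ((DEBRUIJN << i) >>> 58)] = i;
        }
    }

    // Trailing-zero count: isolate the lowest bit (a power of two), so the
    // multiplication is just a left shift of the constant, then take the
    // top 6 bits as a table index.
    static int trailingZeros(long v) {
        if (v == 0) return 64;
        return TABLE[(int) (((v & -v) * DEBRUIJN) >>> 58)];
    }
}
```

The table occupies 64 ints, small enough to live in a single pair of cache lines, which is why this variant competes with bitCount at all.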
KWillets says:

This is the 4-bit-suffix version in C. The basic idea is to look up the rightmost 4 bits in a table to find the location of the least significant bit, then shift past it and iterate. At lower densities it will definitely suffer.

Also, I forgot a crucial variable in my earlier sketch :(.
void bits(unsigned v, int *out, int *nout) { /* unsigned so >> is a logical shift */
    /* lowest-order 1 positions for the 16 4-bit values, 2 bits each */
    int lsbs = (0<<0)  + (0<<2)  + (1<<4)  + (0<<6)
             + (2<<8)  + (0<<10) + (1<<12) + (0<<14)
             + (3<<16) + (0<<18) + (1<<20) + (0<<22)
             + (2<<24) + (0<<26) + (1<<28) + (0<<30);
    int shift = 0;
    /* caller initializes *nout to 0 */
    while (v) {
        int mask = v & 15;
        int lsb;
        if (mask) {
            lsb = (lsbs >> (mask << 1)) & 3;
            out[*nout] = shift + lsb;
            *nout += 1;
        } else {
            lsb = 3; /* all four bits zero: skip the whole nibble */
        }
        v >>= (lsb + 1);
        shift += (lsb + 1);
    }
}
I totally agree with Daniel. In fact, reading even from the L1 cache takes a couple of CPU cycles. At the same time, you can probably execute a couple of bit-counting intrinsics in a single CPU cycle.

I posted another well-known memoization approach online, and its performance is even more pathetic: it is up to 8 times slower than a naive approach in C++.

BTW, with the -march=native flag (again on a Core i7), the naive approach apparently outperforms everything else by a good margin (up to 30%):
https://github.com/searchivarius/BlogCode/tree/master/2013/12/StupidMemoryBitscan
PS: Sorry, it wasn’t the naive version, it was bitscan1; I looked at the wrong column in the output.
KWillets says:

I agree that the intrinsic is definitely faster; it’s just that the last time I dealt with this, I don’t think there was any consistent instruction support for it. So mine is just a suggestion for bare-bones processors.
@KWillets
I think that instructions such as x86’s BSR have been around for a long time (as far back as the Intel 386). Still, it is likely that if you go that far back, BSR was probably slow and the lack of smart superscalar execution and branch prediction might have made memoization worth it.
Jeff Plaisance says:

What JVM version did you use for your benchmark? I’m seeing that numberOfTrailingZeros is consistently faster than bitCount on Java 6u33, which is surprising to me since bitCount is branchless.
@Plaisance
The benchmarks were with Java 7 on a recent Intel Core i7 processor.
Thanks. The problem was that I wrote my own benchmark and was only testing the lowest bit in each long. After fixing it to test all the bits, I’m getting consistent results on both Java 6 and Java 7 that agree with yours.
Claus Larsen says:

I realize that this post is a few years old, but I stumbled upon it in my quest for a faster nextSetBit(int fromIndex) on the Java BitSet.
Actually, my quest was for a faster way to do:
for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
And I wanted to share my new way of doing this, which is about 4-5 times faster than the for loop above on a dense set, and about 3 times faster on sets with about 50% density. Results may vary; I have only done some simple testing.
interface BitSetCallback {
void nextSetBit(int i);
}
public void nextSetBitCallBack(BitSetCallback cb) {
    long[] words = this.words;
    int wordsInUse = words.length;
    int u = 0;
    while (u < wordsInUse) {
        long word = words[u];
        // realign to this word's base bit index; otherwise the counter
        // drifts whenever a word's high-order bits are all zero
        int indexCounter = u << 6;
        while (word != 0) {
            long idx = word & 1;
            while (idx == 0) {
                word >>>= 1;
                ++indexCounter;
                idx = word & 1;
            }
            cb.nextSetBit(indexCounter);
            word >>>= 1;
            ++indexCounter;
        }
        ++u;
    }
}
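For comparison, a common alternative (my own sketch, not Claus's code) clears the lowest set bit each iteration with word & (word - 1) and uses numberOfTrailingZeros to find its index, so the inner loop runs once per set bit rather than once per bit position:

```java
import java.util.function.IntConsumer;

public class WordScan {
    // Visit every set bit index in words[], in increasing order.
    static void forEachSetBit(long[] words, IntConsumer cb) {
        for (int u = 0; u < words.length; u++) {
            long word = words[u];
            while (word != 0) {
                cb.accept((u << 6) + Long.numberOfTrailingZeros(word));
                word &= word - 1;  // clear the lowest set bit
            }
        }
    }
}
```

Since numberOfTrailingZeros is intrinsified to a single instruction on modern JVMs, this tends to do well at low densities, where the shift-by-one loops above pay per bit position rather than per set bit.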