There are some downsides to the bitmap approach: you first have to construct the bitmaps and then you have to extract the set bits.
If you do a bunch of intersections, you end up with few non-zero bits/bytes. Hence, extraction may be quite cheap.
PS: Is JavaEWAH a sparse bitmap?
@Itman
1) Bitmap decoding should be cheap compared to the cost of computing an intersection, but it is still a significant overhead. Thankfully, you can reduce this overhead with fast bitmap decoding algorithms (see http://lemire.me/blog/archives/2012/05/21/fast-bitmap-decoding/).
2) Yes, JavaEWAH supports sparse bitmaps. There is also a C++ counterpart (https://github.com/lemire/EWAHBoolArray) but JavaEWAH is more popular.
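To make the decoding step concrete, here is a minimal Java sketch of the usual extraction loop over plain 64-bit words (names are illustrative; this is not the JavaEWAH API, which adds run-length compression on top of the same word-level operations):

    // Sketch: AND two uncompressed bitmaps, then decode the positions of the set bits.
    final class BitmapDecode {
        // AND word by word (assumes both bitmaps have the same length).
        static long[] and(long[] a, long[] b) {
            long[] out = new long[a.length];
            for (int i = 0; i < a.length; i++) {
                out[i] = a[i] & b[i];
            }
            return out;
        }

        // Extract set bits by repeatedly clearing the lowest one.
        static java.util.List<Integer> positions(long[] words) {
            java.util.List<Integer> result = new java.util.ArrayList<>();
            for (int i = 0; i < words.length; i++) {
                long w = words[i];
                while (w != 0) {
                    result.add(i * 64 + Long.numberOfTrailingZeros(w));
                    w &= w - 1; // clear the lowest set bit
                }
            }
            return result;
        }
    }

After several bitmaps have been AND-ed together, most words are zero and the inner loop rarely runs, which is why extraction can end up cheap.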
A good question to ask is how few bits we have to examine to construct the intersection of 2 or more vectors, given any choice of encoding for the vectors (without pre-storing intersections, etc.).
I looked at this problem some years ago and decided that (delta and) Golomb-Rice coding could be rearranged into a sort of bucketed bitmap arrangement, where the range is broken into buckets of M consecutive ints, represented in a bitmap (corresponding to the unary parts of the coding), and an array of offsets within each occupied bucket, encoded in truncated binary as in the original.
For example, a sorted vector of ints in the range 0-127, with a Golomb-Rice bucket size of 8, would be broken into a 16-bit bitmap (buckets 0-7, 8-15, …, 120-127), and an array of offsets r < 8 within those buckets, in truncated binary (since a bucket can contain more than one value, the offsets also need a bit to indicate "consecutive in the same bucket" or some such).
Calculating the intersection is a two-step process of first and-ing the bitmaps, and then fetching and comparing offsets in any bucket where a 1 remains. Since the offsets may be slow to fetch, the idea is to use the bitmaps as much as possible.
The optimality of Golomb-Rice means the average bitmap will be about 50% sparse, and n-way intersections can get 2^-n sparsity before any bucket contents need to be examined.
I never decided on the best way to store the offsets, but this seemed like a good way to size the summary bitmaps.
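A rough Java sketch of that two-step intersection, with illustrative names, using fixed-width (Rice-style) offsets and a plain per-bucket array rather than a packed truncated-binary stream:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: values in [0, 128) split into 16 buckets of 8. A 16-bit summary bitmap
    // marks occupied buckets; each occupied bucket keeps its sorted 3-bit offsets
    // (stored here as a plain int[] per bucket for simplicity).
    final class BucketedSet {
        final int summary;      // bit i set => bucket i is occupied
        final int[][] offsets;  // offsets[i] = sorted offsets (0..7) within bucket i

        BucketedSet(int[] sortedValues) {
            List<List<Integer>> perBucket = new ArrayList<>();
            for (int i = 0; i < 16; i++) perBucket.add(new ArrayList<>());
            int s = 0;
            for (int v : sortedValues) {
                s |= 1 << (v >>> 3);
                perBucket.get(v >>> 3).add(v & 7);
            }
            summary = s;
            offsets = new int[16][];
            for (int i = 0; i < 16; i++)
                offsets[i] = perBucket.get(i).stream().mapToInt(Integer::intValue).toArray();
        }

        // Step 1: AND the summary bitmaps. Step 2: merge offsets only where a 1 remains.
        static List<Integer> intersect(BucketedSet a, BucketedSet b) {
            List<Integer> out = new ArrayList<>();
            int common = a.summary & b.summary;
            while (common != 0) {
                int bucket = Integer.numberOfTrailingZeros(common);
                common &= common - 1;
                int[] x = a.offsets[bucket], y = b.offsets[bucket];
                for (int i = 0, j = 0; i < x.length && j < y.length; ) {
                    if (x[i] == y[j]) { out.add(bucket * 8 + x[i]); i++; j++; }
                    else if (x[i] < y[j]) i++;
                    else j++;
                }
            }
            return out;
        }
    }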
You have basically just described Elias-Fano coding, which is provably optimal. See these slides for a nice intro: http://shonan.nii.ac.jp/seminar/029/wp-content/uploads/sites/12/2013/07/Sebastiano_Shonan.pdf
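For reference, the core of Elias-Fano coding is a similar split into high bits (a bit vector) and fixed-width low bits; a rough encoding sketch, with illustrative names:

    // Sketch of the Elias-Fano split for n sorted values in [0, u):
    // the low l = floor(log2(u/n)) bits of each value go into a fixed-width array,
    // and element i sets bit (value >> l) + i in a bit vector, i.e. the gaps between
    // successive high parts are written in unary.
    final class EliasFanoSketch {
        static void encode(int[] sorted, int u) {
            int n = sorted.length;
            int l = 31 - Integer.numberOfLeadingZeros(Math.max(1, u / n));
            long[] highBits = new long[((u >>> l) + n) / 64 + 1];
            int[] lowBits = new int[n]; // in practice packed into l-bit fields
            for (int i = 0; i < n; i++) {
                lowBits[i] = sorted[i] & ((1 << l) - 1);
                int pos = (sorted[i] >>> l) + i;
                highBits[pos >>> 6] |= 1L << (pos & 63);
            }
            // Together this takes roughly n * (2 + log2(u/n)) bits.
        }
    }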
@KWillets
It is a reasonable approach. Have you tried implementing it?
I just love the way you turn simple problems into fun analysis!
No, I did some fiddling with it, but my employer at the time wasn’t interested, and I didn’t have the time to flesh it out fully. It might be fun to code up a simple benchmark.
@KWillets
Get in touch with me if you do.
It might be worth a shot.
It’s also easier in Rice coding, because the offsets/remainders are fixed width.
Clearing the sparse bitmap can take a while and that’s a pain if it’s a frequent operation. Russ Cox gives a nice explanation of an old technique for using uninitialised memory for a sparse set that allows a fast clear with slightly higher cost to the set and test operations. http://research.swtch.com/sparse
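The structure Cox describes (the Briggs-Torczon sparse set) keeps a dense array of members and a sparse array mapping values to their slots, so clearing is just resetting a counter. A rough Java sketch (Java zero-initializes arrays, so the uninitialised-memory aspect is lost, but the constant-time clear remains):

    // Sketch of the sparse-set trick: 'dense' lists the members, 'sparse' maps a
    // value to its slot in 'dense', and clearing only resets the counter.
    final class SparseSet {
        private final int[] dense;
        private final int[] sparse;
        private int n = 0;

        SparseSet(int universe) {
            dense = new int[universe];
            sparse = new int[universe];
        }

        boolean contains(int v) {
            return sparse[v] < n && dense[sparse[v]] == v;
        }

        void add(int v) {
            if (!contains(v)) {
                dense[n] = v;
                sparse[v] = n;
                n++;
            }
        }

        void clear() {
            n = 0; // no need to touch the arrays
        }
    }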
I read somewhere that the most efficient data structure to represent a deck of playing cards is actually a bitmap, perhaps organised as 4*16-bit numbers representing the suits. Then it becomes much simpler/faster to detect various hands using bitwise operations.
This is most useful in AI where a lot of eventualities need to be computed quickly.
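For concreteness, here is a rough Java sketch of one such layout: 16 bits per suit packed into a single 64-bit word, one bit per rank, so hand tests reduce to masks and popcounts (the layout and names are illustrative, not a standard):

    // Sketch: a hand as a 64-bit word, 16 bits per suit, 13 rank bits used per suit.
    // Bit (16 * suit + rank) is set when the hand holds that card.
    final class CardBitmap {
        static long addCard(long hand, int suit, int rank) { // suit 0..3, rank 0..12
            return hand | (1L << (16 * suit + rank));
        }

        // Flush: some suit holds at least five cards.
        static boolean hasFlush(long hand) {
            for (int suit = 0; suit < 4; suit++) {
                if (Long.bitCount((hand >>> (16 * suit)) & 0x1FFF) >= 5) return true;
            }
            return false;
        }

        // Collapse suits: a rank bit is set if any suit holds that rank
        // (a starting point for straight detection).
        static int ranksPresent(long hand) {
            return (int) ((hand | (hand >>> 16) | (hand >>> 32) | (hand >>> 48)) & 0x1FFF);
        }
    }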
Any idea how EWAH compares in speed or size to “COmpressed ‘N’ Composable Integer SET”?
see: http://ricerca.mat.uniroma3.it/users/colanton/concise.html
I am trying to find a heuristic that would let a database engine decide which implementation to use.
@Maxime
I have written a benchmark to compare the various bitmap libraries, including Concise.
Please go to
https://github.com/lemire/simplebitmapbenchmark
If you are interested in the results I get on my machine, you can browse:
https://github.com/lemire/simplebitmapbenchmark/blob/master/RESULTS
As usual, there are trade-offs.
If you want to run the test yourself, grab a copy and execute the bash script “run.sh” (or the Windows equivalent).
If you want to hack the code or add your own benchmarks, I’ll be very interested in any contribution.