Daniel Lemire's blog

, 10 min read

When is a bitmap faster than an integer list?

13 thoughts on “When is a bitmap faster than an integer list?”

  1. Itman says:

    There are some downsides to the bitmap approach: you first have to construct the bitmaps and then you have to extract the set bits.

    If you do a bunch of intersections, you end up with few non-zero bits/bytes. Hence, extraction may be quite cheap.

    PS: Is JavaEH a sparse bitmap?

  2. @Itman

    1) Bitmap decoding should be cheap compared to the cost of computing an intersection, but it is still a significant overhead. Thankfully, you can reduce this overhead with fast bitmap decoding algorithms (see http://lemire.me/blog/archives/2012/05/21/fast-bitmap-decoding/).

    2) Yes, JavaEWAH supports sparse bitmaps. There is also a C++ counterpart (https://github.com/lemire/EWAHBoolArray) but JavaEWAH is more popular.

  3. KWillets says:

    A good question to ask is how few bits we have to examine to construct the intersection of 2 or more vectors, given any choice of encoding for the vectors (without pre-storing intersections, etc.).

    I looked at this problem some years ago and decided that (delta and) Golomb-Rice coding could be rearranged into a sort of bucketed bitmap arrangement, where the range is broken into buckets of M consecutive int’s, represented in a bitmap (corresponding the unary parts of the coding), and an array of offsets within each occupied bucket, encoded in truncated binary as in the original.

    For example, a sorted vector of int’s in the range 0-127, with a Golomb-Rice bucket size of 8, would be broken into a 16-bit bitmap (buckets 0-7,8-15, …,120-127), and an array of offsets r < 8 within those buckets, in truncated binary (since a bucket can contain more than one value, the offsets also need a bit to indicate "consecutive in the same bucket" or some such).

    Calculating the intersection is a two-step process of first and-ing the bitmaps, and then fetching and comparing offsets in any bucket where a 1 remains. Since the offsets may be slow to fetch, the idea is to use the bitmaps as much as possible.

    The optimality of Golomb-Rice means the average bitmap will be about 50% sparse, and n-way intersections can get 2^-n sparsity before any bucket contents need to be examined.

    I never decided on the best way to store the offsets, but this seemed like a good way to size the summary bitmaps.

    1. Tobin Baker says:

      You have basically just described Elias-Fano coding, which is provably optimal. See these slides for a nice intro: http://shonan.nii.ac.jp/seminar/029/wp-content/uploads/sites/12/2013/07/Sebastiano_Shonan.pdf

  4. @KWillets

    It is a reasonable approach. Have you tried implementing it?

  5. Ajinkya Kale says:

    I just love the way you treat simple problems into fun analysis!

  6. KWillets says:

    No, I did some fiddling with it, but my employer at the time wasn’t interested, and I didn’t have the time to flesh it out fully. It might be fun to code up a simple benchmark.

  7. @KWillets

    Get in touch with me if you do.

  8. KWillets says:

    It might be worth a shot.

    It’s also easier in Rice coding, because the offsets/remainders are fixed width.

  9. Clearing the sparse bitmap can take a while and that’s a pain if it’s a frequent operation. Russ Cox gives a nice explanation of an old technique for using uninitialised memory for a sparse set that allows a fast clear with slightly higher cost to the set and test operations. http://research.swtch.com/sparse

  10. Calum says:

    I read somewhere that the most efficient data structure to represent a deck of playing cards is actually a bitmap, perhaps organised as 4*16-bit numbers representing the suits. Then it becomes much simpler/faster to detect various hands using bitwise operations.

    This is most useful in AI where a lot of eventualities need to be computed quickly.

  11. Maxime Caron says:

    Any idea how EWAH compare in speed or size to “COmpressed ‘N’ Composable Integer SET”
    see: http://ricerca.mat.uniroma3.it/users/colanton/concise.html

    I am trying to heuristic that would let a database engine decide which implementation to use.

  12. @Maxime

    I have written a benchmark to compare the various bitmap libraries, including Concise.

    Please go to

    https://github.com/lemire/simplebitmapbenchmark

    If you are interested in the results I get on my machine, you can browse:

    https://github.com/lemire/simplebitmapbenchmark/blob/master/RESULTS

    As usual, there are trade-offs.

    If you want to run the test yourself, grab a copy and execute the bash script “run.sh” (or the Windows equivalent).

    If you want to hack the code or add your own benchmarks, I’ll be very interested in any contribution.