Daniel Lemire's blog

, 2 min read

Trading compression for speed with vectorization

Bitmap indexes are used by search engines (such as Apache Lucene), they are available in DBMSes such as Oracle and PostgreSQL. They are used in column stores such as the Open Source engines Eigenbase and C-Store, as well as by many commercial solutions such as Vertica.

Bitmap indexes are silly data structures. Map each value to an array of booleans. Hence, if you have n rows in your table, and k distinct values, you get an n by k matrix containing booleans. Thus, some people falsely assume that bitmap indexes are only adequate when there are few distinct values (e.g., the gender column, male and female being the only two options). However—using techniques based on run-length encoding—the total size of your bitmaps is proportional to the size of the original table, irrespective of the number of distinct values!

Bitmap indexes are fast because they benefit from vectorization. Indeed, let the predicate “sex=male” is satisfied on rows 1, 5, 32, 45, 54 and 63. I can determine which rows satisfy the extended predicate “(sex=male) AND (city=Montreal)” using a single instruction! The secret? A bitwise AND between the bitmaps “sex=male” and “city=Montreal”. You can compute unions, differences and intersections between sets of integers in [1,N] using only N/64 operations. All microprocessors have built-in parallelism because they operate on several bits at once.

To benefit from vectorization, you need to store the data in a word-aligned manner: that is, you store consecutive segments of bits uncompressed. The longer the words, the less compression. Roughly speaking, 64-bit bitmap indexes are nearly twice as large as 32-bit bitmap indexes. What is the effect on the processing speed? We found that despite being much larger, 64-bit bitmap indexes were faster. That is right: it was faster to load twice as much data from disk!

Yet, we often equate concise data structures with more speed. This assumption can be misguided. Given a choice between more compression, or more vectorization, I would choose more vectorization.

References:

Further reading: See my posts Compressed bitmaps in Java, To improve your indexes: sort your tables!, and The mythical bitmap index.