Daniel Lemire's blog


The mythical bitmap index

31 thoughts on “The mythical bitmap index”

  1. Thank you Daniel. This is more than I wished for.

  2. Daniel, I’m glad your pushing this. If you check out the talk page for the entry (http://en.wikipedia.org/wiki/Talk:Bitmap_index), you’ll see that a couple of us are supporting your efforts.

    Fixing Wikipedia is a thankless but worthwhile task, and I’m glad to see researcher-bloggers taking it on. I’ve had a few success stories myself:

    http://thenoisychannel.blogspot.com/2008/06/your-input-really-is-relevant.html

    http://thenoisychannel.blogspot.com/2008/06/exploratory-search-is-relevant-too.html

  3. Ouch, make that “you’re”. Next time I’ll proofread before getting caught up in the Roman numeral arithmetic.

    And the talk page for the Wikipedia entry didn’t get turned into a hot link, so I’ll try again:

    http://en.wikipedia.org/wiki/Talk:Bitmap_index

  4. Ragib Hasan says:

Hi Daniel, I am not sure why your edit got reverted (I can’t find the diff at a glance), but Wikipedia is pretty much open to referenced information. Almost any information that is properly referenced from several reliable sources gets to stay.

I have added the information and a reference to the Sharma benchmark article; feel free to provide further links/references on the article talk page.

  5. Thank you. I saw that earlier today.

  6. Chris Dew says:

If all the values of a field (indexed with a bitmap index) were unique, would the bitmaps be extremely short? Would there need to be a tree structure to store the indices (or their locations)? How is the correct bitmap for a field found when there are large numbers of distinct values?

  7. Will says:

    Daniel,

    I’m installing the software today, and don’t have the GNU seq command installed; so the /testequalityqueries.sh failed.

Let me suggest you replace `seq 0 10` with `0 1 2 3 4 5 6 7 8 9 10` in the two places seq is called in the script. This was then successful.

  8. Thank you Will.

If all the values of a field (indexed with a bitmap index) were unique, would the bitmaps be extremely short? Would there need to be a tree structure to store the indices (or their locations)? How is the correct bitmap for a field found when there are large numbers of distinct values?

As I point out in my post, you typically use a b-tree to get the location of the bitmap. You need some sort of b-tree-like data structure because compressed bitmaps have different sizes, so you can’t just store them in a fixed-length-record flat file.

It is not very exciting to index a field where all values are distinct because, yes, the bitmaps would all be very short. Ah! But you rarely have a single dimension. Bitmap indexes are typically used in a decision-support (DSS) context where you “always” have numerous dimensions.
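To make the mechanics above concrete, here is a toy bitmap index in Python: one bitmap per distinct value, with a sorted key list standing in for the b-tree that locates each variable-size bitmap. All class and method names here are invented for this sketch; no real library is being imitated.

```python
import bisect

class ToyBitmapIndex:
    """One bitmap per distinct value; a sorted key list plays the b-tree."""

    def __init__(self, column):
        bitmaps = {}
        for row, value in enumerate(column):
            # set bit `row` in the bitmap for this value
            bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
        self.keys = sorted(bitmaps)  # b-tree stand-in over the values
        self.bitmaps = bitmaps

    def lookup(self, value):
        i = bisect.bisect_left(self.keys, value)  # logarithmic search
        if i < len(self.keys) and self.keys[i] == value:
            return self.bitmaps[value]
        return 0  # empty bitmap: value not present

idx = ToyBitmapIndex(["red", "blue", "red", "green"])
print(bin(idx.lookup("red")))  # rows 0 and 2 set -> 0b101
```

With all-distinct values, every bitmap in this sketch holds a single set bit, which is why such a column makes a dull bitmap index on its own.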

  10. John Haugeland says:

    Jesus, you’re surprised that Wikipedia is handing out bad technical advice? Have you noticed who writes it yet?

    http://sc.tri-bit.com/outgoing/NeverTrustWikipediaTechnicalArticles.png

    Also, your anti-spam mechanism is inappropriately case sensitive.

  11. Parand says:

    Thanks Daniel, I learned something today.

  12. I’m no expert, but it seems people who authored http://i.cs.hku.hk/~ssdbm/slides/SSDBM.July11/Session7.2.pdf are, and I think they claim the opposite.

  13. @Otis: Wu is an expert indeed. Please see his 2006 paper “Optimizing bitmap indices with efficient compression”

    I quote their abstract:

“In this paper, we present a new compression scheme called Word-Aligned Hybrid (WAH) code that makes compressed bitmap indices efficient even for high cardinality attributes.”

    (WAH is based on BBC which is used by Oracle.)

    1. Anifat says:

Though I am only a student, I can confirm that I have tested this assumption in Oracle 11g. The bitmap index is efficient for high cardinality and very fast for evaluating aggregate queries.

  14. zhuo wang says:

Can we say that WAH is the best of all kinds of bitmap indices? How does Chan’s multi-component interval encoding method compare to WAH?
Will WAH be the overall winner? According to Wu’s paper, WAH compression works well for sparse bitmaps, i.e., large cardinality, and for skewed data. Maybe some other schemes will beat WAH at lower cardinality.

The general idea behind WAH is sound, and that is what people should be using: not necessarily WAH itself, but some variant. For example, we decided to use EWAH. (Note that WAH itself is patented.)

    Chan’s interval encoding is a bit of a problem because it will generate an incompressible index. If you have L different attribute values in your column, and n rows, you will end up with nL bits that are incompressible. This can be quite a pain. It will be slow to index, it will use a lot of storage, it will be hard to keep things in memory, and so on.

    I would not recommend interval coding. Rather, look at hierarchical coding with maybe 2-3 levels.
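To make the word-aligned RLE idea concrete, here is a toy compressor in the spirit of WAH/EWAH. This is a simplification, not the actual WAH or EWAH format: runs of “clean” words (all zeros or all ones) collapse into a counter, while mixed “dirty” words are stored verbatim.

```python
# Toy word-aligned run-length compression (WAH/EWAH-flavored sketch).
WORD = 32
CLEAN_ONES = (1 << WORD) - 1  # a word with every bit set

def compress(words):
    out = []  # tokens: ("run", clean_word, count) or ("literal", word)
    for w in words:
        if w in (0, CLEAN_ONES):
            # extend the previous run if it holds the same clean word
            if out and out[-1][0] == "run" and out[-1][1] == w:
                out[-1] = ("run", w, out[-1][2] + 1)
            else:
                out.append(("run", w, 1))
        else:
            out.append(("literal", w))
    return out

# A sparse bitmap: long runs of zero words collapse to single tokens.
sparse = [0] * 1000 + [0b1010] + [0] * 1000
compressed = compress(sparse)
print(len(compressed))  # 3 tokens instead of 2001 words
```

The point of the word alignment is that, unlike bit- or byte-level RLE, every token stays on machine-word boundaries, so Boolean operations can work directly on the compressed form.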

  16. zhuo wang says:

Where can I find the description of EWAH? What is hierarchical coding? Where can I find it?

Yes, the results of interval encoding and the like will be hard to compress further, but they are not supposed to be compressed. The n*L bits with L distinct values and n rows are not interval encoding; they are just the simple (basic) bitmap index. For interval encoding with m components, where m = log_b(L), it will produce m*(b/2)*n bits, far less than n*L bits.

Where can I find the description of EWAH?

    EWAH is not an important contribution, nevertheless… see this paper:

    Sorting improves word-aligned bitmap indexes

    See also my slides:

    http://www.slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them

What is hierarchical coding? Where can I find it?

R. R. Sinha, M. Winslett, Multi-resolution bitmap indexes for scientific data, ACM Trans. Database Syst. 32 (3) (2007)

Yes, the results of interval encoding and the like will be hard to compress further, but they are not supposed to be compressed. The n*L bits with L distinct values and n rows are not interval encoding; they are just the simple (basic) bitmap index. For interval encoding with m components, where m = log_b(L), it will produce m*(b/2)*n bits, far less than n*L bits.

With RLE-based compression (including BBC, WAH, and EWAH), compression accelerates queries as well as reducing storage. The total storage will be O(n) with a small constant.
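As a rough illustration of why compression accelerates queries, the sketch below intersects two bitmaps stored as runs of consecutive set bits: the cost is proportional to the number of runs, not the number of rows. This is a simplification of the word-level processing in BBC/WAH/EWAH, with an invented run representation.

```python
# Intersect two bitmaps represented as sorted (start, length) runs of
# set bits. Gaps between runs are skipped wholesale, never scanned.
def intersect_runs(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])                      # overlap start
        e = min(a[i][0] + a[i][1], b[j][0] + b[j][1])  # overlap end
        if s < e:
            out.append((s, e - s))
        # advance whichever run ends first
        if a[i][0] + a[i][1] <= b[j][0] + b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Two bitmaps over half a billion rows, each with a handful of runs:
a = [(0, 10), (500_000_000, 5)]
b = [(5, 10), (500_000_000, 2)]
print(intersect_runs(a, b))  # [(5, 5), (500000000, 2)]
```

An uncompressed AND over the same data would touch every word between the runs; here the loop executes only a few iterations regardless of how far apart the runs are.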

Note that Chan and Ioannidis did not invent multicomponent indexing. It goes back to the seventies. What they showed is that if you don’t compress your bitmaps using RLE compression, then one level of multicomponent indexing is ideal. The problem with this analysis is that there is no reason not to compress the bitmap indexes with RLE.

    Then, they invented interval coding. The problem is that interval coding precludes compression!

Yes, you can use multicomponent indexes… but you can use multicomponents with anything, not just interval coding. Is your factor m*b/2 small? It is actually m L^(1/m)/2, and it is pretty hard to make that into a small constant.
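For a sense of scale, here is the factor m L^(1/m)/2 evaluated for a column with a million distinct values (a quick computation using the formula as given, interpreting it as bits per row):

```python
# Blow-up factor m * L**(1/m) / 2 for L = 1,000,000 distinct values.
L = 1_000_000
for m in (1, 2, 3, 5):
    print(m, round(m * L ** (1 / m) / 2))
```

Even with five components the factor is still roughly 40 bits per row, which is why it is hard to drive this down to a small constant without hurting query speed.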

    Think about the fact that if the index is ten times the size of the original table, this may be prohibitive. Indexes are not supposed to be much larger than the original data set… Very large indexes tend to be slow in practice due to buffering problems.

    But read what Sinha and Winslett have written on this topic. Wu has also written on why interval coding is not a good idea.

According to Wu’s paper, the WAH space complexity is 2N words for most datasets, 4N for the worst case. Note it is 2N words, that is, 64N bits. My m*(b/2)*n is in bits! m and b are both very small, aren’t they? I can’t imagine m L^(m/2)/2, since L can be a very big number. Chan would not be so foolish as to present such a solution. 🙂

    Chan is far from foolish. That is not my point. Interval coding is a nice idea, but it fails for non-trivial reasons. You only realize it is a bad idea after playing with real-world implementations of the idea.

The formula is something like m L^(1/m)/2 and not m L^(m/2)/2. To make m L^(1/m)/2 into a small constant, you need m to be rather large (say m = 5).

Eventually, you will make interval coding just as small as a regular BBC, WAH or EWAH bitmap index, but what will happen to your query performance? Remember that every time you use the multicomponent trick, your performance degrades… and it does so faster than you might think! To see why, try to implement it! You will see that it gets nasty, and you end up with rather complicated Boolean functions.

So you will need to load many incompressible bitmaps. Hence, for any range query, the amount of data you will load is in Omega(n) because your bitmaps are incompressible. The only way this is acceptable is if your selectivity is always very low. But then, at that point, why not use a bit-sliced or projection index?

There is absolutely no way you can get good practical performance, and nice fundamental properties, with incompressible bitmaps. Multicomponent indexing is not a good substitute for compression because it merely trades speed for space. You want better speed and less storage, both together! RLE compression does exactly this!

BTW, what do you think of Hakan’s approximate bitmap index? I have developed this scheme into a 100% accurate one with 0 false positives.
Do you think this is promising?

    Hard to tell. Send me a draft and I’ll give you an opinion. But a good sanity test is to implement your scheme and see how fast it is!

    I’m guessing you have a scheme that allows random access? Is it random access in constant time? Then what is the constant?

  19. zhuo wang says:

Thank you for your links to those materials…

According to Wu’s paper, the WAH space complexity is 2N words for most datasets, 4N for the worst case. Note it is 2N words, that is, 64N bits. My m*(b/2)*n is in bits! m and b are both very small, aren’t they? I can’t imagine m L^(m/2)/2, since L can be a very big number. Chan would not be so foolish as to present such a solution. 🙂

BTW, what do you think of Hakan’s approximate bitmap index? I have developed this scheme into a 100% accurate one with 0 false positives.
Do you think this is promising?

I didn’t mean that interval encoding outperforms WAH; I agree that WAH or EWAH could be the best scheme for bitmap indexes. I have read Wu’s ACM paper (38 pages) and it really makes sense.

If you have a high selectivity query, your effort should be small. But with interval coding, your effort is always proportional to the number of rows in the entire table. That is not good for high selectivity queries. And interval coding does not even fare all that well for low selectivity queries, where I prefer projection and bit-sliced indexes. For one thing, it is far simpler to implement a projection index than interval coding!

As for the approximate bitmap (AB) index, I implemented it and improved it into an accurate scheme with a small space and time overhead. But experiments show that the AB scheme is hard to make practical, because it only suits small query regions (narrow row and column selectivity). This technique essentially doesn’t work for bitmap indexes. The problem is the heavy iterative cost of building (row+col) keys for hashing, which is far more time consuming than the CPU bitwise operations WAH exploits.

    It is unfortunately easy to underestimate the CPU bottleneck when working on paper. Researchers often write that since we have 4 cores or more per CPU, CPU cycles are cheap. But that is not so simple, is it?

  21. zhuo wang says:

Thanks, Daniel. 🙂
I didn’t mean that interval encoding outperforms WAH; I agree that WAH or EWAH could be the best scheme for bitmap indexes. I have read Wu’s ACM paper (38 pages) and it really makes sense.

As for the approximate bitmap (AB) index, I implemented it and improved it into an accurate scheme with a small space and time overhead. But experiments show that the AB scheme is hard to make practical, because it only suits small query regions (narrow row and column selectivity). This technique essentially doesn’t work for bitmap indexes. The problem is the heavy iterative cost of building (row+col) keys for hashing, which is far more time consuming than the CPU bitwise operations WAH exploits.

  22. @Lyle

    For an uncompressed bitmap, you are correct. But if each bitmap is compressed, the bitmap index will use O(n * log(n)) space.

    Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.

    http://arxiv.org/abs/0901.3751

  23. Lyle Kopnicky says:

    Maybe I’m not understanding bitmap indexes, but it seems to me that if you need 1 bit per possible value per record, it requires n^2 bits for n records when each record has a distinct value. The table itself requires only n*log(n) space to store the same values, so the bitmap index is bigger.

Even if you only have two possible values, male and female, then your table requires one bit per record, but your index requires two bits per record!
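For concreteness, here is the arithmetic behind this size comparison, assuming n rows with all-distinct values and one bit per value per record for the uncompressed bitmaps:

```python
# With n rows, all values distinct: an uncompressed bitmap index needs
# n*n bits, while the raw column needs about n*log2(n) bits. A
# run-length-compressed index is also roughly n*log2(n) bits, since each
# of the n bitmaps has a single set bit whose position costs O(log n)
# bits to encode.
import math

n = 1_000_000
index_bits = n * n
column_bits = n * math.ceil(math.log2(n))
print(index_bits // column_bits)  # uncompressed index is 50,000x the column
```

So the uncompressed index is indeed much bigger than the table, and compression is what brings it back down to the same O(n log n) order as the data itself.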

  24. Kemal Erdogan says:

    Hi Daniel,

Thank you very much for sharing your work on EWAH implementations. As far as I understand, bitmap indexes are expected to be ANDed, ORed, etc., many times over. However, your implementation creates a new bitmap as the result of each operation. That makes it necessary to scan the indexes at least as many times as there are logical predicates in a query. I was wondering whether this could be changed: instead of returning a new bitmap, you could return an EWAHIterator, and also accept an EWAHIterator as input. This would allow a single scan to be done for the whole chain of logical operations. Could you please comment on this?

    Regards,
    Kemal

  25. @Kemal Erdogan

    Thanks. Your comment is sound and could be addressed in one of several ways.

    I assume that you are using the Java version? If so, can you submit this either as an issue on github:

    https://github.com/lemire/javaewah

    or on

    http://code.google.com/p/javaewah/

If you are using the C++ version, please add this as an issue on

    http://code.google.com/p/lemurbitmapindex/

  26. Lil says:

Hi, I was trying to use your C++ implementation, but when I run make I get the following error:

    headers/ewahutil.h:21:23: error: conflicting declaration ‘typedef long unsigned int uint32_t’
    /usr/include/stdint.h:52:23: error: ‘uint32_t’ has a previous declaration as ‘typedef unsigned int uint32_t’
    make: *** [unit] Error 1

Can you tell me how to fix it?

  27. @Lil

    Please report issues using either github or preferably, Google code:

    http://code.google.com/p/lemurbitmapindex/issues/list

    Be sure to specify which compiler you are using.

  28. Jason Davies says:

    The Oracle link seems to be broken, the correct link is: http://www.oracle.com/technetwork/articles/sharma-indexes-093638.html

  29. @Jason

    Thanks for the update.

  30. Lil says:

Hi again. I am working with the C++ library, and I want to store the compressed bitmap’s hexadecimal representation, but I still haven’t figured out how to do it. I was trying to use the compressed word iterator, but since I am not used to working in C++ it is not very clear to me how to proceed. I also tried the writeBuffer method, but it does not write anything to the file; I am not sure if I am using it correctly. Can you help me?