@Pagh I agree that your paper, which I read when it first came out on arxiv, is very interesting.
Of course, we often use fixed-length counters in actual databases because delta codes, such as the one you use, may require many CPU operations to read a single number. Trading space for fewer CPU operations is often worth it. So it is not entirely clear how your data structure would hold up on our current breed of CPUs.
(This is not to be taken as a criticism of your work, which I found to be excellent.)
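To make the decoding-cost point a bit more concrete, here is a rough C sketch (my own illustration, not code from your paper) contrasting a fixed-length counter, which is a single aligned load, with an Elias gamma decoder, one member of the delta-code family, which has to walk the bitstream bit by bit. The bit layout assumed here (LSB-first within 64-bit words) is an arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustration only: contrast the cost of reading a fixed-length
 * counter with decoding an Elias gamma code, one member of the
 * delta-code family.  The bit layout (LSB-first within 64-bit words)
 * is an assumption made for this sketch. */

typedef struct {
    const uint64_t *words;   /* the encoded bitstream */
    size_t bitpos;           /* current position, in bits */
} bitreader;

/* Fixed-length counter: a single aligned load. */
static uint32_t read_fixed32(const uint32_t *counters, size_t i) {
    return counters[i];
}

static unsigned next_bit(bitreader *r) {
    unsigned bit = (unsigned)((r->words[r->bitpos >> 6] >> (r->bitpos & 63)) & 1u);
    r->bitpos++;
    return bit;
}

/* Elias gamma: count the unary prefix one bit at a time, then read as
 * many more bits -- a data-dependent, branchy loop per number. */
static uint64_t read_gamma(bitreader *r) {
    unsigned len = 0;
    while (next_bit(r) == 0)
        len++;
    uint64_t value = 1;                   /* the leading 1 bit */
    for (unsigned k = 0; k < len; k++)
        value = (value << 1) | next_bit(r);
    return value;
}
```

Branch-heavy, data-dependent loops like the one in read_gamma are exactly what superscalar CPUs dislike, which is why fixed-length counters often win in practice despite the space overhead.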
@Pagh I’ll answer your comment with my next blog post.
This is an interesting question, both in practice and theory. I cannot resist pointing out a related paper by Srinivasa Rao and myself that appeared at PODS 2009:
Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes
It is an extension of previous work by Sinha and Winslett, and suggests a data structure that combines the best properties of bitmap indexes and B-trees, with a particular focus on range queries. I would be very interested in seeing an engineering of these ideas, with experiments – currently we lack the resources to do this kind of thing ourselves…
@Callaghan
You are totally right. WAH is an update to BBC. Conceptually, it is the same idea. (But you could object that BBC is itself just an update of classic run-length encoding techniques.) My own work is also merely an update, in this sense, except that I considered table sorting very carefully as a factor (http://arxiv.org/abs/0901.3751 ).
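For readers who have not looked at these formats, here is a minimal WAH-style sketch in C; it only illustrates word-aligned run-length encoding, and the exact header layout of real WAH and BBC implementations differs. Each 32-bit output word is either a literal carrying 31 bitmap bits or a fill word encoding a run of identical all-zero or all-one 31-bit groups.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal WAH-style sketch, for illustration only.  Input: the bitmap
 * split into 31-bit groups, each stored in the low 31 bits of a
 * uint32_t.  Run lengths are assumed to fit in 30 bits. */

#define ALL_ONES_31 0x7FFFFFFFu

static size_t wah_compress(const uint32_t *groups, size_t n, uint32_t *out) {
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t g = groups[i];
        if (g == 0u || g == ALL_ONES_31) {
            /* Count identical all-zero or all-one groups. */
            uint32_t fillbit = (g == ALL_ONES_31);
            size_t run = 1;
            while (i + run < n && groups[i + run] == g)
                run++;
            /* Fill word: MSB = 1, next bit = fill value, low 30 bits = run. */
            out[o++] = 0x80000000u | (fillbit << 30) | (uint32_t)run;
            i += run;
        } else {
            /* Literal word: MSB = 0, low 31 bits carry the group verbatim. */
            out[o++] = g;
            i++;
        }
    }
    return o;   /* number of 32-bit words written */
}
```

This also shows why table sorting matters: sorting brings identical rows together, which lengthens the all-zero and all-one runs that the fill words exploit.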
@Pagh Oh! And if you are thinking about the Zukowski et al. paper with Peter Boncz on Super-Scalar RAM-CPU Cache Compression, then no, their technique does not involve any form of run-length encoding. It is mostly dictionary coding. So it is probably not applicable to bitmaps.
@Daniel I agree completely that the compression/decompression of bitmaps is likely to be the practical bottleneck of what we propose. We crucially rely on compression that is close to the best information-theoretic bounds. I know that Peter Boncz and co-authors have some extremely fast compression methods, but I never got around to checking whether they compress 0-1 sequences optimally. Do you know if people have tried to use multiple cores to do better compression?
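As a rough way to see how much headroom there is: a 0-1 sequence of length n with k ones needs at least about n·H(k/n) bits, where H is the binary entropy. A small sketch with made-up numbers:

```c
#include <math.h>
#include <stdio.h>

/* Back-of-the-envelope check: a 0-1 sequence of length n with k ones
 * needs at least roughly n*H(k/n) bits, where H is the binary entropy.
 * The figures below are made up, purely for illustration. */

static double entropy_bound_bits(double n, double k) {
    if (k <= 0.0 || k >= n) return 0.0;   /* degenerate: all zeros or all ones */
    double p = k / n;
    return n * (-p * log2(p) - (1.0 - p) * log2(1.0 - p));
}

int main(void) {
    double n = 1e8;   /* 100 million bits */
    double k = 1e6;   /* 1% of them set */
    printf("entropy bound : %.2f MB\n", entropy_bound_bits(n, k) / 8.0 / 1e6);
    printf("uncompressed  : %.2f MB\n", n / 8.0 / 1e6);
    return 0;
}
```

Whether the very fast scalar schemes get close to this bound on sparse 0-1 inputs is exactly the open question above.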
The folklore even exists in Oracle docs, although the Data Warehousing guide gets it right — http://download.oracle.com/docs/cd/E11882_01/server.112/e10810/indexes.htm#DWHSG006. By ‘right’ I mean that they state the important factor is the ratio of distinct keys to the number of rows.
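To make the heuristic concrete (the figures and thresholds here are made up, not Oracle's):

```c
#include <stdio.h>

/* Toy illustration of the heuristic above: what matters is the number
 * of distinct keys relative to the number of rows, not the number of
 * distinct keys in absolute terms.  Example figures are made up. */

static double cardinality_ratio(double distinct_keys, double rows) {
    return distinct_keys / rows;
}

int main(void) {
    /* 10,000 distinct values sounds like "high cardinality", but over a
     * billion rows the ratio is 0.00001 -- a reasonable bitmap candidate. */
    printf("%.5f\n", cardinality_ratio(10000.0, 1e9));
    /* The same 10,000 distinct values over 20,000 rows is a poor fit. */
    printf("%.5f\n", cardinality_ratio(10000.0, 20000.0));
    return 0;
}
```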
I worked on the bitmap code at Oracle and did my best to fix the folklore internally. But change is hard.
Gennady Antoshenkov, the person who created BBC, did a remarkable job. I read a few of the WAH papers and my impression is that they were more an update of BBC for modern computer architectures than something new. But maybe I am biased given my appreciation for Antoshenkov's work.