Daniel Lemire's blog


More database compression means more speed? Right?

9 thoughts on “More database compression means more speed? Right?”

  1. Greg says:

    Google’s compression is probably worth a mention here too. They tend to use very fast, lightweight compression as well (e.g. Zippy).

    Some of it is described in their papers (a small section in the Bigtable paper) and talks (small mentions of it in Jeff Dean’s talks, such as his Bigtable talk or his recent LADIS 2009 talk).
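
    A minimal sketch of the trade-off Greg is pointing at, using only Python's standard library: zlib at its fastest setting stands in for a lightweight codec in the Zippy family, lzma stands in for a heavyweight one, and the input is synthetic, repetitive, database-like text (all of which are assumptions for illustration, not Google's actual setup).

    ```python
    # Rough benchmark: lightweight vs. heavyweight compression on synthetic,
    # repetitive, database-like text. zlib level 1 is only a stand-in for a
    # truly lightweight codec such as Zippy; lzma stands in for the heavy end.
    import time
    import zlib
    import lzma

    data = b"customer:ACME;country:US;status:SHIPPED;qty:0042\n" * 350_000  # ~17 MB

    def bench(name, compress, decompress):
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        decompress(packed)
        t2 = time.perf_counter()
        ratio = len(data) / len(packed)
        print(f"{name}: {ratio:4.1f}x, compress {t1 - t0:.2f}s, decompress {t2 - t1:.2f}s")

    bench("zlib -1 (light)", lambda d: zlib.compress(d, 1), zlib.decompress)
    bench("lzma    (heavy)", lzma.compress, lzma.decompress)
    ```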

  2. @Greg

    Thanks Greg. This is the kind of comment that makes blogging so profitable.

    Shame on me: I did not even think of BigTable, and I don’t know anything about their compression techniques… more reading for me… I hope to blog about it in the future.

  3. Neil Conway says:

    As we have more CPU cores, we also have more bandwidth to bring data to the cores.

    There is no reason for that to be true: advances in processor technology often follow a different curve than advances in memory architectures. It may well be the case that many-core architectures in the future are increasingly bandwidth-constrained: once data reaches a core, computation cycles are cheap, but data movement into / out of cores might be relatively expensive.

  4. @Conway

    The reason for this to be true is stated in my post: otherwise, CPU cores would be constantly data-starved in most multimedia and business applications. Intel will not mass-produce CPUs unless the technology to keep them busy with mainstream applications is out there.

    Disclaimer: you can always find special cases, and nobody can predict the future.

  5. Neil Conway says:

    Well, it may well be the case that “CPU cores will be constantly data-starved” in the future, certainly for many run-of-the-mill business applications.

    In the recent past, most superscalar chips had only a very limited ability to extract instruction-level parallelism from most programs — that didn’t stop Intel from mass-producing those chips, even though they were utilized relatively inefficiently. Those chips (e.g. Pentium IV) were still useful, despite the inefficiency. Similarly, manycore chips may still be useful, even if they are relatively bandwidth-starved — which would make communication-intelligent designs increasingly important.

  6. @Conway

    Sure. It could happen that in the future the cores will be constantly data-starved. Nobody can predict the future. But it is not the case right now. Lightweight compression has outperformed heavy compression within databases for years, if not decades.

    (Some papers have claimed that databases are I/O bound. They have just not convinced me, nor the database industry.)
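
    To make "lightweight" concrete: the schemes that win inside databases are things like dictionary coding, run-length encoding and bit packing, which barely need decompressing at all. Here is a minimal sketch, over a hypothetical sorted status column, of how a query can run directly on run-length-encoded data:

    ```python
    # A minimal sketch of lightweight, column-oriented compression:
    # run-length encoding (RLE) of a sorted, low-cardinality column.
    # The point is that a query can work on the compressed representation
    # directly, so "decompression" costs essentially nothing.
    from itertools import groupby

    def rle_encode(column):
        """Return (value, run_length) pairs for consecutive identical values."""
        return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

    def count_equal(runs, target):
        """Count rows equal to `target` without materializing the column."""
        return sum(length for value, length in runs if value == target)

    # Hypothetical sorted status column with a handful of distinct values.
    column = ["BACKORDER"] * 10_000 + ["PENDING"] * 250_000 + ["SHIPPED"] * 740_000
    runs = rle_encode(column)

    print(len(column), "rows stored as", len(runs), "runs")
    print("SHIPPED rows:", count_equal(runs, "SHIPPED"))
    ```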

  7. Glenn Davis says:

    Ready for a radical idea? Too bad, here it is anyway.

    Having to categorize one’s compression method as being lightweight or heavyweight suggests to me that the method just isn’t very good or appropriate for the data. Good methods do a good job of data modeling, and with really good data modeling the otherwise-competing performance goals of speed and size can go hand in hand. To me, in the case of structured databases, good data modeling means multidimensional data modeling; unfortunately, nearly all the methods now being used are inherently one-dimensional and, predictably, wind up requiring compromise.

    I say that after having developed compression software that achieves almost 8-to-1 compression of the TPC-H lineitem table, a common benchmark in the DBMS world. That is far beyond published results from Oracle and IBM, and it demonstrates how much better, and more appropriate, multidimensional data modeling is when one is dealing with multidimensional data.

  8. @Glenn

    The type of compression ratio you are referring to is already possible using publicly available algorithms:

    V. Raman and G. Swart. How to wring a table dry: Entropy compression of relations and querying of compressed relations. In VLDB, 2006.

  9. Glenn Davis says:

    That’s a super paper and a good start in the right direction! Those results, although measured on entropy-reduced data, illustrate the power of multidimensional approaches and the goal alignment (speed with size) one can get from good models.

    Their paper contains a footnote that I found not only intriguing but suggestive of a direction to go to reach the next level of performance:

    “By modeling the tuple sources as i.i.d., we lose the ability to exploit inter-tuple correlations. To our knowledge, no one has studied such correlations in databases – all the work on correlations has been among fields within a tuple. If inter-tuple correlations are significant, the information theory literature on compression of non zero-order sources might be applicable.”

    Funny they should say that! The answer is both yes and no. Yes, inter-tuple correlations can and should be exploited to compress structured data; I led a team that did that with great success some 20 years ago. And no, we found the information theory literature to be irrelevant. Information theory concerns encoding modeled data, not the design of the data models themselves. That is where the challenges and benefits lie.
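
    This is not the commenter's system, just a toy sketch with made-up rows showing why inter-tuple correlations matter even to a generic codec: the very same tuples compress much better once they are sorted so that similar rows become neighbors.

    ```python
    # Toy demonstration of inter-tuple (row-to-row) correlation: identical row
    # sets compress very differently depending on whether similar rows are
    # adjacent, because zlib can then exploit the redundancy between neighbors.
    import random
    import zlib

    random.seed(0)
    rows = [f"{random.choice(['US','DE','FR','JP'])},"
            f"{random.choice(['AIR','RAIL','SHIP'])},"
            f"{random.randint(1, 50):02d}\n"
            for _ in range(200_000)]

    unsorted_blob = "".join(rows).encode()
    sorted_blob = "".join(sorted(rows)).encode()

    print("unsorted:", len(zlib.compress(unsorted_blob, 6)), "bytes")
    print("sorted:  ", len(zlib.compress(sorted_blob, 6)), "bytes")
    ```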