Daniel Lemire's blog


How should you build a high-performance column store for the 2020s?

8 thoughts on “How should you build a high-performance column store for the 2020s?”

  1. Priyanka says:

    Hey!
    That’s a really nice post!

    Also, check out this link for blog posts on data science, programming, and more: https://www.scholarspro.com/blog/

  2. hh says:
  3. Andy Pavlo says:

    You’re forgetting query codegen + compilation. That makes a big difference in runtime performance.
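A toy sketch of the codegen idea (hypothetical example; the expression tree, `interpret`, and `codegen` names are illustrative, and Python's `exec` stands in for real machine-code generation): instead of walking an expression tree for every row, the engine emits one specialized function, compiled once, and runs it over all rows.

```python
# Toy illustration of query codegen vs. interpretation (hypothetical example).
# An interpreter dispatches on the expression tree for every row; codegen
# translates the tree to source once and compiles a specialized predicate.

# Expression tree for: col_a > 10 AND col_b < 5
expr = ("and", (">", "col_a", 10), ("<", "col_b", 5))

def interpret(expr, row):
    """Evaluate the expression tree for a single row (per-row dispatch)."""
    op = expr[0]
    if op == "and":
        return interpret(expr[1], row) and interpret(expr[2], row)
    if op == ">":
        return row[expr[1]] > expr[2]
    if op == "<":
        return row[expr[1]] < expr[2]
    raise ValueError(op)

def codegen(expr):
    """Translate the tree to Python source once, then compile it."""
    def emit(e):
        if e[0] == "and":
            return f"({emit(e[1])} and {emit(e[2])})"
        return f"(row[{e[1]!r}] {e[0]} {e[2]!r})"
    ns = {}
    exec(f"def predicate(row): return {emit(expr)}", ns)
    return ns["predicate"]

rows = [{"col_a": a, "col_b": a % 7} for a in range(100)]
predicate = codegen(expr)
interpreted = [r for r in rows if interpret(expr, r)]
compiled = [r for r in rows if predicate(r)]
assert interpreted == compiled
```

The compiled predicate does the same work with no tree-walking overhead per row; real engines take this further by emitting machine code (e.g., via LLVM) rather than Python source.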

  4. It looks like the link “FastPFor Java” is pointing to the C++ library (it is the same as the C++ link). Is that intentional?

    1. It wasn’t. I appreciate the correction.

  5. Oren Tirosh says:

    The Blosc meta-compression engine (blosc.org) is widely used in data science. It supports multiple codecs as well as pre-compression transforms that greatly improve compression ratios for many types of data.

    The HDF group supports Blosc. Perhaps Arrow should consider it, too?

    https://www.hdfgroup.org/2016/02/the-blosc-meta-compressor/
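The pre-compression transforms Oren mentions can be sketched in a few lines (a toy byte-shuffle over 4-byte integers, paired with zlib as the codec; Blosc itself does this in blocked, vectorized form):

```python
# Toy byte-shuffle filter (the kind of pre-compression transform Blosc uses).
# Storing the k-th byte of every 4-byte value together turns slowly varying
# integers into long runs that a generic codec compresses much better.
import struct
import zlib

values = list(range(100_000))                   # a slowly varying column
raw = struct.pack(f"<{len(values)}I", *values)  # 4-byte little-endian ints

def shuffle(buf, itemsize=4):
    """Group byte 0 of every item, then byte 1, etc. (a byte transpose)."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))

def unshuffle(buf, itemsize=4):
    """Invert the transform on decompression."""
    n = len(buf) // itemsize
    planes = [buf[k * n:(k + 1) * n] for k in range(itemsize)]
    return bytes(b for i in range(n) for b in (p[i] for p in planes))

plain = zlib.compress(raw, 9)
shuffled = zlib.compress(shuffle(raw), 9)
assert unshuffle(shuffle(raw)) == raw
assert len(shuffled) < len(plain)   # the transform pays off on this data
```

On this column the high bytes are nearly constant, so after shuffling they become long runs of identical bytes and the compressed output shrinks noticeably.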

  6. I agree with Andy that runtime code generation and hot-spot JIT compilation to machine code make a big difference. We’re hitting the limits of what’s possible with interpretation.

    Another often-overlooked aspect is logical query optimization. Improvements in the execution layer and storage can give you a 10x improvement, but picking the wrong plan due to missing or poor logical optimization can cost you 100x-1000x.

    One final aspect I want to mention is that there is a tendency to over-optimize for read/scan-only use cases. An increasingly important requirement is the ability to build the storage format quickly and to maintain it efficiently (e.g., with in-place updates).
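The plan-quality point can be made concrete with a toy example (hypothetical tables and a deliberately naive join; the 100x factor here follows from the chosen table sizes): pushing a selective filter below a join shrinks the work by orders of magnitude, regardless of how fast the execution layer is.

```python
# Toy illustration of logical optimization: predicate pushdown below a join.
# Row comparisons stand in for work done by the execution layer.

orders = [{"id": i, "cust": i % 1000} for i in range(10_000)]
customers = [{"cust": c, "vip": c < 10} for c in range(1000)]

def nested_loop_join(left, right, key):
    """Naive nested-loop join that counts comparisons performed."""
    out, comparisons = [], 0
    for l in left:
        for r in right:
            comparisons += 1
            if l[key] == r[key]:
                out.append({**l, **r})
    return out, comparisons

# Bad plan: join everything, then filter on vip.
joined, bad_cost = nested_loop_join(orders, customers, "cust")
bad_result = [row for row in joined if row["vip"]]

# Good plan: push the vip filter below the join.
vips = [c for c in customers if c["vip"]]
good_result, good_cost = nested_loop_join(orders, vips, "cust")

assert bad_result == good_result
assert bad_cost == 100 * good_cost  # same answer, 100x less work
```

The same answer comes out of both plans; only the amount of work differs, which is why no amount of codegen or storage tuning can rescue a bad plan.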

  7. Allan Wind says:

    “useful top” should be “useful to” in case you want to fix a 3 year old post.