Daniel Lemire's blog


How should you build a high-performance column store for the 2020s?

8 thoughts on “How should you build a high-performance column store for the 2020s?”

  1. Priyanka says:

    Hey!
    That’s a really nice post!

    Also, check out this link for blog posts on data science, programming, and more: https://www.scholarspro.com/blog/

  2. hh says:
  3. Andy Pavlo says:

    You’re forgetting query codegen + compilation. That makes a big difference in runtime performance.
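A toy sketch of the codegen idea (hypothetical example; the expression tree, `interpret`, and `codegen` names are illustrative, and Python's `exec` stands in for real machine-code generation): instead of walking an expression tree for every row, the engine emits one specialized function, compiled once, and runs it over all rows.

```python
# Toy illustration of query codegen vs. interpretation (hypothetical example).
# An interpreter dispatches on the expression tree for every row; codegen
# translates the tree to source once and compiles a specialized predicate.

# Expression tree for: col_a > 10 AND col_b < 5
expr = ("and", (">", "col_a", 10), ("<", "col_b", 5))

def interpret(expr, row):
    """Evaluate the expression tree for a single row (per-row dispatch)."""
    op = expr[0]
    if op == "and":
        return interpret(expr[1], row) and interpret(expr[2], row)
    if op == ">":
        return row[expr[1]] > expr[2]
    if op == "<":
        return row[expr[1]] < expr[2]
    raise ValueError(op)

def codegen(expr):
    """Translate the tree to Python source once, then compile it."""
    def emit(e):
        if e[0] == "and":
            return f"({emit(e[1])} and {emit(e[2])})"
        return f"(row[{e[1]!r}] {e[0]} {e[2]!r})"
    ns = {}
    exec(f"def predicate(row): return {emit(expr)}", ns)
    return ns["predicate"]

rows = [{"col_a": a, "col_b": a % 7} for a in range(100)]
predicate = codegen(expr)
interpreted = [r for r in rows if interpret(expr, r)]
compiled = [r for r in rows if predicate(r)]
assert interpreted == compiled
```

The compiled predicate does the same work with no tree-walking overhead per row; real engines take this further by emitting machine code (e.g., via LLVM) rather than Python source.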

  4. It looks like the link “FastPFor Java” is pointing to the C++ library (it is the same as the C++ link). Is that intentional?

    1. It wasn’t. I appreciate the correction.

  5. Oren Tirosh says:

    The Blosc meta-compression engine (blosc.org) is widely used in data science. It supports multiple codecs as well as pre-compression transforms that greatly improve compression ratios for many types of data.

    The HDF group supports Blosc. Perhaps Arrow should consider it, too?

    https://www.hdfgroup.org/2016/02/the-blosc-meta-compressor/
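The pre-compression transforms Oren mentions can be sketched in a few lines (a toy byte-shuffle over 4-byte integers, paired with zlib as the codec; Blosc itself does this in blocked, vectorized form):

```python
# Toy byte-shuffle filter (the kind of pre-compression transform Blosc uses).
# Storing the k-th byte of every 4-byte value together turns slowly varying
# integers into long runs that a generic codec compresses much better.
import struct
import zlib

values = list(range(100_000))                   # a slowly varying column
raw = struct.pack(f"<{len(values)}I", *values)  # 4-byte little-endian ints

def shuffle(buf, itemsize=4):
    """Group byte 0 of every item, then byte 1, etc. (a byte transpose)."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))

def unshuffle(buf, itemsize=4):
    """Invert the transform on decompression."""
    n = len(buf) // itemsize
    planes = [buf[k * n:(k + 1) * n] for k in range(itemsize)]
    return bytes(b for i in range(n) for b in (p[i] for p in planes))

plain = zlib.compress(raw, 9)
shuffled = zlib.compress(shuffle(raw), 9)
assert unshuffle(shuffle(raw)) == raw
assert len(shuffled) < len(plain)   # the transform pays off on this data
```

On this column the high bytes are nearly constant, so after shuffling they become long runs of identical bytes and the compressed output shrinks noticeably.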

  6. I agree with Andy that runtime code generation and hot-spot JIT compilation to machine code make a big difference. We’re hitting the limits of what’s possible with interpretation.

    Another often-overlooked aspect is logical query optimization. Improvements in the execution layer and storage can give you a 10x improvement, but picking the wrong plan due to missing or poor logical optimization can cost you 100x-1000x.

    One final aspect I want to mention is that there is a tendency to over-optimize for read/scan-only use cases. An increasingly important requirement is the ability to build the storage format quickly and to maintain it efficiently (e.g., with in-place updates).
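The plan-quality point can be made concrete with a toy example (hypothetical tables and a deliberately naive join; the 100x factor here follows from the chosen table sizes): pushing a selective filter below a join shrinks the work by orders of magnitude, regardless of how fast the execution layer is.

```python
# Toy illustration of logical optimization: predicate pushdown below a join.
# Row comparisons stand in for work done by the execution layer.

orders = [{"id": i, "cust": i % 1000} for i in range(10_000)]
customers = [{"cust": c, "vip": c < 10} for c in range(1000)]

def nested_loop_join(left, right, key):
    """Naive nested-loop join that counts comparisons performed."""
    out, comparisons = [], 0
    for l in left:
        for r in right:
            comparisons += 1
            if l[key] == r[key]:
                out.append({**l, **r})
    return out, comparisons

# Bad plan: join everything, then filter on vip.
joined, bad_cost = nested_loop_join(orders, customers, "cust")
bad_result = [row for row in joined if row["vip"]]

# Good plan: push the vip filter below the join.
vips = [c for c in customers if c["vip"]]
good_result, good_cost = nested_loop_join(orders, vips, "cust")

assert bad_result == good_result
assert bad_cost == 100 * good_cost  # same answer, 100x less work
```

The same answer comes out of both plans; only the amount of work differs, which is why no amount of codegen or storage tuning can rescue a bad plan.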

  7. Allan Wind says:

    “useful top” should be “useful to” in case you want to fix a 3 year old post.