One claim that caught my attention was compressibility in column-oriented databases: column stores tend to compress very well, significantly increasing effective IO bandwidth (x bytes read from disk translate to >> x bytes of actual data). Since most DBs are IO bound, this turns out to provide a big real-world performance advantage. What do you think?
@Parand
Yes, but “most DBs are IO bound” is not the entire explanation. Here are two finer points:
A) It is not all that true.
On this blog (search for it), I have run experiments showing that parsing CSV files is easily CPU bound. Of course, you have to define properly what “parsing” means… I mean here finding the strings, then copying them into some data structure.
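To make the definition concrete, here is a minimal sketch (not a full CSV parser: no quoting or escaping) of what “parsing” means in this sense: find the strings on each line and copy them into a data structure. Even this simple loop is pure CPU work that scales with the volume of data.

```python
def parse_csv(text):
    """Split raw CSV text into a list of rows of string fields.
    Sketch only: assumes no quoted fields or embedded commas."""
    rows = []
    for line in text.splitlines():
        if line:  # skip blank lines
            rows.append(line.split(","))
    return rows

rows = parse_csv("a,b,c\n1,2,3\n")
# rows == [["a", "b", "c"], ["1", "2", "3"]]
```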
That is why databases use relatively “cheap” compression techniques. Going out of your way to squeeze the data down might be counterproductive. Compression is not everything.
Thankfully, column-oriented designs allow “cheap” compression techniques to work well. Basically, you sort the data (a relatively cheap operation) and then you apply run-length encoding.
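As a rough sketch of the sort-then-encode idea (using only the standard library; the data is made up for illustration):

```python
from itertools import groupby

def rle(values):
    """Run-length encode a sequence as [(value, run_length), ...]."""
    return [(v, len(list(g))) for v, g in groupby(values)]

column = [2, 1, 2, 1, 1, 2, 2, 1, 2]
column.sort()           # cheap; brings equal values together
encoded = rle(column)   # [(1, 4), (2, 5)]
```

Sorting is what makes the encoding effective: without it, the runs are short and the “compressed” form can even be larger than the original.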
B) Compression is not only about reducing IO costs.
As an example, is it faster to compute the sum of:
111122222
or
4×1, 5×2
?
Clearly, it is faster to compute the sum of the “compressed” array. So compression can also save CPU cycles when *you are operating directly over the compressed data stream*. Whenever you need to load the data into RAM, then uncompress it, and then work over the uncompressed data, you have to worry that you will saturate your memory bandwidth.
Indeed, compressibility seemed to be a major advantage they observed in using a column store vs. Hadoop. It wasn’t at all clear to me why that was / should be the case.