“from one byte to eight byte”: should this read in the opposite order, “from eight byte to one byte”?
You are correct. Thank you.
Interesting. The default compression level of zstd seems to be only 3 out of 22, though. And brotli can squeeze even more out of it, although at a high price in CPU time.
Here are some more results for all levels of zstd, brotli and gzip:
https://github.com/evelance/generic-number-compression
Somehow, brotli manages to get the file size down to 44% for the float32-in-double file:
Versions: gzip 1.12, brotli 1.0.9, zstd 1.5.2 (columns: compressor and level, time, compressed size in bytes, compressed size as a percentage of the original)
Checking compression for file 'testfloat.dat'
00000000: 0000 00c0 128d e63f 0000 00a0 f321 cf3f
00000010: 0000 00a0 2580 eb3f 0000 00e0 012a ea3f
gzip-1 0.14s 4497086 56.21%
gzip-5 0.26s 4217599 52.72%
gzip-9 5.50s 4093342 51.17%
brotli-0 0.02s 4835457 60.44%
brotli-5 0.31s 4045934 50.57%
brotli-9 1.55s 3986579 49.83%
brotli-11 10.15s 3517421 43.97%
zstd-1 0.02s 4508213 56.35%
zstd-3 0.04s 4190227 52.38%
zstd-8 0.17s 3878348 48.48%
zstd-16 1.56s 3754120 46.93%
zstd-22 2.31s 3755501 46.94%
Checking compression for file 'testint.dat'
00000000: 7e00 0000 0000 0000 f1ff ffff ffff ffff
00000010: 2200 0000 0000 0000 2100 0000 0000 0000
gzip-1 0.06s 1896180 23.70%
gzip-5 0.15s 1675779 20.95%
gzip-9 7.20s 1519492 18.99%
brotli-0 0.01s 1743049 21.79%
brotli-5 0.15s 1523142 19.04%
brotli-9 0.48s 1521837 19.02%
brotli-11 9.44s 1234645 15.43%
zstd-1 0.02s 1593200 19.91%
zstd-3 0.02s 1656052 20.70%
zstd-8 0.12s 1675177 20.94%
zstd-16 1.56s 1323872 16.55%
zstd-22 2.67s 1297221 16.22%
I am currently pondering the implementation of a time-series database for tiny embedded devices, and simply compressing a list of appropriately sized (delta) values yields pretty good results 🙂
By the way, can you recommend a good compression algorithm for uint32 timestamp values that are increasing or strictly increasing? A pointer in the right direction would be greatly appreciated.
The Alchemist says:
Depends how extreme you wanna get, and what the requirements are. 🙂 Like, do you need random access? What language?
sux4j, a Java package, has a large list of data structures for this kind of thing that provide close to the information-theoretical lower bound, like the Elias-Fano encoding. There’s a C++ implementation from Facebook (https://github.com/facebook/folly/blob/main/folly/experimental/EliasFanoCoding.h). You mentioned embedded, so that’s why I threw the C++ lib in there. I bet there’s a C implementation out there too.
Also, check out https://pdal.io/en/stable/, a LIDAR compression tool.
Thanks a lot for the links!
As for the requirements: the code size is ~1MB, RAM is ~16MB, and the database is capped at 10MB. So, actually tiny 🙂
Random indexing and performance are generally not important; the database just needs to compress as many IoT data points as possible (our test data has ~60M data points) and deliver time segments of values for a graph.
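As a concrete illustration of the delta approach mentioned above (a minimal sketch, not code from the thread; the function name is made up), increasing uint32 timestamps can be stored as gaps from the previous value, each gap written as a 7-bits-per-byte varint:

#include <cstdint>
#include <vector>

// Sketch: delta-encode increasing uint32 timestamps and write each gap as a
// LEB128-style varint (7 payload bits per byte, high bit = "more bytes follow").
// The first timestamp is stored as a gap from zero; a decoder accumulates the
// 7-bit groups until it sees a byte with the high bit clear, then adds the gap
// to the previous timestamp.
std::vector<uint8_t> encode_timestamps(const std::vector<uint32_t>& ts) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t t : ts) {
        uint32_t gap = t - prev;  // small for dense, increasing timestamps
        prev = t;
        while (gap >= 0x80) {
            out.push_back(static_cast<uint8_t>((gap & 0x7F) | 0x80));
            gap >>= 7;
        }
        out.push_back(static_cast<uint8_t>(gap));
    }
    return out;  // this byte stream can additionally be fed to zstd or brotli
}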
Maks Verver says:
You are mistaken that the optimal compression for random floats as doubles should be “roughly equal to two”.
The entropy of IEEE floats between 0 and 1 is (slightly under) 25 bits: 23 bits for the mantissa, plus 2 bits for the exponent, using Shannon’s formula for entropy.
So that means that random floats should already be compressible to a ratio of 25/32 = 0.78125 = ~78% and when converted to doubles, the optimal compression ratio is 25/64 = 0.390625 = ~39%, not 50%. That explains why Stefan was able to compress these files in much less than 50%, which would have been theoretically impossible otherwise.
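For reference, the 2-bit exponent figure can be checked with Shannon’s formula: a uniform value in [0, 1) lands in the binade [2^{-k}, 2^{-k+1}) with probability 2^{-k}, so

H_{\text{exponent}} = \sum_{k=1}^{\infty} 2^{-k} \log_2 2^{k} = \sum_{k=1}^{\infty} k\,2^{-k} = 2 \text{ bits}, \qquad H_{\text{total}} \approx 23 + 2 = 25 \text{ bits}.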
You are mistaken that the optimal compression for random floats as doubles should be “roughly equal to two”.
Your computation suggests that the exact answer is ~2.6x.
FYI.
Lossless compressor and decompressor for numerical data using quantiles
https://github.com/mwlon/quantile-compression
Hi Daniel, interesting experiment, but I guess you’ll agree the results cannot be generalized, as they depend dramatically on the details of the data.
For example, the floats you’re generating are not really random, as you’re using a very small subset of the mantissa domain (half of your values will have the exact same mantissa), making every 8 bytes you generate have 4 identical bytes, and there are only 8 versions of the 5-byte patterns there. zstd can surely compress that very well; with “-9” you’ll even get under your 50% threshold.
Similarly, if you change your integer domain to [0, 255], suddenly you get to 13% compression, because you only generate 7-byte sequences of 00s, not both 00s and FFs.
In general, you’re right, it’s easy to generate data distributions where zstd will lose badly to specific encodings. On the flip side, for any of these encodings, there will be distributions where zstd will crush it 🙂
Side note: zstd for me is a true revolution in compression tech – the compression ratios and speed it provides make most of the general-purpose alternatives obsolete, IMO.
Fun!
Hi Daniel, interesting experiment, but I guess you’ll agree the results cannot be generalized, as they depend dramatically on the details of the data.
I think we are in agreement.
I expect a codec like zstd to often be within a factor of two of a reasonable information-theoretic bound when doing data engineering work. And it is often fast enough. There are specific instances, and these instances are important, where you can do much better (better compression, better speed)… and I care a lot about these instances… but if you just have generic data, then using something like zstd will be good enough, meaning that the engineering work needed to do better will not be worth the effort.
the floats you’re generating are not really random, as you’re using a very small subset of the mantissa domain (half of your values will have the exact same mantissa)
Am I? The code is meant to generate random numbers between 0 and 1…
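std::random_device rd;
std::default_random_engine eng(rd());
std::uniform_real_distribution<float> distr(0.0f, 1.0f);
for (size_t i = 0; i < N; i++) {
    x[i] = distr(eng);
}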
Admittedly, not all floats fall in this interval… Only about a quarter of them… so I expect slightly less than 30 bits of randomness per float…
Looking at the raw data, I do not see that half of the mantissas have the exact same value… maybe I misunderstand what you meant?
Similarly, if you change your integer domain to [0, 255], suddenly you get to 13% compression, because you only generate 7-byte sequences of 00s, not both 00s and FFs.
In that scenario, we get roughly an 8x compression ratio, so effectively as good as it gets. When I built my example, I deliberately used a signed value because I think it is more impressive that you can get a 5x compression ratio with signed values !!!
I was neither trying to set zstd up for a fall nor trying to make it look good.
Marcin Zukowski says:
Argh, silly me, I meant exponent, not mantissa – half of the values will fall in the range [0.5, 1), which will have the same mantissa 🙂 Sorry, I was typing this late.
Marcin Zukowski says:
Ha, you see, my brain does it again. “which will have the same exponent”!
Smart people make mistakes and yet the world does not fall apart.
Maks Verver says:
(Sorry, I posted this comment in reply to Stefan above, but it’s more appropriate here. Please feel free to delete my earlier reply.)
The entropy of IEEE floats between 0 and 1 is (slightly under) 25 bits: 23 bits for the mantissa, plus 2 bits for the exponent, using Shannon’s formula for entropy. So that means that random floats should already be compressible to a ratio of 25/32 = 0.78125 = ~78% and when converted to doubles, the optimal compression ratio is 25/64 = 0.390625 = ~39%, not 50%. That explains why Stefan was able to compress these files in much less than 50%, which would have been theoretically impossible otherwise.
Antoine says:
It should be noted that zstd and type-specific compression are not mutually exclusive.
For example, the “byte stream split” encoding recently added to the Parquet format provides a valuable preprocessing step that increases the efficiency of Zstd compression on floating-point data:
https://issues.apache.org/jira/browse/PARQUET-1622
zstd is generic, indeed, and there are definitely preprocessing steps that may help compression.
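As an illustration of that kind of preprocessing (a minimal sketch, not the Parquet implementation), byte stream split transposes the bytes of the values so that the k-th byte of every float lands in its own stream before a generic codec such as zstd runs:

#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of byte-stream-split preprocessing: gather the k-th byte of every
// float into its own stream, so that the highly regular sign/exponent bytes
// end up contiguous. The transposed buffer is then compressed with a generic
// codec such as zstd; reversing the transposition restores the original data.
std::vector<uint8_t> byte_stream_split(const std::vector<float>& values) {
    static_assert(sizeof(float) == 4, "assumes 32-bit IEEE floats");
    const size_t n = values.size();
    std::vector<uint8_t> out(4 * n);
    for (size_t i = 0; i < n; i++) {
        uint8_t bytes[4];
        std::memcpy(bytes, &values[i], 4);  // raw bytes of the float
        for (size_t k = 0; k < 4; k++) {
            out[k * n + i] = bytes[k];      // byte k goes into stream k
        }
    }
    return out;
}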