“from one byte to eight byte”: should this read in the opposite order, “from eight byte to one byte”?
You are correct. Thank you.
Interesting. The default compression level of zstd seems to be only 3 out of 22, though. And brotli can squeeze even more out of it, although at a high price in CPU time.
Here are some more results for all levels of zstd, brotli and gzip:
https://github.com/evelance/generic-number-compression
Somehow, brotli manages to get the file size down to 44% for the float32-in-double file:
Versions: gzip 1.12, brotli 1.0.9, zstd 1.5.2 (columns: compressor and level, time, compressed size in bytes, compressed size as a percentage of the original)
Checking compression for file 'testfloat.dat'
00000000: 0000 00c0 128d e63f 0000 00a0 f321 cf3f
00000010: 0000 00a0 2580 eb3f 0000 00e0 012a ea3f
gzip-1 0.14s 4497086 56.21%
gzip-5 0.26s 4217599 52.72%
gzip-9 5.50s 4093342 51.17%
brotli-0 0.02s 4835457 60.44%
brotli-5 0.31s 4045934 50.57%
brotli-9 1.55s 3986579 49.83%
brotli-11 10.15s 3517421 43.97%
zstd-1 0.02s 4508213 56.35%
zstd-3 0.04s 4190227 52.38%
zstd-8 0.17s 3878348 48.48%
zstd-16 1.56s 3754120 46.93%
zstd-22 2.31s 3755501 46.94%
Checking compression for file 'testint.dat'
00000000: 7e00 0000 0000 0000 f1ff ffff ffff ffff
00000010: 2200 0000 0000 0000 2100 0000 0000 0000
gzip-1 0.06s 1896180 23.70%
gzip-5 0.15s 1675779 20.95%
gzip-9 7.20s 1519492 18.99%
brotli-0 0.01s 1743049 21.79%
brotli-5 0.15s 1523142 19.04%
brotli-9 0.48s 1521837 19.02%
brotli-11 9.44s 1234645 15.43%
zstd-1 0.02s 1593200 19.91%
zstd-3 0.02s 1656052 20.70%
zstd-8 0.12s 1675177 20.94%
zstd-16 1.56s 1323872 16.55%
zstd-22 2.67s 1297221 16.22%
I am currently pondering the implementation of a time-series database for tiny embedded devices, and simply compressing a list of appropriately sized (delta) values yields pretty good results 🙂
By the way, can you recommend a good compression algorithm for uint32 timestamp values that are increasing or strictly increasing? A pointer in the right direction would be greatly appreciated.
The Alchemist says:
Depends how extreme you wanna get, and what the requirements are. 🙂 Like, do you need random access? What language?
sux4j, a Java package, has a large list of data structures for this kind of thing that provide close to the information-theoretical lower bound, like the Elias-Fano encoding. There’s a C++ implementation from Facebook (https://github.com/facebook/folly/blob/main/folly/experimental/EliasFanoCoding.h). You mentioned embedded, so that’s why I threw the C++ lib in there. I bet there’s a C implementation out there too.
Also, check out https://pdal.io/en/stable/, a LIDAR compression tool.
Thanks a lot for the links!
As for the requirements: the code size is ~1MB, RAM is ~16MB, and the database is capped at 10MB. So, actually tiny 🙂
Random indexing and performance are generally not important; the database just needs to compress as many IoT data points as possible (our test data has ~60M data points) and deliver time segments of values for a graph.
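As a concrete illustration of the delta approach mentioned above (a minimal sketch, not code from the thread; the function name is made up), increasing uint32 timestamps can be stored as gaps from the previous value, each gap written as a 7-bits-per-byte varint:

#include <cstdint>
#include <vector>

// Sketch: delta-encode increasing uint32 timestamps and write each gap as a
// LEB128-style varint (7 payload bits per byte, high bit = "more bytes follow").
// The first timestamp is stored as a gap from zero; a decoder accumulates the
// 7-bit groups until it sees a byte with the high bit clear, then adds the gap
// to the previous timestamp.
std::vector<uint8_t> encode_timestamps(const std::vector<uint32_t>& ts) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t t : ts) {
        uint32_t gap = t - prev;  // small for dense, increasing timestamps
        prev = t;
        while (gap >= 0x80) {
            out.push_back(static_cast<uint8_t>((gap & 0x7F) | 0x80));
            gap >>= 7;
        }
        out.push_back(static_cast<uint8_t>(gap));
    }
    return out;  // this byte stream can additionally be fed to zstd or brotli
}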
Maks Verver says:
You are mistaken that the optimal compression for random floats as doubles should be “roughly equal to two”.
The entropy of IEEE floats between 0 and 1 is (slightly under) 25 bits: 23 bits for the mantissa, plus 2 bits for the exponent, using Shannon’s formula for entropy.
So that means that random floats should already be compressible to a ratio of 25/32 = 0.78125 = ~78% and when converted to doubles, the optimal compression ratio is 25/64 = 0.390625 = ~39%, not 50%. That explains why Stefan was able to compress these files in much less than 50%, which would have been theoretically impossible otherwise.
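For reference, the 2-bit exponent figure can be checked with Shannon’s formula: a uniform value in [0, 1) lands in the binade [2^{-k}, 2^{-k+1}) with probability 2^{-k}, so

H_{\text{exponent}} = \sum_{k=1}^{\infty} 2^{-k} \log_2 2^{k} = \sum_{k=1}^{\infty} k\,2^{-k} = 2 \text{ bits}, \qquad H_{\text{total}} \approx 23 + 2 = 25 \text{ bits}.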
You are mistaken that the optimal compression for random floats as doubles should be “roughly equal to two”.
Your computation suggests that the exact answer is ~2.6x.
FYI.
Lossless compressor and decompressor for numerical data using quantiles
https://github.com/mwlon/quantile-compression
Hi Daniel, interesting experiment, but I guess you’ll agree the results cannot be generalized, as they depend dramatically on the details of the data.
For example, the floats you’re generating are not really random, as you’re using a very small subset of the mantissa domain (half of your values will have the exact same mantissa), making every 8 bytes you generate have 4 identical bytes, and there are only 8 versions of the 5-byte patterns there. zstd can surely compress that very well; with “-9” you’ll even get under your 50% threshold.
Similarly, if you change your integer domain to [0, 255], suddenly you get to 13% compression, because you only generate 7-byte sequences of 00s, not both 00s and FFs.
In general, you’re right, it’s easy to generate data distributions where zstd will lose badly to specific encodings. On the flip side, for any of these encodings, there will be distributions where zstd will crush it 🙂
Side note: zstd for me is a true revolution in compression tech – the compression ratios and speed it provides make most of the general-purpose alternatives obsolete, IMO.
Fun!
Hi Daniel, interesting experiment, but I guess you’ll agree the results cannot be generalized, as they depend dramatically on the details of the data.
I think we are in agreement.
I expect a codec like zstd to often be within a factor of two of a reasonable information-theoretic bound when doing data engineering work. And it is often fast enough. There are specific instances, and these instances are important, where you can do much better (better compression, better speed)… and I care a lot about these instances… but if you just have generic data, then using something like zstd will be good enough, meaning that the engineering work needed to do better will not be worth the effort.
the floats you’re generating are not really random, as you’re using a very small subset of the mantissa domain (half of your values will have the exact same mantissa)
Am I? The code is meant to generate random numbers between 0 and 1…
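std::random_device rd;
std::default_random_engine eng(rd());
std::uniform_real_distribution<float> distr(0.0f, 1.0f);
for (size_t i = 0; i < N; i++) {
    x[i] = distr(eng);
}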
Admittedly, not all floats fall in this interval… Only about a quarter of them… so I expect slightly less than 30 bits of randomness per float…
Looking at the raw data, I do not see that half of the mantissas have the exact same value… maybe I misunderstand what you meant?
Similarly, if you change your integer domain to [0, 255], suddenly you get to 13% compression, because you only generate 7-byte sequences of 00s, not both 00s and FFs.
In that scenario, we get roughly an 8x compression ratio, so effectively as good as it gets. When I built my example, I deliberately used a signed value because I think it is more impressive that you can get a 5x compression ratio with signed values !!!
I was neither trying to set zstd up for a fall nor trying to make it look good.
Marcin Zukowski says:
Argh, silly me, I meant exponent, not mantissa – half of the values will fall in the range [0.5, 1), which will have the same mantissa 🙂 Sorry, I was typing this late.
Marcin Zukowski says:
Ha, you see, my brain does it again. “which will have the same exponent”!
Smart people make mistakes and yet the world does not fall apart.
Maks Verver says:
(Sorry, I posted this comment in reply to Stefan above, but it’s more appropriate here. Please feel free to delete my earlier reply.)
The entropy of IEEE floats between 0 and 1 is (slightly under) 25 bits: 23 bits for the mantissa, plus 2 bits for the exponent, using Shannon’s formula for entropy. So that means that random floats should already be compressible to a ratio of 25/32 = 0.78125 = ~78% and when converted to doubles, the optimal compression ratio is 25/64 = 0.390625 = ~39%, not 50%. That explains why Stefan was able to compress these files in much less than 50%, which would have been theoretically impossible otherwise.
Antoine says:
It should be noted that zstd and type-specific compression are not mutually exclusive.
For example, the “byte stream split” encoding recently added to the Parquet format provides a valuable preprocessing step that increases the efficiency of Zstd compression on floating-point data:
https://issues.apache.org/jira/browse/PARQUET-1622
zstd is generic, indeed, and there are definitely preprocessing steps that may help compression.
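As an illustration of that kind of preprocessing (a minimal sketch, not the Parquet implementation), byte stream split transposes the bytes of the values so that the k-th byte of every float lands in its own stream before a generic codec such as zstd runs:

#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of byte-stream-split preprocessing: gather the k-th byte of every
// float into its own stream, so that the highly regular sign/exponent bytes
// end up contiguous. The transposed buffer is then compressed with a generic
// codec such as zstd; reversing the transposition restores the original data.
std::vector<uint8_t> byte_stream_split(const std::vector<float>& values) {
    static_assert(sizeof(float) == 4, "assumes 32-bit IEEE floats");
    const size_t n = values.size();
    std::vector<uint8_t> out(4 * n);
    for (size_t i = 0; i < n; i++) {
        uint8_t bytes[4];
        std::memcpy(bytes, &values[i], 4);  // raw bytes of the float
        for (size_t k = 0; k < 4; k++) {
            out[k * n + i] = bytes[k];      // byte k goes into stream k
        }
    }
    return out;
}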