Daniel Lemire's blog

Compressing JSON: gzip vs zstd

9 thoughts on “Compressing JSON: gzip vs zstd”

  1. Oren Tirosh says:

    Blosc has demonstrated that by using a really fast codec with a modest compression ratio, you can actually speed up processing relative to using uncompressed data, because it relieves the load on the main-memory bandwidth bottleneck. This only helps with a large number of threads, though, not single-CPU performance.

    But if this can be true for DRAM, it can definitely be relevant to disk and network. So while .json.zstd may be good over the internet, I expect .json.lz4 to be beneficial almost always.

    I wonder how fast a tightly coupled lz4-json decoder with an intermediate buffer size optimized for L1 cache can get.
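
    A minimal sketch of the idea (not tuned for L1 at all), assuming the python-lz4 package; the document here is just placeholder data:

        import json
        import lz4.frame  # assumes the python-lz4 package is installed

        # Build a JSON document and compress it with LZ4 (fast, modest ratio).
        doc = {"name": "example", "values": list(range(1000))}
        raw = json.dumps(doc).encode("utf-8")
        packed = lz4.frame.compress(raw)

        # Decompress and parse. With a codec this fast, decompression is
        # usually cheap compared to the JSON parsing itself.
        parsed = json.loads(lz4.frame.decompress(packed))
        print(len(raw), len(packed), parsed == doc)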

    1. Yes, in our experience with ClickHouse, we can improve in-memory processing speed by enabling compression. I have a presentation about it: https://presentations.clickhouse.tech/meetup53/optimizations/

  2. Chris Rorden says:

    gzip is a legacy format with a lot of design choices from a bygone era, while zstd has many inherent advantages, so it is not surprising that zstd dominates the Pareto frontier. However, there are several gz-format tools that vastly outperform gzip for compression and decompression. You note CloudFlare; however, if you are focused on decompression it is worth looking at libdeflate or Intel’s igzip, which has extremely fast decompression on x86-64 machines. Relatedly, for web applications, Google’s Chrome includes a number of optimizations (some of which have found their way into CloudFlare).

  3. ImreSamu says:

    IMHO, the correct link for the scripts is:

    https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2021/06/30

    Related: the PostgreSQL hackers discussing ZSON improvements (May-June 2021):

    https://www.postgresql.org/message-id/flat/4aca1d4c-aa07-c168-bcca-236ec9f04c8d%40dunslane.net#ed814406717fc0915178261de7fd7e4a

    ZSON = “ZSON is a PostgreSQL extension for transparent JSONB compression. Compression is based on a shared dictionary of strings most frequently used in specific JSONB documents (not only keys, but also values, array elements, etc).” https://github.com/postgrespro/zson
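
    Outside PostgreSQL, the shared-dictionary idea can be approximated with zstd's trained dictionaries. A rough sketch, assuming the Python zstandard package and a pile of small, similar JSON documents (the sample data here is made up):

        import json
        import zstandard  # assumes the python-zstandard package

        # Many small, similar JSON documents (hypothetical sample data).
        docs = [
            json.dumps({"id": i, "status": "active" if i % 2 else "archived",
                        "tags": ["alpha", "beta", "gamma"][: i % 3 + 1]}).encode()
            for i in range(2000)
        ]

        # Train a shared dictionary on the samples, then compress each document
        # against it; the dictionary captures the recurring keys and values.
        dictionary = zstandard.train_dictionary(4096, docs)
        compressor = zstandard.ZstdCompressor(dict_data=dictionary)
        decompressor = zstandard.ZstdDecompressor(dict_data=dictionary)

        blob = compressor.compress(docs[0])
        assert decompressor.decompress(blob) == docs[0]
        print(len(docs[0]), len(blob))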

  4. John Boero says:

    Awesome, I love a good compression discussion. Zstd blew my mind, and in fact I read that the zip file format just added support for zstd compression. So many open source options have been blowing away legacy proprietary compression lately – it’s a great time to be alive.

    In addition to Facebook’s Zstd, Google gave us Brotli, which is tuned for compressing web content like HTML and JSON. Brotli is slower than Zstd, but it has a static dictionary based on the most common words and UTF-8 sequences on the internet, and it often compresses my JS/JSON 5-30% better than even Zstd. For example, here is an OpenAPI schema that is 412 KB raw, 28-38 KB with Zstd, and 26 KB with Brotli. Zstd blows Brotli out of the water in speed, but Brotli still compresses better for static compression of text like this.

    412K test.json
    26K test.json.br
    35K test.json.gz
    38K test.json.zst (default level)
    28K test.json.zst (max level 19)
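
    A rough way to reproduce this kind of comparison, assuming the Python zstandard and brotli packages and a local test.json (the path is a placeholder):

        import gzip
        import brotli      # assumes the Brotli bindings package
        import zstandard   # assumes the python-zstandard package

        data = open("test.json", "rb").read()  # placeholder path

        sizes = {
            "raw": len(data),
            "gzip -9": len(gzip.compress(data, compresslevel=9)),
            "zstd -3 (default)": len(zstandard.ZstdCompressor(level=3).compress(data)),
            "zstd -19": len(zstandard.ZstdCompressor(level=19).compress(data)),
            "brotli -11 (default)": len(brotli.compress(data)),
        }
        for name, size in sizes.items():
            print(f"{name:>22}: {size} bytes")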

  5. Cip says:

    For completeness, can you add the speed benchmarks for compression as well?

  6. Joe Duarte says:

    Dan, which application did you use for gzip compression? This matters because it affects the results. You mentioned that web servers use gzip; note that the application/library they actually use is zlib. However, the command-line application that comes with many Linux and Unix-based OSes is GNU Gzip. I can’t tell from your script which of these you used, but I’m guessing it was GNU Gzip, which is not what web servers use.

    This is GNU Gzip: https://www.gnu.org/software/gzip/

    This is zlib: https://github.com/madler/zlib

    They’re totally different projects and codebases. The zlib library can generate three different formats: deflate, gzip, and the official zlib format. They all use deflate, but the headers are different. Web servers like Apache and nginx generate gzip files using the zlib library.
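
    With Python’s zlib bindings, the three containers are selected with the wbits parameter; a small illustration of the same deflate stream in different wrappers:

        import zlib

        data = b'{"hello": "world", "value": 42}' * 100

        # Same deflate compression, three different containers:
        #   wbits = -15 -> raw deflate (no header, no checksum)
        #   wbits =  15 -> zlib format (2-byte header, Adler-32 trailer)
        #   wbits =  31 -> gzip format (10-byte header, CRC-32 trailer)
        for name, wbits in [("raw deflate", -15), ("zlib", 15), ("gzip", 31)]:
            c = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
            out = c.compress(data) + c.flush()
            print(f"{name:>12}: {len(out)} bytes")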

    Also, the compression levels you used are unclear. You didn’t specify any in your script, so it would be the defaults. There’s nothing special or authoritative about the defaults for benchmarking purposes, so it would be worth trying at least a few levels. I have no idea what the gzip default compression level is for either GNU Gzip or zlib, but for Zstd it’s level 3. That’s out of 22 possible levels, so it’s near the lowest ratio Zstd produces. The difference between your gzip and Zstd results would likely be much greater if you tried higher levels, since Zstd improves dramatically as you go up the levels, whereas gzip generally doesn’t improve much beyond level 6 (out of 9).
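
    For what it’s worth, a quick level sweep (assuming the Python zstandard package and a local test.json as a placeholder) makes the gap easy to see:

        import gzip
        import zstandard  # assumes the python-zstandard package

        data = open("test.json", "rb").read()  # placeholder path

        # gzip tops out at level 9 and gains little past level 6;
        # zstd keeps improving (more slowly) all the way up to level 22.
        for level in (1, 6, 9):
            print(f"gzip -{level}: {len(gzip.compress(data, compresslevel=level))} bytes")
        for level in (1, 3, 10, 19, 22):
            size = len(zstandard.ZstdCompressor(level=level).compress(data))
            print(f"zstd -{level}: {size} bytes")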

    Note that the state of the art for gzip-compatible compression is no longer GNU Gzip or zlib, which are old projects that emphasize compatibility with old computers. The leading implementations are libdeflate (by Eric Biggers) and zopfli (by Google, probably Jyrki Alakuijala and others). They both compress better than zlib, and libdeflate is also much faster (zopfli is super slow). The Cloudflare fork of zlib hasn’t been maintained, and it wasn’t actually usable or buildable last time I checked.

    https://github.com/ebiggers/libdeflate

    https://github.com/google/zopfli

    1. I have updated the blog post and the script with Eric Biggers’ code. It is indeed quite fast.

      1. Joe Duarte says:

        libdeflate offers more compression levels too. I think it’s 1 to 12, compared to 1 to 9 for typical gzip implementations. Those levels do in fact deliver better compression than legacy libraries like zlib – they’re not just there for granularity. For example, libdeflate 12 compresses more than zlib 9 (and libdeflate 9 probably compresses more than zlib 9, and faster).