Daniel Lemire's blog


Random identifiers are poorly compressible

2 thoughts on “Random identifiers are poorly compressible”

  1. Thomas Müller Graf says:

    Compressibility is one concern; performance (writes, lookups, …) is another. Some operations can be 100 times slower with random identifiers. To address this, there are new types of UUIDs that are “less” random: https://news.ycombinator.com/item?id=28088213 – they combine a more-or-less precise timestamp with some random bits (a sketch follows the comments). I think it would be interesting to analyze compressibility with those as well.

  2. The reason to use uuid column types in a database is so you can skip locking DB sequences or using synchronization strategies like HiLo, and instead generate collision-free identifiers client-side. Additionally, some databases will use the randomness to give you free, uniformly distributed sharding.

    But those random identifiers have no intrinsic value. So long as your database is a complete view of the data you wish to compress (i.e., a closed graph with nothing from the outside world pointing in), one distinct identifier is as good as another. You can consider a compression strategy that replaces UUID x with an (irreversible) one-to-one mapped value y as a form of lossy compression that yields an equivalent but not identical dataset upon decompression (see the second sketch below).

    Depending on where, how, and why the compression is taking place, that could be good enough. If an additional, unconstrained channel is available off the hot path, the mapping could also be exported separately to allow the substitution to be reversed out-of-band.
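
The time-ordered identifiers mentioned in the first comment can be sketched in a few lines. What follows is a minimal, illustrative construction in the spirit of UUIDv7 (RFC 9562): a 48-bit millisecond Unix timestamp in the high bits and random bits in the rest. The helper name `uuid7_sketch` is ours, not a standard-library API.

```python
import os
import time
import uuid

def uuid7_sketch() -> uuid.UUID:
    """Illustrative UUIDv7-style identifier: time prefix + random tail."""
    ms = time.time_ns() // 1_000_000              # 48-bit Unix timestamp (ms)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits

    value = (ms << 80) | rand    # timestamp occupies the top 48 bits
    value &= ~(0xF << 76)
    value |= 0x7 << 76           # version field = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62           # variant field = RFC 4122/9562

    return uuid.UUID(int=value)
```

Because the top bits increase monotonically with time, identifiers generated close together share long common prefixes: index inserts become append-like, and compressors find structure that fully random UUIDs lack.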
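The remapping idea in the second comment can likewise be sketched. Assuming the dataset is a closed graph, each distinct UUID is replaced by a dense sequential integer; `remap_ids` and its parameters are hypothetical names chosen for illustration.

```python
import uuid

def remap_ids(records: list[dict], key: str = "id"):
    """Replace UUIDs under `key` with dense integers; return the mapping."""
    mapping: dict[uuid.UUID, int] = {}
    remapped = []
    for rec in records:
        original = rec[key]
        # One-to-one: each distinct UUID gets the next small integer.
        rec = {**rec, key: mapping.setdefault(original, len(mapping))}
        remapped.append(rec)
    # Export `mapping` out-of-band if the substitution must be reversible.
    return remapped, mapping

# Usage: dense integers compress far better than 128 random bits apiece.
rows = [{"id": uuid.uuid4(), "v": i % 3} for i in range(5)]
small_rows, id_map = remap_ids(rows)
```

Discarding `id_map` gives the lossy variant described above: the decompressed dataset is equivalent (isomorphic) to the original, just not bit-identical.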