Daniel Lemire's blog


Compressing document-oriented databases by rewriting your documents

The space utilization of relational databases can be estimated quickly. If you create a table made of three columns, each containing a 32-bit integer, you can expect the database to use roughly 12 bytes per row, plus some overhead. Unless your database is tiny, how you name your columns is irrelevant to the space utilization, because the column names are stored once in the schema and not in every row.
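The 12-byte figure is simple arithmetic, and a minimal sketch can make it explicit. It assumes fixed-width 32-bit integers and ignores the per-row overhead, which varies between database engines; the function names are made up for illustration.

```python
import struct

# Assumption: each of the three columns is a fixed-width 32-bit integer.
ROW_FORMAT = "<3i"  # three little-endian int32 values, no padding


def row_payload_bytes() -> int:
    """Payload bytes per row, excluding any engine-specific overhead."""
    return struct.calcsize(ROW_FORMAT)


def table_payload_bytes(num_rows: int) -> int:
    """Payload bytes for the whole table; column names do not appear here."""
    return num_rows * row_payload_bytes()


print(row_payload_bytes())             # 12
print(table_payload_bytes(1_000_000))  # 12000000
```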

Document-oriented databases such as MongoDB are not so simple. There is room for optimization. Because there is no fixed schema, every document stores its own attribute names, so using short names for attributes is better. For example, in going from JSON tuples of the form

{date_achat:'1999-06-30',article:'Echasses',
quantite:1,prix:2800}

to these tuples where one attribute has a longer name

{date_achat:'1999-06-30',articlefromoutstore:'Echasses',
quantite:1,prix:2800}

you increase the space utilization per tuple by 12 bytes (from 105 to 117 bytes per tuple).
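The 12-byte difference can be checked by counting only the bytes spent on attribute names. The sketch below assumes that each name is stored in every document as UTF-8 text followed by one terminating zero byte, as BSON does; `key_bytes` is a hypothetical helper, not a MongoDB function.

```python
def key_bytes(doc: dict) -> int:
    """Bytes spent on attribute names: UTF-8 name plus one NUL terminator each."""
    return sum(len(name.encode("utf-8")) + 1 for name in doc)


original = {"date_achat": "1999-06-30", "article": "Echasses",
            "quantite": 1, "prix": 2800}
longer = {"date_achat": "1999-06-30", "articlefromoutstore": "Echasses",
          "quantite": 1, "prix": 2800}

# The longer name has 12 more characters, hence 12 more bytes per document.
print(key_bytes(longer) - key_bytes(original))  # 12
```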

The converse is true. Using shorter names is better:

{d:'1999-06-30',a:'Echasses',q:1,p:2800}

The space utilization per tuple goes down to 80 bytes (from 105 bytes). This is a saving of over 20%.
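One way to get this saving without making the application code unreadable is to rewrite documents at the database boundary. The following is only a sketch: the mapping `SHORT_NAMES` and the helpers `shorten` and `expand` are hypothetical, and it handles flat documents only.

```python
# Hypothetical mapping from readable names to the short names stored on disk.
SHORT_NAMES = {"date_achat": "d", "article": "a", "quantite": "q", "prix": "p"}
LONG_NAMES = {short: long for long, short in SHORT_NAMES.items()}


def shorten(doc: dict) -> dict:
    """Rename known attributes to their short form before storing."""
    return {SHORT_NAMES.get(name, name): value for name, value in doc.items()}


def expand(doc: dict) -> dict:
    """Restore the readable names after loading."""
    return {LONG_NAMES.get(name, name): value for name, value in doc.items()}


def bytes_saved(doc: dict) -> int:
    """Characters (one byte each for ASCII names) removed from the names."""
    return sum(len(name) - len(SHORT_NAMES.get(name, name)) for name in doc)


purchase = {"date_achat": "1999-06-30", "article": "Echasses",
            "quantite": 1, "prix": 2800}
print(shorten(purchase))      # {'d': '1999-06-30', 'a': 'Echasses', 'q': 1, 'p': 2800}
print(bytes_saved(purchase))  # 25
```

The 25 bytes removed from the names match the drop from 105 to 80 bytes.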

It is tempting to do away with the attribute names entirely and save the data as an array:

['1999-06-30','Echasses',1,2800]

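Yet the space utilization remains at 80 bytes because the binary format used by MongoDB (BSON) does not store arrays concisely. BSON encodes an array as a document whose attribute names are the indexes written as strings ('0', '1', '2', and so on), so a four-element array costs exactly as much as a document with four one-character names.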
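To see why the array does not help, here is a rough size estimator for the BSON encoding. It is a sketch under stated assumptions: it covers only strings, booleans, integers (assumed to fit in 32 bits), nested documents and arrays, and it reports the raw encoded size, which excludes the `_id` field and any storage overhead, so its absolute values differ from the 80 bytes quoted above. `bson_size` is a hypothetical helper, not part of any MongoDB driver.

```python
def _value_size(item) -> int:
    """Encoded size of one value, excluding its type byte and name."""
    if isinstance(item, str):
        # int32 length prefix + UTF-8 bytes + terminating NUL
        return 4 + len(item.encode("utf-8")) + 1
    if isinstance(item, bool):
        return 1
    if isinstance(item, int):
        return 4  # assumption: the value fits in an int32
    if isinstance(item, (dict, list)):
        return bson_size(item)
    raise TypeError(f"unsupported type: {type(item).__name__}")


def bson_size(value) -> int:
    """Estimated encoded size in bytes of a BSON document or array."""
    if isinstance(value, list):
        # BSON encodes an array as a document keyed by "0", "1", "2", ...
        value = {str(i): item for i, item in enumerate(value)}
    size = 4 + 1  # int32 total-length prefix + terminating NUL
    for name, item in value.items():
        size += 1 + len(name.encode("utf-8")) + 1  # type byte + NUL-terminated name
        size += _value_size(item)
    return size


short_doc = {"d": "1999-06-30", "a": "Echasses", "q": 1, "p": 2800}
as_array = ["1999-06-30", "Echasses", 1, 2800]

# Both encodings carry four one-character names, so the sizes are equal.
print(bson_size(short_doc) == bson_size(as_array))  # True
```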

Should we worry about this issue? We live in an era of abundant storage and memory. MongoDB pre-allocates the storage to avoid disk fragmentation. Even the tiniest collection will use 128 MB, and larger collections are stored in 2 GB files: MongoDB is unafraid to waste nearly 2 GB or more. In fact, we might say that it is precisely because we live in such abundance that we can afford to use document-oriented databases. However, engineers still face problems with space utilization. Hence, it is useful to be aware of the effect that the names you choose will have, especially if you come from a relational database context where name length is irrelevant.