Daniel Lemire's blog

What is the space overhead of Base64 encoding?

25 thoughts on “What is the space overhead of Base64 encoding?”

  1. Mike Leist says:

    How about using base91?

  2. magnus says:

    Thank you for the interesting post. What do you mean by “Privacy-wise, base64 encoding can have benefits since it hides the content you access in larger encrypted bundles.”?

    Base64 itself is an encoding scheme and not an encryption algorithm. Therefore it does not provide secrecy, but merely obscurity. Am I missing something here?

    1. Each HTTP request, even if encrypted, leaks information about what you are doing. I know which server you queried, how often, and how large the payloads were.

  3. mati says:

    lmao ur a proffesor? it’s all obvious

    1. A “proffesor”?

  4. Jon says:

    What’s the original file size after gzip? In other words, how does gzip(original) compare to gzip(base64(original)) and gzip(base64(gzip(original)))?

    Jon

    1. KWillets says:

      The files are available for download. bing.png.gz is 1387 bytes.

      1. Thanks. My guess is that most servers will not try to compress a PNG file with gzip.

  5. Thanks for the good primer, easy to follow.

    For clarity though, would you consider adding a column which shows the gzipped version of the original? Given that most servers and browsers enable this transparently by default these days, it might be misleading for some readers to omit it.

    Also, if you’re thinking of a follow-up, it would be great to compare typical CPU cycles for each approach. With IoT all the rage, storage and bandwidth often give way to the processing budget.

  6. Matthew Self says:

    Why would you use Base64 and then gzip? The point of using Base64 is so that the output only uses a safe subset of ASCII. But gzip will turn that into binary. If you’re going to compress a file, there is no value in using Base64 first.

    A more common use case is to compress the file first (using gzip, JPEG, or whatever is appropriate for the file) and then use Base64 to make the compressed file safe for transmission via email.

    1. Tom Ribbens says:

      The author was speaking about web servers. In current HTTP traffic, most content is automatically gzip compressed when sent out from a webserver.

      However, that means that the comparison should be done between the binary compressed and the base64 compressed. That’s the true comparison for real world situations.

        That’s what I was thinking too; you can’t just say “Oh, but compressed Base64 is almost as small as uncompressed binary”. That’s beside the point. That being said, many binary formats like JPEG are already compressed, so gzipping those may not help much; but since base64 encoding lowers the entropy per byte, it makes sense that the result becomes easier to compress again.

        Ultimately I don’t see much of a point though, as images easily get much larger than text-based formats like HTML. And the more of your data is static content, the more you can profit by caching it.

        Also base64 wastes 1/4 of the bits, not 1/3, plus a few bits depending on how the data aligns. So for large amounts of data, it’s essentially 25% of wasted bits.

        1. Also base64 wastes 1/4 of the bits, not 1/3

          My blog post is explicit as to what I mean: the base64 version of a file is 4/3 as large as it needs to be. You send 4 bytes for every 3 bytes of actual information.
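
The 4/3 ratio is easy to verify, for instance in Python (the 3,072-byte buffer below is an arbitrary illustration, not data from the post):

```python
import base64

# 3,072 bytes of input: a multiple of 3, so no padding is involved.
data = bytes(range(256)) * 12
encoded = base64.b64encode(data)

# 4 output bytes are sent for every 3 bytes of actual information.
print(len(data), len(encoded), len(encoded) / len(data))
```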

          1. José says:

            Hello, thank you for the article, it is very clear. I think you have an error in the sentence “So we use 33% more storage than we could.”. It should be 25%, because you are using 8 bits for each 6 bits. In fact, the “efficiency” of Base64 is 75%.

            1. José says:

              [Correction] Hello, thank you for the article, it is very clear. I think you have an error in the sentence “So we use 33% more storage than we could.”, it should be “So we use at least 33% more storage than we could.” The increase may be larger if the encoded data is small. For example, the string “b” with length === 1 is encoded as “Yg==” with length === 4 — a 300% increase.
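
José’s corrected claim can be checked with a short Python sketch of the padding rule (a hedged illustration, not code from the post):

```python
import base64

# A single byte encodes to four characters, two of them padding.
assert base64.b64encode(b"b") == b"Yg=="

# In general the output length is ceil(n / 3) * 4, so the relative
# overhead shrinks toward 33% as the input grows.
for n in range(1, 10):
    assert len(base64.b64encode(bytes(n))) == -(-n // 3) * 4
```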

      2. How certain are you that images are commonly served with gzip compression?

    2. Andrew Dalke says:

      The test here approximates the result of Base64 encoding to place an image in an HTML document, then using (negotiated) HTTP compression when exchanging the document.

      That is, for various reasons people may want to embed an image in an HTML document rather than provide a hypertext reference to the image. Most image formats contain binary data. HTML does not support embedding arbitrary binary objects, so the data must be encoded. Most people embed images as a data URI for the ‘img’ element. The data URI supports “base64” as the only available encoding scheme, so most people embed using Base64 encoding.

      The HTML document is then transferred over HTTP. HTTP supports automatic compression, if the client and server can agree on a compression scheme. The most widely used scheme is ‘gzip’, which is the same method used by the gzip command-line program.

      Thus, it is reasonable to approximate the payload overhead of base64-escaped data URIs followed by gzip HTTP compression, by taking the image file, Base64-encoding it, and using gzip to compress the result.
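
A minimal Python sketch of that approximation, using seeded pseudo-random bytes as a stand-in for an already-compressed image such as bing.png (the sizes are illustrative assumptions, not measurements from the post):

```python
import base64
import gzip
import random

# Stand-in for an already-compressed image: 10,000 incompressible bytes.
image = random.Random(0).randbytes(10_000)

# Base64-encode (as a data URI would), then gzip (as HTTP compression would).
payload = gzip.compress(base64.b64encode(image))

# gzip recovers most, but not all, of the base64 inflation: the payload
# ends up somewhat larger than the original binary image.
print(len(image), len(payload))
```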

      This was described in the text as “look at the HTML source” and “It is common for web servers to provide the content in compressed form”. This is a well-known topic that typically doesn’t deserve the level of detail I just gave.

      1. Thanks Andrew for the detailed explanation.

        1. KWillets says:

          Since the bitstream is already compressed, you’re probably seeing entropy compression for 64 uniformly distributed values. You might try inlining the images into a typical document (with more skew) to see if the compression holds up, i.e. their mutual information.

  7. Travis Downs says:

    You compare the size of the original content to the size of the base64 + gzipped result, but isn’t a more interesting comparison the one between the only-gzipped original content and the base64 + gzipped content? After all, we expect that, independently of whether base64 is needed in the protocol, compression will be applied.

    For your corpus of .png and .jpeg files, I don’t expect you to see much of a difference, since both png and jpeg are already backed by at-least-as-good-as-deflate coding, so re-compression is generally minimal. So all gzip is doing is undoing (via entropy coding, as the matching portion is probably useless) specifically the base64 inflation (and the fact that it still has a 5% overhead shows that it isn’t a particularly efficient entropy coder).

    For files that actually can be compressed, however, the results may be very different – and in realistic cases I think the result could be a penalty larger than 33% for base64, as the encoding can interfere with the compression.
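
That interference is easy to demonstrate on compressible input; a hedged Python sketch with a repetitive stand-in payload (not one of the files from the post):

```python
import base64
import gzip

# Highly compressible stand-in for text content (HTML, JSON, logs, ...).
text = b"<p>hello world</p>\n" * 2000

plain = gzip.compress(text)
encoded = gzip.compress(base64.b64encode(text))

# Both payloads carry exactly the same information, yet the base64 step
# leaves the compressed result larger.
print(len(plain), len(encoded))
```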

  8. Twirrim says:

    It’s worth noting that good practice has more recently tended to favour not gzip’ing content when hosting via TLS, due to compression + TLS enabling BREACH attacks: https://en.wikipedia.org/wiki/BREACH

    That actually makes the base64 inefficiencies even more glaring.

    1. Do we have any statistics on gzip usage? I have tried a few well-known sites, and they all appear to serve the content in compressed form. For example, GMail uses gzip. You would think that Google would be on top of things, security-wise. Or is that a security issue that is specific to some form of secure layers and not others?

      1. Twirrim says:

        Interesting question. So far as I understand it, it applies to all layers, but I’m not an expert in that type of thing.

        I don’t have access to the Alexa top 500 list, but Moz has a list of 500, https://moz.com/top500. I wrote some very rough code (https://gist.github.com/twirrim/877bcaf373aa1fec99c102b7c84ea1ce), using python3 and the requests library, to go through and check for Content-Encoding appearing in the headers of responses for them:

        {False: 53, True: 390, ‘Unknown’: 57}

        Unknown is a catch-all for “something didn’t go right” rather than indicative of any confusion about whether compression is enabled.

        So more use it than don’t, by a good margin.
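
For reference, a stdlib-only variant of that check might look like the following; `content_encoding` and `site_uses_gzip` are hypothetical names, and the gist linked above uses the third-party requests package instead:

```python
import urllib.request

def content_encoding(headers) -> str:
    # Normalize the Content-Encoding header; empty string when absent.
    return (headers.get("Content-Encoding") or "").lower()

def site_uses_gzip(url: str) -> bool:
    # One round-trip per site: advertise gzip support, inspect the reply.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return content_encoding(resp.headers) == "gzip"
```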

  9. Mark says:

    Just don’t rely on size estimations to limit ingress traffic: some base64 encodings allow comments, which can be used to amplify the ratio bytes_in/bytes_decoded.

    1. ASCII spaces are certainly possible within base64 encoded text, but I have never seen comments. Do you have an example in the wild or a reference to the part of the specification that allows comments?