Daniel Lemire's blog

What is the space overhead of Base64 encoding?

25 thoughts on “What is the space overhead of Base64 encoding?”

  1. Mike Leist says:

    How about using base91?

  2. magnus says:

    Thank you for the interesting post. What do you mean by “Privacy-wise, base64 encoding can have benefits since it hides the content you access in larger encrypted bundles.”?

    Base64 itself is an encoding scheme and not an encryption algorithm. Therefore it does not provide secrecy, but merely obscurity. Am I missing something here?

    1. Each HTTP request, even if encrypted, leaks information about what you are doing. I know which server you queried, how often, and how large the payloads were.

  3. mati says:

    lmao ur a proffesor? it’s all obvious

    1. A “proffesor”?

  4. Jon says:

    What’s the original file size after gzip? In other words, how does gzip(original) compare to gzip(base64(original)) and gzip(base64(gzip(original)))?

    Jon

    1. KWillets says:

      The files are available for download. bing.png.gz is 1387 bytes.

      1. Thanks. My guess is that most servers will not try to compress a PNG file with gzip.

  5. Thanks for the good primer, easy to follow.

    For clarity though, would you consider adding a column which shows the gzipped version of the original? Given that most servers and browsers enable this transparently by default these days, it might be misleading for some readers to omit it.

    Also, if you’re thinking of a follow-up, it would be great to compare typical CPU cycles for each approach. With IoT all the rage, storage and bandwidth often give way to the processing budget.

  6. Matthew Self says:

    Why would you use Base64 and then gzip? The point of using Base64 is so that the output only uses a safe subset of ASCII. But gzip will turn that into binary. If you’re going to compress a file, there is no value in using Base64 first.

    A more common use case is to compress the file first (using gzip, JPEG, or whatever is appropriate for the file) and then use Base64 to make the compressed file safe for transmission via email.

    1. Tom Ribbens says:

      The author was speaking about web servers. In current HTTP traffic, most content is automatically gzip compressed when sent out from a webserver.

      However, that means that the comparison should be done between the binary compressed and the base64 compressed. That’s the true comparison for real world situations.

        That’s what I was thinking too; you can’t just say “Oh, but compressed Base64 is almost as small as uncompressed binary”. That’s beside the point. That being said, many binary formats like JPEG are already compressed, so gzipping those may not help much; but since base64 encoding lowers the entropy per byte, it makes sense that the result becomes easier to compress again.

        Ultimately I don’t see much of a point though, as images easily get much larger than text-based formats like HTML. And the more of your data is static content, the more you can profit by caching it.

        Also base64 wastes 1/4 of the bits, not 1/3, plus a few bits depending on how the data aligns. So for large amounts of data, it’s essentially 25% of wasted bits.

        1. Also base64 wastes 1/4 of the bits, not 1/3

          My blog post is explicit as to what I mean: the base64 version of a file is 4/3 as large as it needs to be. You send 4 bytes for every 3 bytes of actual information.
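
The 4/3 ratio is easy to verify, for instance in Python (the 3,072-byte buffer below is an arbitrary illustration, not data from the post):

```python
import base64

# 3,072 bytes of input: a multiple of 3, so no padding is involved.
data = bytes(range(256)) * 12
encoded = base64.b64encode(data)

# 4 output bytes are sent for every 3 bytes of actual information.
print(len(data), len(encoded), len(encoded) / len(data))
```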

          1. José says:

            Hello, thank you for the article, it is very clear. I think you have an error in the sentence “So we use 33% more storage than we could.”. It should be 25%, because you are using 8 bits for each 6 bits. In fact, the “efficiency” of Base64 is 75%.

            1. José says:

              [Correction] Hello, thank you for the article, it is very clear. I think you have an error in the sentence “So we use 33% more storage than we could.”, it should be “So we use at least 33% more storage than we could.” The increase may be larger if the encoded data is small. For example, the string “b” with length === 1 is encoded as “Yg==” with length === 4 — a 300% increase.
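
José’s corrected claim can be checked with a short Python sketch of the padding rule (a hedged illustration, not code from the post):

```python
import base64

# A single byte encodes to four characters, two of them padding.
assert base64.b64encode(b"b") == b"Yg=="

# In general the output length is ceil(n / 3) * 4, so the relative
# overhead shrinks toward 33% as the input grows.
for n in range(1, 10):
    assert len(base64.b64encode(bytes(n))) == -(-n // 3) * 4
```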

      2. How certain are you that images are commonly served with gzip compression?

    2. Andrew Dalke says:

      The test here approximates the result of Base64 encoding to place an image in an HTML document, then using (negotiated) HTTP compression when exchanging the document.

      That is, for various reasons people may want to embed an image in an HTML document rather than provide a hypertext reference to the image. Most image formats contain binary data. HTML does not support embedding arbitrary binary objects, so the data must be encoded. Most people embed images as a data URI for the ‘img’ element. The data URI supports “base64” as the only available encoding scheme, so most people embed using Base64 encoding.

      The HTML document is then transferred over HTTP. HTTP supports automatic compression, if the client and server can agree on a compression scheme. The most widely used scheme is ‘gzip’, which is the same method used by the gzip command-line program.

      Thus, it is reasonable to approximate the payload overhead of base64-escaped data URIs followed by gzip HTTP compression, by taking the image file, Base64-encoding it, and using gzip to compress the result.
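
A minimal Python sketch of that approximation, using seeded pseudo-random bytes as a stand-in for an already-compressed image such as bing.png (the sizes are illustrative assumptions, not measurements from the post):

```python
import base64
import gzip
import random

# Stand-in for an already-compressed image: 10,000 incompressible bytes.
image = random.Random(0).randbytes(10_000)

# Base64-encode (as a data URI would), then gzip (as HTTP compression would).
payload = gzip.compress(base64.b64encode(image))

# gzip recovers most, but not all, of the base64 inflation: the payload
# ends up somewhat larger than the original binary image.
print(len(image), len(payload))
```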

      This was described in the text as “look at the HTML source” and “It is common for web servers to provide the content in compressed form”. This is a well-known topic that typically doesn’t deserve the level of detail I just gave.

      1. Thanks Andrew for the detailed explanation.

        1. KWillets says:

          Since the bitstream is already compressed, you’re probably seeing entropy compression for 64 uniformly distributed values. You might try inlining the images into a typical document (with more skew) to see if the compression holds up, i.e. their mutual information.

  7. Travis Downs says:

    You compare the size of the original content to the size of the base64 + gzipped result, but isn’t a more interesting comparison the one between the only-gzipped original content and the base64 + gzipped content? After all, we expect that, independently of whether base64 is needed in the protocol, compression will be applied.

    For your corpus of .png and .jpeg files, I don’t expect you to see much of a difference, since both png and jpeg are already backed by at-least-as-good-as-deflate coding, so re-compression is generally minimal. So all gzip is doing is undoing (via entropy coding, as the matching portion is probably useless) specifically the base64 inflation (and the fact that it still has a 5% overhead shows that it isn’t a particularly efficient entropy coder).

    For files that actually can be compressed, however, the results may be very different – and in realistic cases I think the result could be a penalty larger than 33% for base64, as the encoding can interfere with the compression.
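
That interference is easy to demonstrate on compressible input; a hedged Python sketch with a repetitive stand-in payload (not one of the files from the post):

```python
import base64
import gzip

# Highly compressible stand-in for text content (HTML, JSON, logs, ...).
text = b"<p>hello world</p>\n" * 2000

plain = gzip.compress(text)
encoded = gzip.compress(base64.b64encode(text))

# Both payloads carry exactly the same information, yet the base64 step
# leaves the compressed result larger.
print(len(plain), len(encoded))
```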

  8. Twirrim says:

    It’s worth noting that good practice has more recently tended to favour not gzip’ing content when hosting via TLS, due to compression + TLS enabling BREACH attacks: https://en.wikipedia.org/wiki/BREACH

    That actually makes the base64 inefficiencies even more glaring.

    1. Do we have any statistics on gzip usage? I have tried a few well-known sites, and they all appear to serve the content in compressed form. For example, GMail uses gzip. You would think that Google would be on top of things, security-wise. Or is that a security issue that is specific to some form of secure layers and not others?

      1. Twirrim says:

        Interesting question. So far as I understand it, it applies to all layers, but I’m not an expert in that type of thing.

        I don’t have access to the Alexa top 500 list, but Moz has a list of 500, https://moz.com/top500. I wrote some very rough code (https://gist.github.com/twirrim/877bcaf373aa1fec99c102b7c84ea1ce), using python3 and the requests library, to go through and check for Content-Encoding appearing in the headers of responses for them:

        {False: 53, True: 390, ‘Unknown’: 57}

        Unknown is a catch-all for “something didn’t go right” rather than indicative of any confusion about whether compression is enabled.

        So more use it than don’t, by a good margin.
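
For reference, a stdlib-only variant of that check might look like the following; `content_encoding` and `site_uses_gzip` are hypothetical names, and the gist linked above uses the third-party requests package instead:

```python
import urllib.request

def content_encoding(headers) -> str:
    # Normalize the Content-Encoding header; empty string when absent.
    return (headers.get("Content-Encoding") or "").lower()

def site_uses_gzip(url: str) -> bool:
    # One round-trip per site: advertise gzip support, inspect the reply.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return content_encoding(resp.headers) == "gzip"
```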

  9. Mark says:

    Just don’t rely on size estimations to limit ingress traffic: some base64 encodings allow comments, which can be used to amplify the ratio bytes_in/bytes_decoded.

    1. ASCII spaces are certainly possible within base64 encoded text, but I have never seen comments. Do you have an example in the wild or a reference to the part of the specification that allows comments?