Daniel Lemire's blog


Ridiculously fast base64 encoding and decoding

13 thoughts on “Ridiculously fast base64 encoding and decoding”

  1. Bingo Du says:

    Wonderful results!

  2. Translate says:

    Thanks, inspiring article!

  3. Great job!

    But you should warn about the use of AVX2.
    Unfortunately, the use of AVX2 severely throttles the CPU, which can cause system-wide performance issues as it affects other processes.

    See https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

    Nonetheless, great job pointing out it can be done better!

    1. “But you should warn about the use of AVX2.”

      The paper is called: “Faster Base64 Encoding and Decoding Using AVX2 Instructions”.

      “Unfortunately, the use of AVX2 severely throttles the CPU, which can cause system-wide performance issues”

      Intel reduces the turbo frequency depending on the instruction mix. On Skylake X, AVX-512 instructions have a greater effect than AVX2 instructions with multiplications and floating-point operations. Simple AVX2 instructions can be used without any reduction to the turbo frequency. The effect is tiny on processors having few active cores (e.g., 4), and unlikely to be measurable, but it is larger on wide chips with many active cores (e.g., 28).

      If you have a chip with many active cores (many more than 4), if you have a CPU-heavy load, and if AVX-512 does not accelerate the computation much, then you can get a negative outcome. This is discussed in Intel’s optimization manual.

      The link you refer to describes this scenario: they have 24-core processors, with all cores active, and they use AVX-512 instructions.

      1. Travis Downs says:

        To be fair to the grandparent poster, the “normal” frequency of almost any recent Intel chip is totally irrelevant. The chip almost never runs at that speed. It’s almost always either “off” (in some non-zero C-state), running at minimum frequency (i.e., the most efficient frequency, usually at Vmin, around 500-1000 MHz), or running at maximum turbo frequency. Only rarely will you find it running at other frequencies, from min up to and including normal, which usually happens during workload transitions.

        “Normal” (the frequency printed on the box) isn’t at all special here in terms of how often it’s used: your chip probably runs at “normal” frequency less than 1% of the time. If you want to know how fast your CPU will run something, the turbo frequency is essentially the only number you need to know (and, unfortunately, the turbo ratio table for multiple running cores).

        Intel puts it on the box, probably for historical reasons and because of the confusing way the turbo ratio depends on the number of running cores, so for my 4-core CPU they can either say “2.6 GHz” or “3.5/3.4/3.3/3.2 GHz”, and for a 28-core CPU, well…

        Intel also positions the normal frequency as the “guaranteed” frequency, but in practice this has almost no meaning today: except in very small form factors or with very poor cooling you’ll generally run at the max turbo indefinitely, and if you get hot enough or draw too much current you can go below normal anyways, so essentially all frequencies are “if conditions permit”.

        Historically, and still to some extent today, the normal frequency was important for the power management API the chip offers: they expose the ability to the OS to adjust the frequency between the min and normal frequencies, so normal was relevant there – for turbo speeds you had to let the hardware take control. Later on the chips offered more control over turbo ratios too, but the interface (i.e., which MSRs you write and what you write) was totally different. These days the recommended mode of operation is “HWP”, which is hardware performance management, essentially giving the CPU control over the whole frequency range (the P-states), so that distinction has mostly disappeared.

        I wanted to comment on the AVX2/AVX512 throttling too, since I think there is some misunderstanding above, but this is already long enough… 🙂

        I’m happy to add that part later if anyone is interested.

    2. Another question is… how certain are we that our software does not already use AVX instructions?

      1. Travis Downs says:

        It is pretty easy to prevent the compiler from emitting AVX2 in code you are compiling with the appropriate compiler flags, but that’s only part of the story – you also have to check any third party libraries you use, especially the C and C++ standard libraries which almost everyone uses.

        The C library especially is almost always implemented with AVX2 for methods like memcpy, and you’ll often get these faster methods even if you didn’t compile with AVX2 flags (or even if you compiled before AVX2 existed) through the magic of runtime dispatch (including the runtime linker IFUNC magic).

        Finally, even interrupts or other processes running on the same CPU (including at the same time on the sibling hyperthread) might decide to use AVX2, slowing down your whole CPU (the interrupt case is admittedly a bit of a stretch!).
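The two points above (per-function compiler flags and runtime dispatch) can be illustrated with a minimal sketch. It assumes GCC or Clang on x86 and uses their `__builtin_cpu_supports` builtin and `target` attribute; the function names `sum_bytes`, `sum_bytes_scalar` and `sum_bytes_avx2` are hypothetical and chosen only for illustration. This is not how the C library dispatches `memcpy` (that relies on the linker-level IFUNC mechanism mentioned above); it only shows the general idea that a binary built without AVX2 flags can still end up executing AVX2 code.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: adds the bytes one at a time. */
static uint64_t sum_bytes_scalar(const uint8_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* The target attribute enables AVX2 for this function only, so the rest of
   the file can be compiled without -mavx2. */
__attribute__((target("avx2")))
static uint64_t sum_bytes_avx2(const uint8_t *data, size_t n) {
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(data + i));
        /* Sum of absolute differences against zero adds each group of
           8 bytes into one 64-bit lane. */
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(v, zero));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++) {
        total += data[i]; /* leftover tail bytes */
    }
    return total;
}
#endif

/* Picks an implementation at run time based on what the CPU reports. */
uint64_t sum_bytes(const uint8_t *data, size_t n) {
#ifdef HAVE_AVX2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) {
        return sum_bytes_avx2(data, n);
    }
#endif
    return sum_bytes_scalar(data, n);
}
```

A caller only ever sees `sum_bytes(buffer, length)`, so nothing at the call site reveals whether AVX2 instructions are executed; that depends on the machine the program runs on.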

        1. Right. Java certainly JIT-compiles code to use AVX if it detects that the processor supports it.

          1. Travis Downs says:

            Exactly, which is one of the tick marks in the column for “how a runtime-interpreted language like Java can be faster than a native compiled language like C”. That is, it can use CPU instructions that weren’t even invented when the source was compiled!

            1. Alex says:

              Good point, that never occurred to me!

  4. Mica says:

    Hi Daniel

    It would be really GREAT if you could make a “SIMD tutorial” for newcomers.

    As you said, there is very little information about how to use SIMD in practice.

    And please, if you decide to do so use “C” for simplicity 🙂


    1. I agree Mica.

  5. Amit Dhingra says:

    Hi Daniel,

    Is sending images and video files in base64 format inside the HTML file through a web API a good approach compared with sending the HTML, images, and videos all together in a zip file?
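One factor in that comparison is size: standard padded base64 emits 4 output characters for every 3 input bytes, so embedded media grows by roughly a third before any transport compression. A minimal sketch of that arithmetic follows; the helper name `base64_encoded_len` is hypothetical and it assumes padded output with no line breaks.

```c
#include <stddef.h>

/* Length in characters of padded base64 output for n input bytes:
   every group of up to 3 bytes becomes exactly 4 characters. */
size_t base64_encoded_len(size_t n) {
    return 4 * ((n + 2) / 3);
}
```

Under these assumptions, a 3,000,000-byte video becomes 4,000,000 characters of base64 text inside the HTML file.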