Daniel Lemire's blog

The dangers of AVX-512 throttling: a 3% impact on Xeon Gold processors?

10 thoughts on “The dangers of AVX-512 throttling: a 3% impact on Xeon Gold processors?”

  1. Nathan Kurz says:

    Were you running this benchmark on your Skylake-X, or on the Packet Xeon Gold? The link that Travis provided on the previous post was useful: https://en.wikichip.org/wiki/intel/xeon_gold/5120

    It says that the expected single core drop for that particular Xeon Gold would be from 3200 MHz for “normal”, to 3100 MHz for “AVX2”, and then down to 2900 MHz for “AVX512”. The drop from “normal” to “AVX2” is suspiciously close to the 3% that you report.

    What becomes more interesting is that the expected drop when running on 9 or more cores is much more dramatic. At 9, 2700 MHz is “normal”, 2300 MHz is “AVX2”, and 1600 MHz is “AVX512”.
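To put the quoted figures side by side, here is a minimal Python sketch that turns them into relative drops. The table holds only the numbers cited in this comment for the Xeon Gold 5120; the names `TURBO_MHZ`, `relative_drop` and `drops_for` are made up for illustration.

```python
# Turbo frequencies (MHz) as quoted in the comment for the Xeon Gold 5120,
# keyed by the number of active cores.
TURBO_MHZ = {
    1: {"normal": 3200, "AVX2": 3100, "AVX512": 2900},
    9: {"normal": 2700, "AVX2": 2300, "AVX512": 1600},
}


def relative_drop(baseline_mhz, reduced_mhz):
    """Fractional frequency loss relative to the baseline."""
    return (baseline_mhz - reduced_mhz) / baseline_mhz


def drops_for(active_cores):
    """Drop of each AVX level relative to the 'normal' turbo at that core count."""
    row = TURBO_MHZ[active_cores]
    return {
        level: relative_drop(row["normal"], mhz)
        for level, mhz in row.items()
        if level != "normal"
    }


if __name__ == "__main__":
    for cores in sorted(TURBO_MHZ):
        for level, drop in drops_for(cores).items():
            print(f"{cores} active core(s), {level}: {drop:.1%} below normal turbo")
```

With one active core the AVX2 drop comes out at about 3.1%, while with nine active cores the AVX512 drop is roughly 41%.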

    The linked explanation from that page seems to be a good overview of the frequency selection algorithm on Intel: https://en.wikichip.org/wiki/intel/frequency_behavior

    1. Thanks Nate.

      So the theory is that my benchmark simply does not use enough cores. But I already ran Vlad’s multicore benchmark on this same hardware and saw no effect.

      I am inviting reproducible benchmarks; I will gladly run them.

  2. Me says:

    It depends a lot on your cooling.

    On my laptop, using AVX2 heavily on a single core with turbo boost disabled is enough to cause the fan to go to max and cause CPU throttling that usually would only kick in with multiple cores and turbo boost.
    But of course the cooling on a laptop isn’t designed for such load.

    1. foobar says:

      That’s different from the possible necessity to clock down the core in order to cope with complex AVX-512 instructions. If all of this more or less boils down to cooling budget, it’s no wonder people see very different results depending on their setup.

      It might be interesting to see how CPU power usage differs between different variations of the test. I think modern CPUs have quite fine-grained abilities to measure used energy in joules. It might be that although throughput of a CPU doesn’t vary much, energy efficiency may vary quite a bit – and that tends to be important to organisations running large numbers of servers.
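One way to try this on Linux is sketched below. It assumes the kernel exposes Intel's RAPL package counter through the powercap sysfs files `energy_uj` and `max_energy_range_uj` under `/sys/class/powercap/intel-rapl:0`, which may be missing or readable only by root on a given machine. The helper names are hypothetical, and the counter covers the whole package, not a single core.

```python
import time
from pathlib import Path

# Assumption: Linux powercap interface for RAPL package 0.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(name):
    """Read an integer counter file, or return None if it is unavailable."""
    try:
        return int((RAPL_DIR / name).read_text())
    except (OSError, ValueError):
        return None


def energy_delta_joules(before_uj, after_uj, max_range_uj):
    """Energy between two microjoule readings, allowing for one wraparound."""
    delta = after_uj - before_uj
    if delta < 0:
        delta += max_range_uj
    return delta / 1e6


def measure(workload):
    """Run workload() and return (elapsed seconds, joules or None)."""
    max_range = read_counter("max_energy_range_uj")
    before = read_counter("energy_uj")
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    after = read_counter("energy_uj")
    if None in (before, after, max_range):
        return elapsed, None
    return elapsed, energy_delta_joules(before, after, max_range)


if __name__ == "__main__":
    seconds, joules = measure(lambda: sum(i * i for i in range(10_000_000)))
    print(f"elapsed: {seconds:.3f} s, package energy: {joules} J")
```

Running the same benchmark compiled with and without AVX-512 under `measure` would give joules per run to compare alongside the timings.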

  3. Paul Graydon says:

    I think you’re missing one of the key points that came out of the blog posts and the top comments on them: the different tiers of processors (Bronze, Silver, Gold and Platinum) all have very different performance characteristics when it comes to AVX-512.

    In the original cloudflare blog post, Vlad was testing against a Xeon Silver instance. Note the frequency table here: https://en.wikichip.org/wiki/intel/xeon_silver/4116
    Then compare it to the frequency table for that Xeon Gold chip you’re testing on:

    Note that the Silver drops maximum core speed below base speed as soon as you have any cores working on AVX-512. In the benchmarks Vlad was running, AVX-512 didn’t make up the majority of instructions, just a single-digit percentage, but the rest of the chip would end up throttled down to 1.8 GHz or as low as 1.4 GHz depending on how those AVX-512 requests landed. It only takes 9 active cores doing AVX-512 to get a Silver down to 1.4 GHz.

    In the Gold case, you have to get to at least 4 cores working on AVX-512 simultaneously before the maximum core speed drops below base speed, and even then the Gold retains a much higher clock rate for quite a ways across the board. Even with all cores working on AVX-512, the Gold chip you’re testing against doesn’t touch 1.4 GHz.

    Here’s a fun comparison. If you go to the more premium side, the systems at the cloud provider I work for have Xeon Platinum 8160-somethings in them. Take a look at how drastically different the frequency table is for a Platinum 8168:
    https://en.wikichip.org/wiki/intel/xeon_platinum/8168. You have to get 17 cores simultaneously working on AVX-512 instructions to get the maximum core speed down below the normal core speed.

    Or to go further to the cheap side, the Bronze chips:
    As soon as you do any AVX-512 operations, your core speed drops by more than half, from 1.7 GHz to just 800 MHz. AVX2 isn’t great there, but it’s not as catastrophic to performance.

    What’s concerning here is that there aren’t immediately apparent ways of figuring this out at runtime. The chips will all show up as supporting AVX-512, and while that’s certainly accurate, I can easily see that I would likely want to avoid AVX-512 instructions on a Bronze or Silver, while embracing them on Gold and Platinum.
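One crude workaround is to look at the CPU brand string at startup. The sketch below is only a heuristic: it assumes Linux, a brand string of the form `Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz` in `/proc/cpuinfo`, and a hypothetical policy of avoiding AVX-512 on Bronze and Silver. The function names are invented for illustration, and a real decision should rest on measurements for the actual workload.

```python
import re

# Hypothetical policy for illustration only.
AVOID_AVX512_TIERS = {"Bronze", "Silver"}


def xeon_tier(model_name):
    """Return 'Bronze', 'Silver', 'Gold' or 'Platinum' if the brand string names one."""
    match = re.search(r"Xeon\(R\)\s+(Bronze|Silver|Gold|Platinum)\b", model_name)
    return match.group(1) if match else None


def cpu_model_name(path="/proc/cpuinfo"):
    """First 'model name' entry from /proc/cpuinfo (Linux only), or None."""
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return None


def prefer_avx512(model_name):
    """None when the tier is unknown, False for the lower tiers, True otherwise."""
    tier = xeon_tier(model_name or "")
    if tier is None:
        return None
    return tier not in AVOID_AVX512_TIERS


if __name__ == "__main__":
    name = cpu_model_name()
    print(name, "->", prefer_avx512(name))
```

Returning `None` for unrecognised chips leaves the caller free to fall back to whatever default it already uses.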

  4. Paul Graydon says:

    Not sure if my prior comment got swallowed by spam filters (probably; I did link to wikichip a whole bunch). Unfortunately, something seems to be up with their DNS today, so I’m writing my comments off of my memory of the frequency tables I saw last night.

    There are some very important differences between your benchmarking and Vlad’s:

    1) He’s using Xeon Silver chips. Bronze, Silver, Gold, and Platinum Xeon chips have very different responses to running AVX-512 instructions.

    2) He’s using a very mixed workload, where AVX-512 instructions only play a relatively small part (<10%) of the workload being carried out by the processor. The workload he’s using is simulating a webserver environment, where the processor is handling nginx and similar work as well as OpenSSL/cryptography stuff that’s leveraging AVX instructions. Silver chips are likely to be popular for similar workloads. At least on the surface of things they strike a good price/performance point.

    With regard to frequency scaling:
    On the low end Bronze chips, just having 1 single core running an AVX-512 instruction is enough to drop the base frequency of the chip down to 800 MHz.

    On the Silver chip that Vlad was using, it’s not quite as dramatic, but still, the moment you run an AVX-512 instruction the core frequency drops noticeably. By the time you get more than 4 cores running it, it drops down from 2.2 GHz to 1.4 GHz.

    By the time you get to Gold, you can have several cores running AVX-512 instructions before the core frequency drops below base frequency. On Platinum you can have more than half the cores running AVX instructions and not see an impact.

    The unfortunate outcome of the approach Intel has taken here is that if you’re using Xeon Bronze or Silver chips, you’re likely going to want to avoid AVX-512, unless you’re doing purely AVX-512.
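A back-of-the-envelope model shows why the mixed case hurts. This sketch assumes the core sits at the lower frequency for the whole run once AVX-512 is in use, and that the vectorised share of the work needs `avx_speedup` times fewer cycles. The 2x speedup in the examples is an arbitrary illustrative value, and the 2.2 GHz and 1.4 GHz figures are the Silver numbers quoted from memory above.

```python
def relative_runtime(avx_fraction, avx_speedup, scalar_ghz, avx_ghz):
    """Runtime with AVX-512 divided by all-scalar runtime (below 1.0 is a win).

    avx_fraction: share of the scalar cycle count that gets vectorised.
    avx_speedup:  how many times fewer cycles that share needs with AVX-512.
    Simplification: the whole run executes at avx_ghz instead of scalar_ghz.
    """
    cycles = (1.0 - avx_fraction) + avx_fraction / avx_speedup
    return cycles * (scalar_ghz / avx_ghz)


if __name__ == "__main__":
    # 10% of the work vectorised versus all of it, at an assumed 2x speedup.
    for fraction in (0.1, 1.0):
        ratio = relative_runtime(fraction, 2.0, scalar_ghz=2.2, avx_ghz=1.4)
        print(f"AVX-512 share {fraction:.0%}: runtime x{ratio:.2f}")
```

Under these assumptions the 10% case runs about 1.5 times slower than pure scalar code, while the all-AVX-512 case still comes out ahead, which matches the "unless you’re doing purely AVX-512" caveat.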

    What seems particularly dangerous about this approach, from a compiler perspective, is that you no longer just target an architecture with optimisations; you need to be aware of specific models to figure out which approach is correct.

    I wonder if this may be a place where JIT’d languages might see a notable advantage, being able to gather a lot more information at run-time to guide optimisations?

    1. Travis Downs says:

      By the time you get to Gold, you can have several cores running
      AVX-512 instructions before the core frequency drops below base
      frequency. On Platinum you can have more than half the cores running
      AVX instructions and not see an impact.

      I think this part embeds a wrong assumption. Yes, you can have several cores running AVX-512 without seeing a drop below base frequency, but there is nothing really special about base frequency: it’s just the number on the box and Intel makes some kind of loose guarantees about it – but it is almost irrelevant for most code.

      Most cores are going to be running at “turbo” frequencies most of the time, including “AVX-512 turbo” which may be above or below base – but the logical comparison is between AVX-512 turbo and the scalar (non-AVX) turbo, not between AVX turbo and base. Or more precisely, the right comparison is between the actual frequencies, both for AVX-using and non-AVX code, and the turbo frequencies are good proxies for those.

      So regarding: “On Platinum you can have more than half the cores running AVX instructions and not see an impact.” – the “not see an impact” part is not true: almost all chips will suffer a frequency impact relative to the scalar case as soon as they run some heavy AVX-256 or any AVX-512; only relative to the arbitrary base frequency is there no “impact”.

  5. Nathan Kurz says:

    On the low end Bronze chips, just having 1 single core running an
    AVX-512 instruction is enough to drop the base frequency of the chip
    down to 800 MHz.

    You probably understand this, but to clarify for others, “chip” here means just that particular core, and not all the cores on the CPU. The belief (likely correct) is that if you use a single “heavy” AVX512 instruction (such as a 512-bit multiplication), that particular core will momentarily be slowed down to 800 MHz. Do we know how long “momentary” is here, and what the transition penalty is?

    Thus if you are performing a task that is already at maximum IPC, you would expect a greater than 50% slowdown. On the other hand, if you are already slowed down by memory accesses, you might not notice anything even on Xeon Bronze. So while it should be possible to come up with a benchmark that shows the full impact, it might not be easy to come up with one that doesn’t feel “artificial”.
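The two extremes can be sketched with a simple model in which only part of the runtime scales with core frequency. The split into a "compute share" and a fixed remainder is a simplifying assumption, and 1.7 GHz and 0.8 GHz are the Bronze figures mentioned earlier in the thread.

```python
def slowdown(compute_share, fast_ghz, slow_ghz):
    """Runtime at slow_ghz divided by runtime at fast_ghz.

    compute_share: fraction of the original runtime spent on core-clocked work.
    The remainder (for example, time waiting on memory) is assumed not to
    change with core frequency.
    """
    scaled = compute_share * (fast_ghz / slow_ghz)
    return scaled + (1.0 - compute_share)


if __name__ == "__main__":
    for share in (1.0, 0.5, 0.1):
        factor = slowdown(share, fast_ghz=1.7, slow_ghz=0.8)
        print(f"compute share {share:.0%}: runtime x{factor:.3f}")
```

A fully compute-bound task takes 2.125 times as long (throughput falls by about 53%), whereas a task that spends 90% of its time waiting on memory takes only about 11% longer.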

    1. Paul Graydon says:

      With full Intel Speed Shift support, which was introduced with Skylake, the latency in changing speed is reportedly ~35ms. Without it, it’s ~100ms. Even at 35ms to change speeds, that’s going to have an impact on performance.

      1. Travis Downs says:

        I have measured certain types of transitions on Skylake and they are faster than ~35ms. For example, the frequency transitions between various turbo speeds (which are forced as various cores come out of sleep) take about 20,000 cycles, which is about 8 µs, or more than 1000x faster than 35ms.
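The conversion behind those figures is short enough to check directly. The 2.5 GHz clock below is an assumption: it is the frequency at which 20,000 cycles comes to 8 microseconds, and the comment does not state the actual clock.

```python
def cycles_to_seconds(cycles, ghz):
    """Convert a cycle count to seconds at a given clock frequency in GHz."""
    return cycles / (ghz * 1e9)


# Assumed clock: 2.5 GHz (implied by 20,000 cycles being about 8 microseconds).
transition_s = cycles_to_seconds(20_000, 2.5)

# How many times shorter this is than the reported 35 ms latency.
ratio = 35e-3 / transition_s

if __name__ == "__main__":
    print(f"transition: {transition_s * 1e6:.1f} microseconds")
    print(f"35 ms is about {ratio:.0f}x longer")
```

The ratio works out to a few thousand, consistent with "more than 1000x".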

        Perhaps there are some types of transitions that take longer, however, e.g., if the voltage needs to change.

        There is also another type of transition where the frequency doesn’t change, but the “upper lanes” of the ALUs are powered up, which occurs on some chips if you don’t run 256-bit instructions for a while, then you run one. Agner describes it at the end of this comment on his blog. This transition is also in the “microseconds” not “milliseconds” range.