Daniel Lemire's blog


AVX-512: when and how to use these new instructions

15 thoughts on “AVX-512: when and how to use these new instructions”

  1. Francois Piednoel says:

    Your instruction stream density is what decides the frequency decrease you will see when you use AVX512. The decrease can range from a few hundred MHz to 1 GHz on the Xeon Gold. To decide how much frequency you lose, the Power Control Unit (PCU) counts “units of power” (it has a lookup table for almost every operation in the SoC, including the fabric and the cores). Some of the heavy instructions, like fused multiply-add (FMA), get a really high count of units of energy, so if you end up at 2.9 GHz on the Xeon, it is very likely that you are using a lot of them. You are usually rewarded by the 512-bit SIMD speedup if you have a high count of FMAs. Optimizing for AVX512 requires a good understanding of the instruction stream, ASM optimization, and a good understanding of how the PCU works. Here is Efi explaining how it worked in Sandy Bridge; the mechanisms have changed a little, but not enough to make it useless: https://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf
    Globally, if you are using the descendants of MMX (integer SIMD) and your stream is optimized properly, you will end up around 3.5 GHz. If you have a dense instruction stream (IS) of SIMD FP, you will end up at 2.9 GHz to 3.2 GHz. Those numbers are only good on the Xeon Gold. If you are using SKX, go to the UEFI settings and equalize the AVX2 and AVX512 frequencies, and this will solve all of your problems. I have never found an SKX (Extreme Edition) that cannot sustain the POR frequency doing AVX512, if you put it into a motherboard with strong voltage regulation (VR) and a 1000-watt power supply.
    Good day !

    1. Travis Downs says:

      Are you saying there are more than three levels? Based on my tests and all the documentation I’ve seen there are only three levels, and when you enter them is more or less deterministic. The main remaining question mark is exactly how “dense” the wide FP instructions have to be, and how many you have to run, in order to trigger the L0 -> L1 and L1 -> L2 transitions.

      This puts aside actual chip-wide temperature, power or current throttling which is a different thing and doesn’t seem to kick in on the server chip we tested.

      1. Francois Piednoel says:

        Actually, they are not different things: they are all linked inside the PCU. There are more than 3 levels if you experiment properly. When you get into the transitional phase where your code is dense, the PCU will adjust to try to shave the TDP of your socket. This is why dense workloads using AVX512 do not all end up at 2.9GHz. You really need to understand the PCU if you want to understand the behavior of the Xeons.

        1. Travis Downs says:

          You might run into other types of throttling which cause the chips to deviate from the published turbo levels of the three licenses, especially if your cooling is inadequate or your chip has a low configured TDP, but in general our tests and the documentation seem to indicate that the three levels are largely what matters, especially for well-cooled, high-TDP server chips.

          Note that there are three licenses, but the frequency levels depend also on the active core count, as described above, so the frequency is a two-dimensional matrix: on a 14-core chip like the 5120, there are 3 * 14 = 42 possible frequency levels. Note that these values are published by Intel!
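
To make the "two-dimensional matrix" idea concrete, here is a minimal Python sketch of a license-by-core-count lookup. The table layout, the core-count buckets and most of the numbers are assumptions made up for illustration, not Intel's published table for any chip; only the license 2 values for 1 or 2 cores (2.9 GHz) and for 9 or more cores (1.6 GHz) come from this thread. The names `CORE_BUCKETS`, `TURBO_GHZ` and `max_turbo_ghz` are hypothetical.

```python
from bisect import bisect_left

# Upper bound of each active-core bucket ("up to N active cores").
# Bucket boundaries are an assumption for this sketch.
CORE_BUCKETS = [2, 4, 8, 14]

# Max turbo in GHz per license (0, 1, 2) and per bucket.
# Illustrative values only, NOT a published table.
TURBO_GHZ = {
    0: [3.2, 3.0, 2.9, 2.6],
    1: [3.1, 2.7, 2.3, 2.2],
    2: [2.9, 2.3, 1.9, 1.6],
}


def max_turbo_ghz(license_level, active_cores):
    """Look up the frequency cap for a license and an active core count."""
    if license_level not in TURBO_GHZ:
        raise ValueError("license must be 0, 1 or 2")
    if not 1 <= active_cores <= CORE_BUCKETS[-1]:
        raise ValueError("active core count out of range")
    # First bucket whose upper bound is >= active_cores.
    bucket = bisect_left(CORE_BUCKETS, active_cores)
    return TURBO_GHZ[license_level][bucket]


# 3 licenses * 14 core counts = 42 (license, core count) combinations,
# even though several of them share the same frequency.
combinations = [(lic, n) for lic in TURBO_GHZ for n in range(1, 15)]
print(len(combinations), max_turbo_ghz(2, 1), max_turbo_ghz(2, 9))
```

In this model the lookup only gives the cap; any throttling by the PCU would come on top of it and could only lower the result.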

          This is why dense workloads using AVX512 do not all end up at 2.9GHz.

          Based on our experiments, dense workloads ran at the expected frequency on all cores: which is only 2.9 GHz for 1 or 2 cores. For 9 or more cores, the frequency is 1.6 GHz, for example.

          Finally, there is also transition behavior, not described in this article, when entering and leaving the various licenses and also when changing core counts: but as far as I can tell this involves only throttling instruction dispatch and/or executing wide instructions on narrower units, and periods where the chip is halted, but not any new frequency levels.

          Do we agree yet?

          Note that you can run the same tests we did using avx-turbo.

  2. Ben says:

    Interesting article, I recently did some energy experiments on an Intel Skylake Gold 6154 based machine and also had to tackle these issues.

    I found the Microway database insightful on AVX-based frequency reductions.

    https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-skylake-sp-intel-xeon-processor-scalable-family-cpus/

    They imply that the chip is free to do whatever it wishes so long as it does not violate the TDP constraints. I’m not sure if this contradicts your findings with 3 discrete license states or not. Maybe that is how they choose to implement this feature. Although this wouldn’t take into account processor binning. Therefore I assumed it would be finer-grained than this.

    On our heavily vectorised (AVX-512 flops) code we actually found the frequency drop was not as large as initially feared. This could have had something to do with effective water-cooling which keeps the temperatures down.

    Would be so much easier if Intel gave some hints…rather than leaving it to educated guesswork!

    1. They imply that the chip is free to do whatever it wishes so long as it does not violate the TDP constraints.

      Can you point me at the exact quote where they imply that the chip is free to do anything?

      What is certainly true is that above and beyond downclocking, there is TDP-related frequency throttling.

      That is, you cannot be sure that the chip will run at the specified frequency. For example, if it gets too hot, it might run slower, certainly.

      I would not describe that as the chip being free to do whatever it wishes. That would be quite a painful design for software engineers to cope with.

      1. Travis Downs says:

        Perhaps Ben is referring to the charts in the section “AVX-512, AVX, and Non-AVX Turbo Boost”, which imply that each of the licenses cover a range of frequencies (with the range being very large in the case of the single-core frequencies).

        What these charts are showing is the turbo speed for the license and core count at the top of each interval and the published “base frequency” at the bottom of each interval. Since there is only a single published base frequency (which doesn’t depend on core count – hence should work even for max cores), the bottom limit of each range is the same for both graphs.

        Essentially this is a claim that you’ll get performance somewhere between the base frequency and the turbo frequency, which is correct in principle! Intel only really guarantees operation at the base frequency, and speeds above that are “opportunistic”, so you may not always get them depending on various factors. In practice, the turbo speeds are very reliable, unless you are doing something weird or you are using a very low TDP chip: you usually get exactly the max turbo you are allowed to get, consistently. People purchase chips based on that behavior too: you’d be pretty annoyed if for some reason you didn’t hit the published turbo frequencies.

        If someone has any evidence that Skylake Xeon chips consistently run at somewhere other than the published max turbo for the license and core count, with the standard TDP configuration (i.e., not setting a lower than expected TDP in the BIOS/firmware), I’d like to see it!

        1. Ben says:

          @Travis that was exactly the point I was trying to make, although I agree I was a bit ambiguous.

          Intel publishes essentially a list of guarantees for the maximum and minimum frequencies on the various possible workloads: number of cores executing different types of instructions. The Microway link I posted publishes these in a series of box plots. So when you buy a 3GHz chip, this is the minimum for non-vectorised operations on a single core, I believe, even though it will usually be able to run at near enough the maximum “TurboBoost” frequency.

          These guarantees are made to ensure that even the worst quality chips they push out can make those frequencies while staying within the TDP. In our testing we found that it often did much better than advertised. It never had to go down to the 1.6GHz AVX-512 minimum frequency.

          With this new knowledge of the licenses, I will try to find some time to go back over the data and see if there are any “clustering points” around these license frequencies.

          Maybe they are used as a hint to the core as to a rough frequency and then the PCU does the rest?

  3. Travis Downs says:

    Intel publishes essentially a list of guarantees for the maximum and minimum frequencies on the various possible workloads: number of cores executing different type of instructions.

    I have never seen published minimum values on a per-core basis. The microway link just uses the same minimum “base frequency” for the single-core and all-cores case, which is especially unrealistic (i.e., will never happen unless you put your CPU in an oven) for the single-core case. If you have a link to minimum per-core frequencies published by Intel, I’d like to see it!

    As far as I can tell, they publish only max turbo frequencies, and these are also the ones you care about because you’ll usually be running at that speed.

    Maybe they are used as a hint to the core as to a rough frequency and
    then the PCU does the rest?

    Here’s my rough model of how this works: first you have the deterministic published behavior described in this post, and also by Intel. This puts a hard cap on the max speed in a given configuration, and generally can’t be adjusted (outside of chips with “unlocked multipliers”). This works “deterministically” in a fairly simple way based on the core count + license charts, and is the same for every CPU of a given model. The only relevant numbers here are the “turbo” frequencies: I don’t think the base frequency ever comes into play. In general, the CPU will “try” to run at the speed looked up from the tables.

    These tables mean that CPUs will generally run “slower” if they are executing heavier instructions, or are running with more cores, but I wouldn’t call this throttling: there are just different design speeds for different points in this matrix. So to some extent they are the modern equivalent of the marketed CPU speeds, but for marketing and sanity reasons Intel is of course not printing this matrix “on the box”.

    Then, behind that, you have a complex PCU layer which has several feed-forward and feed-back mechanisms to monitor the predicted and/or actual power, current and temperature levels and to potentially apply additional throttling based on various thresholds. This can only slow down your chip compared to the design speeds, never speed it up.

    For example, it may measure the instantaneous current and use that to calculate instantaneous power, and then ensure that the power over some interval doesn’t exceed some threshold. It may use different thresholds for different time periods too: you may be allowed to run with a TDP of 130W for 20 seconds, but only 100W longer term. This type of speed adjustment by the PCU I would label as “throttling”. It may be implemented by changing the frequency/voltage (p-state change), or perhaps by clock gating to change the duty cycle (more instantaneous and fine-grained, but less efficient longer term).
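
A rough way to picture the "130W for a while, 100W longer term" behavior is a running average of power that must stay under the long-term limit, with a separate hard cap on the burst. The Python sketch below is only a toy model under that assumption; the real PCU algorithm is not public in this detail, and the function name, the exponential averaging and the parameter names are all hypothetical.

```python
def simulate_power_limit(demand_watts, long_term_w=100.0, burst_w=130.0,
                         window_s=20.0, dt=1.0):
    """Toy model: grant power per time step so that an exponentially
    weighted average never exceeds long_term_w and no single step
    exceeds burst_w. Not Intel's actual algorithm."""
    alpha = dt / window_s          # weight of the newest sample
    average = 0.0                  # running average power, starts "cold"
    granted = []
    for want in demand_watts:
        # Largest power that keeps the updated average <= long_term_w:
        # (1 - alpha) * average + alpha * p <= long_term_w
        headroom = (long_term_w - (1.0 - alpha) * average) / alpha
        power = max(0.0, min(want, burst_w, headroom))
        average = (1.0 - alpha) * average + alpha * power
        granted.append(power)
    return granted


# A load that always asks for 130 W gets it at first, then settles at 100 W.
trace = simulate_power_limit([130.0] * 200)
print(trace[0], round(trace[-1], 3))
```

With this kind of averaging the burst length depends on the time constant and on how "cold" the average was beforehand, which is one reason a short benchmark and a sustained load can report different speeds on the same machine.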

    In addition to power, the PCU will monitor temperature as kind of a last resort: the TDP throttling should prevent the temperature from rising too high in normal circumstances, but it’s not guaranteed, e.g., if the ambient temperature is high, the cooling system isn’t working properly (vent blocked, etc.) or the values are just too optimistic. If the temperature reaches some threshold, usually right around 100C, you’ll again get throttling.

    Many of these “throttling” behaviors are somewhat configurable: the manufacturer (might be the system integrator, or the motherboard manufacturer or the cloud provider or whatever in various scenarios) can actually set some of these values, rather than use the default. For example, a system with better than average cooling can set a higher TDP value: this doesn’t allow it to run faster than the max set by the matrix, but it might delay or eliminate TDP-related throttling by the PCU, since that’s using the configurable thresholds. Similarly, a cool & quiet system might want to set lower TDPs.

    These kinds of configurable thresholds are one reason you see CPU performance differences between different motherboards/systems even with the identical CPU (not the only reason: they may have slightly different base clocks too).

    Unlike the matrix lookup, this behavior isn’t really going to be deterministic from an end user point of view as it depends on many fine-grained internal and external factors. However, it is at least very observable: the PCU sets various bits in MSRs indicating what it did: throttling due to TDP limits, current throttling, temperature-throttling, etc – you can determine pretty much if any of them happened over any interval. Intel XTU on Windows shows some of this, but if you dig into the MSR you can get even more info.
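
As one hedged illustration of digging into the MSRs, the Python sketch below decodes a few bits of the architectural IA32_THERM_STATUS register (address 0x19C) and reads it through the Linux `msr` driver. The bit positions reflect my reading of Intel's SDM and should be checked against the manual for your CPU; the function names and the short flag names are made up for this example. Reading `/dev/cpu/N/msr` needs root and the `msr` kernel module, so only the decoder is exercised here.

```python
import os
import struct

IA32_THERM_STATUS = 0x19C  # per-core thermal status MSR

# Bit positions as I understand the SDM layout (verify for your CPU).
FLAG_BITS = {
    0: "thermal_status",
    1: "thermal_status_log",
    2: "prochot_or_forcepr_event",
    4: "critical_temperature_status",
    10: "power_limit_status",
    11: "power_limit_log",
}


def decode_therm_status(value):
    """Return the set flags and the digital readout (degrees C below TjMax)."""
    flags = [name for bit, name in FLAG_BITS.items() if value & (1 << bit)]
    reading_valid = bool(value & (1 << 31))
    readout = (value >> 16) & 0x7F if reading_valid else None
    return {"flags": flags, "degrees_below_tjmax": readout}


def read_msr(cpu, address):
    """Read one 64-bit MSR via the Linux msr driver (root required)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The file offset selects the MSR address; each read is 8 bytes.
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


# Decoding a synthetic value, so no hardware access is needed:
sample = (1 << 0) | (1 << 10) | (25 << 16) | (1 << 31)
print(decode_therm_status(sample))
```

On a real machine you would call `decode_therm_status(read_msr(0, IA32_THERM_STATUS))` for each core of interest; the "log" bits are sticky, which is what lets you tell whether throttling happened at any point over an interval.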

    Since you have these two quite different methods to determine the actual frequency, the question naturally arises: which is more important? Are you usually running at the “max turbo” that is described by the license + active core matrix? Or are you usually running in some kind of throttled state described in the second half of my description above? It depends on the chip. For server chips, like the ones we’ve tested for this article with AVX-512, it is my impression and observation that they can usually run indefinitely at their “max turbo” without throttling, at least with reasonable cooling (which cloud providers probably have). This was true also on the Skylake W-2104 that we tested, a “workstation” chip (but it’s a small, cheap one). So in those scenarios I think you should consider the max turbo the dominant factor.

    The situation is different for chips and devices with restricted TDP. A 4-core 15W laptop chip is probably not going to be able to run with all cores at max speed indefinitely: you’ll exceed the TDP. It probably will run that way for a short burst, since Intel has these thresholds where you can exceed your TDP for 10 seconds or something like that, but then it will slow down to keep the TDP at the configured value.

    This kind of behavior is common for the chips that go in small and light devices like thin laptops and tablets. You can see it in the “throttling” tests that some reviews provide, showing performance over time for sustained loads. It’s super prevalent in phones (which of course are not x86) and usually the dominant factor for sustained CPU loads over a minute or so.

    Not all laptop chips fall into this category though: I have a 45W i7-6700HQ (Skylake) and as far as I can tell it is OK to run pretty much any load on all cores at the max turbo (which on this chip varies only by core count, not by “license” – so that includes AVX2 FMA operations), although the fans do spin up and the CPU cores sit at about 90C.

    Well that’s my mental model anyways!

  4. Travis Downs says:

    After all that, I forgot to add the part about chip to chip variation that I was building towards…

    So the matrix will be the same for every chip of a given model, and the PCU algorithms and thresholds will likely also be the same, but two chips of the same model may run at slightly different frequencies and slightly different power draws depending on characteristics of that particular wafer, where the chip appeared in the wafer and minor process changes over time.

    So one chip might draw more power running the same load than another chip with the same model, which in the case that PCU throttling occurs, could mean that one chip runs faster. Of course this can also happen due to external factors, such as a server’s position within the rack, the socket’s position relative to the airflow, whatever.

    The key is that this should only apply to “throttling” type scenarios, and not to the usual operating mode that normally applies to servers. So you often won’t see any differences since you often don’t see throttling.

    For thin laptops and so on, you might be “always throttling” so it’s important there – but the external factors on laptops are huge and probably overwhelm any chip-to-chip variation: if you have it on your lap, the outside temperature, whether the GPU is being used, if the bottom vents are blocked, etc.

    About binning in general, remember that Intel is producing something like 20+ Xeon models from only 3 different dies, so there is already a huge opportunity for binning between the models since one die can make many different chips depending on its frequency response characteristics and any faulty cores.

    1. Ben says:

      A comprehensive reply, thanks! I agree, and I think your mental model matches mine!

      I have never seen published minimum values on a per-core basis. The microway link just uses the same minimum “base frequency” for the single-core and all-cores case, which is especially unrealistic (i.e., will never happen unless you put your CPU in an oven) for the single-core case. If you have a link to minimum per-core frequencies published by Intel, I’d like to see it!

      Sorry I realised Microway didn’t publish it for the Skylake line (not sure why this is the case). In my experiments I was comparing it to a Broadwell chip for which they did:
      https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-intel-xeon-e5-2600v4-broadwell-ep-processors/

      Under: “Top Clock Speeds for Specific Core Counts”

      I see no reason why the situation is not similar for Skylake and newer models.

      The key point from your post that resonated with me is that Intel has to be flexible. They do not know where this chip will end up. It might end up living the life of luxury in a watercooled rack on its own or crammed in to an overheating server next to 20 others! That’s why I find it hard to believe this “license” idea works. It’s impossible to prescribe a frequency for a certain instruction unless you know the operating conditions.

      1. Travis Downs says:

        That’s why I find it hard to believe this “license” idea works. It’s
        impossible to prescribe a frequency for a certain instruction unless
        you know the operating conditions.

        To be clear, I think the license idea works, and while it won’t always be the thing that determines the actual speed, in many situations it will often or nearly always be the thing that determines the speed, to the extent that for your analysis you can use the simplified model where the license model is the only thing that exists (this is particularly true for the server chips we are discussing).

        Most of the buyers who shell out thousands for a Skylake Xeon are likely to be professionals who will install it in a properly cooled rack, I think. In fact, the primary consumers to date are no doubt the big 3 (ish) cloud providers (Amazon, Google, Microsoft).

        Note that the license model isn’t something that we made up: Intel documents it themselves and these terms come straight from their documents. They even have performance counters in SKX that will tell you exactly what fraction of the time you spent in license 0, 1 or 2 as mentioned above (the CORE_POWER.LVL0_TURBO_LICENSE and related events).
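
For anyone who wants to turn those counters into "fraction of time per license", a minimal Python sketch follows. It assumes you have already collected three cycle counts for a run, one per license (CORE_POWER.LVL0_TURBO_LICENSE and its LVL1 and LVL2 siblings), with whatever tool you prefer, for example perf. The function name and the sample counts are hypothetical.

```python
def license_shares(lvl0_cycles, lvl1_cycles, lvl2_cycles):
    """Fraction of counted core cycles spent in each license level."""
    total = lvl0_cycles + lvl1_cycles + lvl2_cycles
    if total == 0:
        raise ValueError("no cycles counted")
    return {
        "L0": lvl0_cycles / total,
        "L1": lvl1_cycles / total,
        "L2": lvl2_cycles / total,
    }


# Made-up counts: mostly license 0, a short heavy AVX-512 phase in license 2.
shares = license_shares(750, 200, 50)
print(shares)
```

If the L2 share is close to zero for your workload, the lowest row of the frequency matrix is probably not what limits you.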

    2. Francois Piednoel says:

      You understand that I was one of the performance architects of Core and that I was known to be a very good software optimizer, just to make sure before we go forward.

      So, yes, there are hard limits, and it is not up for discussion: those are facts you can find here: https://en.wikichip.org/wiki/intel/xeon_gold/6154 (for the hard limits)

      The PCU does try to get you to your maximum TDP to maximize performance, so if your instruction stream does not include any particular instruction set, you are likely to end up higher than the supposed base frequency. For example, when you run Cinebench, you run one or two bins higher than the 3.7 GHz that Xeon is supposed to be limited to, at the beginning; later, it may drop, depending on whether your cooling can keep up with the energy produced.
      Then, when you get to AVX2 or AVX512, you have the same mechanism in place: the PCU knows the floor of the respective instruction set, based on how many cores are active and how dense your instruction stream is.
      Then the PCU will regulate: if you are using 512-bit SIMD adds, for example, you will not drop to the minimum 2.8 GHz, you will operate around 3.2 GHz (just tested it).
      If you add FMAs with no dependencies between the two FMAs, in 512 bits, you will go down to 2.8 GHz (just tested it too).

      Those mechanisms are working this way, any other way to look at this is voodoo stuff.
      For the fun of it , a video of me speaking of a top end config of Intel when I was working there. https://www.youtube.com/watch?v=2W_79ZUyYWw

      1. BeeOnRope says:

        I think we disagree, but it’s hard for me to be certain, because I’m not totally clear on what you are claiming – so it’s possible we don’t disagree at all.

        When you say “so, if your instruction stream does not include any particular instruction set, you are likely to end up higher than the supposed base frequency, for example, when you run cinebench, you do run one or 2 bins higher than the 3.7Ghz that Xeon is supposed to be limited to”, in your reference to the “supposed base frequency” are you talking about the turbo frequencies (which are 3.7, 3.6, 3.5 GHz for 1 active core for L0, L1, L2 respectively for the Gold 6154 you linked to), or are you talking about the “base” frequency of 3.0, 2.6, 2.1 given in the base column?

        If you are talking about the latter (“base frequency”) then I think we agree on at least that part: the CPU will generally always be running at one of the turbo speeds, which is greater than the base frequency. Something would have to be quite unusual (configuration-wise or hardware) for the chip to run in a sustained manner only at the base frequency, which is much lower than even the worst-case turbo frequencies.

        If you are talking about the former (turbo frequencies), then your claim is that the chip can run for some period above the turbo frequency for its license, right? That seems remarkable (and in direct contradiction to your previous paragraph where you say these are “hard limits”) – and I’d like to see a reproducible test that shows it! You can start with avx-turbo as a base or do it from scratch, or use existing tools – just make it open and reproducible.

        The only time-based effect I’m aware of that kind of aligns with your description is the ability to exceed the long-term TDP threshold for various time periods, e.g., run up to 140W on a 100W TDP chip for 1 second and up to 125W for 14 seconds or something like that. That’s a separate mechanism and it doesn’t let you go above the turbo frequency matrix, it just lets you exceed the TDP for a short period, effectively using the thermal mass of the chip and cooling solution as a buffer to absorb heat above the long-term cooling capability. These values are configurable in the BIOS/firmware. This feature is especially useful and commonly triggered in low-TDP chips like < 40W thin-and-light laptop chips.

        Then, The PCU will regulate, if you are using SIMD 512 bits adds, for
        example, you will not drop to the minimum 2.8Ghz, you will operate
        around 3.2Ghz (Just tested it)

        Can you share your test? Or at least describe the inner loop in terms of what type of adds and how they were dependent. This wouldn’t be surprising – as described above, most AVX-512 kernels will run in the L1 license, which is 3.3 GHz: only if you have a “dense enough” sequence of heavy AVX-512 instructions will you drop to L2. I wouldn’t be surprised to find out that the measurement of “dense enough” depends in a fine-grained way on the actual instructions. I would be surprised if the CPU uses a frequency higher than the turbo frequency for the current license and active core count. I would also be surprised if the CPU dynamically selects a frequency lower than the max-turbo frequency, except where max TDP throttling or another type of throttling is occurring. Perhaps if you run the 6154 at 18 active cores with a heavy AVX-512 load you get TDP throttling, and in that case I’d completely agree that you can see lower frequencies: but I haven’t run into TDP throttling on the server chips I’ve tested (which don’t include the 6154) or that other people have tested with avx-turbo and provided the results.

        Those mechanisms are working this way, any other way to look at this
        is voodoo stuff.

        You should be more specific about what you disagree with (and try to use precise language in order to reduce misunderstandings) – but I’m quite convinced it’s not voodoo. This post is just a restatement of how Intel themselves describe this working – both in marketing materials and technical documents. They themselves publish the frequency matrices! They invented terms we use here like “license” and they have performance counters which use these terms and show you exactly how many cycles you spend running in each license.

        More importantly, our testing code is completely open and our results can be reproduced by anyone. Our results line up exactly with how Intel describes the system working, and are generally precisely and exactly reproducible. I welcome you to provide reproducible evidence to the contrary (and first to be specific about what you disagree about), rather than vague appeals to authority or youtube links.

  5. I spent a week collecting all the public info available on this topic online, to make sure I was not pushing Intel confidential info, and I did not:

    I think you do not understand Turbo Max 3.0 at all, so I recommend that you go and read the links attached. (See here that Xeon Gold supports Turbo 3.0: https://ark.intel.com/products/120492/Intel-Xeon-Gold-6130-Processor-22M-Cache-2_10-GHz)

    Turbo 3.0 Max in more detail:
    https://www.pcper.com/reviews/Processors/Intel-Core-i7-6950X-10-core-Broadwell-E-Review/Intel-Turbo-Boost-Max-Technology-3

    Then, please read about Speedshift:
    https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/7 (supported by Xeon Gold too)

    Then, read this:
    https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/7 (the code in this article here was written by my peer fellow Principal Engineer in my group then)

    Now, important point to notice in the definition of turbo boost:
    https://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html

    “Availability and frequency upside of Intel® Turbo Boost Technology 2.0 state depends upon a number of factors including, but not limited to, the following:
    Type of workload
    Number of active cores
    Estimated current consumption
    Estimated power consumption
    Processor temperature”

    AND PLEASE NOTICE “but not limited to” part of it …

    Then, when you are done with this, you have to go and read carefully how Turbo 2.0 works here: https://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf

    Then, understand that Turbo 3, SpeedShift/SpeedStep and Turbo 2.0 are stacked on top of each other.

    So, AVX512 has the transitional mode that you keep denying exists. You can run a medium-density AVX512 instruction stream and not end up at the lowest frequency attributed to AVX512 by the frequency table. This is how it works, and if you do not agree with all of this, well, I can’t help you.