, 9 min read
AVX-512: when and how to use these new instructions
Our processors typically do computations using small data stores called registers. On 64-bit processors, 64-bit registers are frequently used. Most modern processors also have vector instructions and these instructions operate on larger registers (128-bit, 256-bit, or even 512-bit). Intel’s new processors have AVX-512 instructions. These instructions are capable of operating on large 512-bit registers. They have the potential of speeding up some applications because they can “crunch” more data per instruction.
However, some of these instructions use a lot of power and generate a lot of heat. To keep power usage within bounds, Intel reduces the frequency of the cores dynamically. This frequency reduction (throttling) happens in any case when the processor uses too much power or becomes too hot. However, there are also deterministic frequency reductions based specifically on which instructions you use and on how many cores are active (downclocking). Indeed, when any 512-bit instruction is used, there is a moderate reduction in speed, and if a core uses the heaviest of these instructions in a sustained way, the core may run much slower. Furthermore, the slowdown is usually worse when more cores use these new instructions. In the worst case, you might be running at half the advertised frequency and thus your whole application could run slower. On this basis, some engineers have recommended that we disable AVX-512 instructions default on our servers.
So what do we know about the matter?
- The term “AVX-512” can describe instructions operating on various register lengths (128-bit, 256-bit and 512-bit). When discussing AVX-512 downclocking, we mean to refer only to the instructions acting on 512-bit registers. Thus you can “safely” benefit from many new AVX-512 instructions and features such as mask registers and new memory addressing modes without ever worrying about AVX-512 downclocking, as long as you operate on shorter 128-bit or 256-bit registers. You should never get any downclocking when working on 128-bit registers.
- Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).
- There are heavy and light instructions. Heavy instructions are those involving floating point operations or integer multiplications (since these execute on the floating point unit). It seems like the leading-zero-bits AVX-512 instructions are also considered heavy. Light instructions include integer operations other than multiplication, logical operations, data shuffling (such as vpermw and vpermd) and so forth. Heavy instructions are common in deep learning, numerical analysis, high performance computing, and some cryptography (i.e., multiplication-based hashing). Light instructions tend to dominate in text processing, fast compression routines, vectorized implementations of library routines such as
memcpy
in C or System.arrayCopy in Java, and so forth. - Intel cores can run in one of three modes: license 0 (L0) is the fastest (and is associated with the turbo frequencies written on the box), license 1 (L1) is slower and license 2 (L2) is the slowest. To get into license 2, you need sustained use of heavy 512-bit instructions, where sustained means approximately one such instruction every cycle. Similarly, if you are using 256-bit heavy instructions in a sustained manner, you will move to L1. The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1: the processor stops and changes its frequency as soon as an instruction is encountered.On server-class processor, the downclocking is determined on a per-core basis based on the license and the total number of active cores, on the same CPU socket, irrespective of the license of the other cores. That is, to determine the frequency of core under downclocking, you need only to know its license (determined by the type of instructions it runs) and count the number of cores where code is running. Thus you cannot downclock other cores on the same socket, other than the sibling logical core when hyperthreading is used, merely by running heavy and sustained AVX-512 instructions on one core. If you can isolate your heavy numerical work on a few cores (or just one), then the downclocking is limited to these cores. On linux, you can control which cores are running your processes using tools such as taskset or numactl.
You will find tables online like this one, for the Intel Xeon Gold 5120 processor…
mode | 1 active core | 9 active cores |
---|---|---|
Normal | 3.2 GHz | 2.7 GHz |
AVX2 | 3.1 GHz | 2.3 GHz |
AVX-512 | 2.9 GHz | 1.6 GHz |
We have chosen to only include two columns. The frequency behavior is the same for 9 to 12 cores and is the worst case for the L2 license. When you have more than 9 active cores, there is no further downclocking documented for the L2 license.
These tables are somewhat misleading. The row “AVX-512” (the L2 license) really means “sustained use of heavy AVX-512 instructionsâ€. The row “AVX2” (L1 license) includes all other use of AVX-512 instructions and heavy AVX2 instructions. That is, it is wrong to assume that the use of any AVX-512 instruction puts the cores into the frequency indicated by the AVX-512 row.
These tables do give us some useful information, however:
- a. These tables indicate that frequency downclocking is not specific to AVX-512. If you have many active cores, you will get downclocking in any case, even if you are not using any AVX-512 instructions.
- b. If you just use light AVX-512, even if it is across all cores, then the downclocking is modest (15%).
- c. If you are doing sustained heavy numerical work while many cores are active, then the downclocking becomes significant on these cores (~40%).
Should you be using AVX-512 512-bit instructions? The goal is never to maximize the CPU frequency; if that were the case people would use 14-core Xeon Gold processor with a single active core. These AVX-512 instructions do useful work. They are powerful: having registers 8 times larger can allow you to do much more work and far reduce the total number of instructions being issued. We typically want to maximize the amount of work done per unit of time. So we need to make engineering decisions. It is not the case that a downclocking of 10% means that you are going 10% slower, evidently.
Here are some pointers:
- Engineers should probably use tools to monitor the frequency of their cores to ensure they are running in the expected license. Massive downclocking is then easily identified. For example, the perf stat command on Linux can be used to determine the average frequency of any process, and finer grained details are available using the CORE_POWER.LVL0_TURBO_LICENSE event (and the identical events for
LVL1
and LVL2). - On machines with few cores (e.g., standard PC), you may never get the kind of massive downclocking that we can see on a huge chip like the Xeon Gold processor. For example, on on Intel Xeon W-2104 processor, the worse downclocking for a single core is 2.4 GHz compared to 3.2GHz. A 25% reduction is frequency is maybe not an important risk.
- If your code at least partly involves sustained use of heavy numerical instructions you might consider isolating this work to specific threads (and hence cores), to limit the downclocking to cores that are taking full advantage of AVX-512. If this is not practical or possible, then you should mix this code with other (non-AVX-512 code) with care. You need to ensure that the benefits of AVX-512 are substantial (e.g., more than 2x faster on a per cycle basis). If you have AVX-512 code with heavy instructions that runs 30% faster than non-AVX-512 on a per-cycle basis, it seems possible that once it is made to run on all cores, you will not be doing well.For example, the openssl project used heavy AVX-512 instructions to bring down the cost of a particular hashing algorithm (poly1305) from 0.51 cycles per byte (when using 256-bit AVX instructions) to 0.35 cycles per byte, a 30% gain on a per-cycle basis. They have since disabled this optimization.
- The bar for light AVX-512 is lower. Even if the work is spread on all cores, you may only get a 15% frequency on some chips like a Xeon Gold. So you only have to check that AVX-512 gives you a greater than 15% gain for your overall application on a per-cycle basis.
- Library providers should probably leave it up to the library user to determine whether AVX-512 is worth it. For example, one may provide compile-time options to enable or disable AVX-512 features, or even offer a runtime choice. Performance sensitive libraries should document the approach they have taken along with the likely speedups from the wider instructions.
- A significant problem is compiler inserted AVX-512 instructions. Even if you are not using any explicit AVX-512 instructions or intrinsics, compilers may decide to use them as a result of loop vectorization, within library functions and other optimization. Even something as simple as copying a structure may cause AVX-512 instructions to appear in your program. Current compiler behavior here varies greatly, and we can expect it to change in the future. In fact, it has already changed: Intel made more aggressive use of AVX-512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option.Based on some not very comprehensive tests of LLVM’s clang (the default compiler on macOS), GNU gcc, Intel’s compiler (icc) and MSVC (part of Microsoft Visual Studio), only clang makes agressive use of 512-bit instructions for simple constructs today: it used such instructions while copying structures, inlining memcpy, and vectorizing loops. The Intel icc compiler and gcc only seem to generate AVX-512 instructions for this test with non-default arguments: -qopt-zmm-usage=high for icc, and -mprefer-vector-width=512 for gcc. In fact, for most code, such as the generated copies, gcc seems to prefer to use 128-bit registers over 256-bit ones. MSVC currently (up to version 2017) doesn’t support compiler generated use of AVX-512 at all, although it does support use of AVX-512 through the standard intrinsic functions.
From the compiler’s perspective, deciding to use AVX-512 instructions is difficult: they often provide a reasonable local speedup, but at the possible cost of slowing down the entire core. If such instructions are frequent enough to keep the core running in the L1 license, but not frequent enough to produce enough of a speedup to counteract the slowdown, the program may run slower overall after you recompile to support AVX-512. It is hard to give general recommendations here beyond compiling your program both with and without AVX-512 and benchmarking in a realistic environment to determine which is faster. Because of the large variation in AVX-512 behavior across active core counts and Intel hardware, one should ensure they match these factors as closely as possible when testing performance.
Future work:
- It seems that there is a market for a tool that would monitor the workload of a server and identify when and why downclocking occurs.
- Operating systems or application frameworks could assign threads to specific cores according to the type of instructions they are using and the anticipated license.
Final words: AVX-512 instructions are powerful. With great power comes great responsibility. It seems unwarranted to disable AVX-512 by default at this time. Instead, the usual engineering evaluations should proceed.
Credit: This post was co-authored with Travis Downs.