Daniel Lemire's blog


Vectorized shifts: are immediates faster?

One thought on “Vectorized shifts: are immediates faster?”

  1. Travis Downs says:

    For some reason, Intel has preserved the older form of the instructions.

    That’s not really weird at all. Of course once Intel introduces an instruction they are pretty much bound to support it forever, lest they break binary compatibility for any code using it.

    As far as I know, Intel has never removed published x86 or x86-64 instructions once introduced (the same isn’t true for AMD, which due to market dynamics did backtrack on things like 3DNow! and the XOP instruction set to align with Intel’s extensions instead).

    It is worth noting that there are actually three levels of “variability” in the shift instructions:

    1) Compiled-in immediate (i.e., the amount to shift is fixed at compile time and applies to all elements).
    2) Runtime amount but the same for all elements.
    3) Runtime amount and may be different for all elements.

    Conceptually there is a fourth possibility, “compile-time immediate, may be different per element”, but it has never been supported (indeed, the immediate would be huge).
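    The three supported levels map directly onto distinct intrinsics. A minimal sketch, assuming SSE2 for (1) and (2) and AVX2 for (3) (compile with -mavx2; the array names are just for illustration):

    ```c
    #include <immintrin.h>
    #include <stdio.h>

    int out1[4], out2[4], out3[4];

    int main(void) {
        __m128i v = _mm_set_epi32(8, 4, 2, 1); /* lanes: 1, 2, 4, 8 */

        /* (1) PSLLD xmm, imm8: shift count fixed at compile time */
        __m128i a = _mm_slli_epi32(v, 2);

        /* (2) PSLLD xmm, xmm: runtime count, same for every lane,
               taken from the low 64 bits of the second register */
        int n = 2;
        __m128i b = _mm_sll_epi32(v, _mm_cvtsi32_si128(n));

        /* (3) VPSLLVD (AVX2): an independent runtime count per lane */
        __m128i c = _mm_sllv_epi32(v, _mm_set_epi32(3, 2, 1, 0));

        _mm_storeu_si128((__m128i *)out1, a);
        _mm_storeu_si128((__m128i *)out2, b);
        _mm_storeu_si128((__m128i *)out3, c);
        printf("%d %d %d %d\n", out3[0], out3[1], out3[2], out3[3]);
        return 0;
    }
    ```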

    You compared (1) and (3), and indeed on Skylake they are documented to have identical performance. Oddly, variant (2) is slower than either. It seems that on Skylake, variant (2) is implemented as one uop that broadcasts the shift amount to every element of a temporary register, followed by the same uop as the fully variable shift (3), making it one uop slower.
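    That decomposition can be mimicked in source: for in-range counts, variant (2) gives the same result as broadcasting the count and using the per-lane shift (3). A sketch, assuming AVX2 (compile with -mavx2):

    ```c
    #include <immintrin.h>

    int same;

    int main(void) {
        __m128i v = _mm_set_epi32(40, 30, 20, 10);
        int n = 3;

        /* variant (2): one runtime count applied to all lanes */
        __m128i r2 = _mm_sll_epi32(v, _mm_cvtsi32_si128(n));

        /* broadcast the count, then use the per-lane shift (3) --
           the same two steps the hardware reportedly performs */
        __m128i r3 = _mm_sllv_epi32(v, _mm_set1_epi32(n));

        /* all 16 byte-compare bits set means the results match */
        same = (_mm_movemask_epi8(_mm_cmpeq_epi32(r2, r3)) == 0xFFFF);
        return 0;
    }
    ```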

    In earlier uarches like Haswell, the situation was different: (1) and (2) had the same performance as on Skylake (i.e., (2) one uop slower than (1)), but the fully variable shifts (3) were considerably slower, taking 3 uops each. On that platform, using the immediate-operand shifts can help a lot.
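    One way to exploit that on Haswell-era hardware is to dispatch on a few common counts so the compiler can emit the imm8 form for the hot cases. A sketch; the helper name shift_left32 and the chosen cases are made up here:

    ```c
    #include <immintrin.h>

    int result[4];

    /* Dispatch on common counts so the compiler emits the cheap
       immediate form (1); fall back to variant (2) otherwise. */
    static __m128i shift_left32(__m128i v, int n) {
        switch (n) {
        case 1:  return _mm_slli_epi32(v, 1);
        case 2:  return _mm_slli_epi32(v, 2);
        case 3:  return _mm_slli_epi32(v, 3);
        default: return _mm_sll_epi32(v, _mm_cvtsi32_si128(n));
        }
    }

    int main(void) {
        __m128i v = _mm_set_epi32(4, 3, 2, 1); /* lanes: 1, 2, 3, 4 */
        _mm_storeu_si128((__m128i *)result, shift_left32(v, 3));
        return 0;
    }
    ```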