Daniel Lemire's blog


Optimizing compilers reload vector constants needlessly

7 thoughts on “Optimizing compilers reload vector constants needlessly”

  1. Philip Trettner says:

    There is one potential reason why GCC loads the constant twice: in the assembly you can see the “jb .L2” -> “jb .L8” -> ret path, which never loads the constant. From the assembly alone you cannot say a priori that each loop is entered at least once, or even that if one is entered, the other is entered as well. If one loop is taken and the other is not, you would need the constant in the common parent of those blocks, and hoisting it there would pessimize the “no loop is taken” path from the beginning (see the sketch below).

    Some of the obvious optimizations like merging the two loops are also not really allowed because the ranges might overlap in memory. A simple __restrict doesn’t seem to help though.
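
    To make the control-flow argument concrete, here is my own minimal sketch (not the code from the post): since neither trip count is guaranteed to be non-zero, the only block that dominates both loops is the function entry, so hoisting the constant load there would pay for it even when the function does nothing.

      #include <x86intrin.h>
      #include <stdint.h>
      #include <stddef.h>

      // both loops may execute zero times; loading the constant in the
      // entry block would run it even on the "no loop is taken" path
      void process_maybe_empty(uint32_t *in1, size_t len1,
                               uint32_t *in2, size_t len2) {
        __m256i c = _mm256_set1_epi32(10001);
        for (size_t i = 0; i + 8 <= len1; i += 8) {
          __m256i x = _mm256_loadu_si256((__m256i *)(in1 + i));
          _mm256_storeu_si256((__m256i *)(in1 + i), _mm256_add_epi32(c, x));
        }
        for (size_t i = 0; i + 8 <= len2; i += 8) {
          __m256i x = _mm256_loadu_si256((__m256i *)(in2 + i));
          _mm256_storeu_si256((__m256i *)(in2 + i), _mm256_add_epi32(c, x));
        }
      }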

    1. Even if you make sure that the constant is used at least twice, unconditionally, GCC still loads the constant twice…

      I agree that you can eventually get GCC to stop doing that with enough coddling… but that does not help at scale…

      #include <x86intrin.h>
      #include <stdint.h>
      #include <stddef.h>
      // note: the pointers are not const since we write through them, and
      // the function assumes len >= 8 so that the constant is used at
      // least twice unconditionally
      void process_avx2(uint32_t *in1, uint32_t *in2, size_t len) {
        // define the constant, 8 x 10001
        __m256i c = _mm256_set1_epi32(10001);
        const uint32_t *finalin1 = in1 + len;
        const uint32_t *finalin2 = in2 + len;
        {
          // load 8 integers into a 32-byte register
          __m256i x = _mm256_loadu_si256((__m256i *)in1);
          // add the 8 integers just loaded to the 8 constant integers
          x = _mm256_add_epi32(c, x);
          // store the 8 modified integers
          _mm256_storeu_si256((__m256i *)in1, x);
          in1 += 8;
        }
        for (; in1 + 8 <= finalin1; in1 += 8) {
          // load 8 integers into a 32-byte register
          __m256i x = _mm256_loadu_si256((__m256i *)in1);
          // add the 8 integers just loaded to the 8 constant integers
          x = _mm256_add_epi32(c, x);
          // store the 8 modified integers
          _mm256_storeu_si256((__m256i *)in1, x);
        }
        {
          // load 8 integers into a 32-byte register
          __m256i x = _mm256_loadu_si256((__m256i *)in2);
          // add the 8 integers just loaded to the 8 constant integers
          x = _mm256_add_epi32(c, x);
          // store the 8 modified integers
          _mm256_storeu_si256((__m256i *)in2, x);
          in2 += 8;
        }
        for (; in2 + 8 <= finalin2; in2 += 8) {
          // load 8 integers into a 32-byte register
          __m256i x = _mm256_loadu_si256((__m256i *)in2);
          // add the 8 integers just loaded to the 8 constant integers
          x = _mm256_add_epi32(c, x);
          // store the 8 modified integers
          _mm256_storeu_si256((__m256i *)in2, x);
        }
      }
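
      For completeness, a hypothetical caller (my own sketch): process_avx2 as written assumes len >= 8 and leaves any tail of len % 8 elements untouched, so the driver below uses a multiple of 8.

      #include <stdlib.h>

      int main(void) {
        size_t len = 64; // must be at least 8 for process_avx2 above
        uint32_t *a = malloc(len * sizeof(uint32_t));
        uint32_t *b = malloc(len * sizeof(uint32_t));
        if (a == NULL || b == NULL) return 1;
        for (size_t i = 0; i < len; i++) {
          a[i] = (uint32_t)i;
          b[i] = (uint32_t)(2 * i);
        }
        process_avx2(a, b, len);
        free(a);
        free(b);
        return 0;
      }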
      
  2. tarq says:

    I don’t think this would have any kind of measurable impact at scale; other kinds of issues, such as pushing/popping registers for each function call, play a stronger role here.
    We know that, when doing higher-level programming in C or C++, we leave this kind of control to the compiler.
    The issue is always the same: we trust the compiler to do a decent job, and if that is not enough we dive into the assembly to squeeze out the last few cycles we can.
    That shouldn’t be an issue if you’re already proficient at writing SIMD code and looking at the assembly output.
    Always appreciate thoughts on that matter though 🙂

    1. I agree that it is unlikely to have a measurable impact on performance.

  3. Daniel Berlin says:

    There is definitely the “not sure if the loop executes” problem mentioned earlier, which causes GCC to place the load so that it executes once per loop, because it thinks that guarantees the minimum number of executions possible (by the CFG).

    What is happening otherwise (i.e., if you make the loops constant-number-of-iterations for loops) is that constant propagation at a high level determines the vector is a constant and propagates it forward into both loops, which is fine (a sketch of such a variant appears after this comment). Note that it is expected that later, after lowering, etc., something with a machine cost model will commonize it if necessary.

    The low level definitely knows it is constant in both cases.
    But it does not compute that commonizing it will save anything, from what I can tell (I haven’t looked at every single pass dump at the RTL level to verify this, only a few that I would have expected to eliminate it).

    See https://godbolt.org/z/jxWKcnTT1
    This will show you the ccp1 pass, which propagates the constant forward.
    If you swap over to the final RTL pass, you can see it knows it is equivalent to a constant.
    Nothing in between CSEs it, even if I turn on size optimization.

    This is likely related to believing that constants are free in most cases (at the high level this is definitely the right view; as I said, at the low level, where it has a machine cost model, it is weirder that it doesn’t eliminate it even though it’s a constant).
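
    To make that scenario concrete, here is my own sketch of a constant-trip-count variant (my reconstruction, not the code behind the godbolt link); compiling it with GCC and adding -fdump-tree-all / -fdump-rtl-all is one way to browse the ccp and RTL pass dumps mentioned above.

      #include <x86intrin.h>
      #include <stdint.h>

      // constant-trip-count variant: both loops always execute, so the
      // "maybe the loop never runs" argument no longer applies, yet the
      // constant can still end up materialized once per loop
      void process_fixed(uint32_t *in1, uint32_t *in2) {
        __m256i c = _mm256_set1_epi32(10001);
        for (int i = 0; i < 128; i += 8) {
          __m256i x = _mm256_loadu_si256((__m256i *)(in1 + i));
          _mm256_storeu_si256((__m256i *)(in1 + i), _mm256_add_epi32(c, x));
        }
        for (int i = 0; i < 128; i += 8) {
          __m256i x = _mm256_loadu_si256((__m256i *)(in2 + i));
          _mm256_storeu_si256((__m256i *)(in2 + i), _mm256_add_epi32(c, x));
        }
      }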

  4. Is there any difference in declaring:

    const register __m256i c = _mm256_set1_epi32(10001);

    It seems the register keyword might be ignored.

    1. The register keyword in such a context seems quite useless in my experience.