30th December 2016, 8 min read

Can your C compiler vectorize a scalar product?

11 thoughts on “Can your C compiler vectorize a scalar product?”

G.T. says:

December 30, 2016 at 12:10 pm

Can gcc similarly vectorize comlpex version of this code?
1. Daniel Lemire says:
  
  December 30, 2016 at 5:47 pm
  
  GCC can vectorize, but not as well as clang.
Leonid Boytsov says:

December 30, 2016 at 3:42 pm

Ohhh wow. This is interesting from several points of view. We have been trying for a while to speed up scalar product with AVX. The gains were only marginal (say 10-20%). However, this version is, indeed, faster. So, this is the first piece of code useful to me, where AVX is much faster.

Second, auto-vectorization is actually quite unreliable. For one thing, only very simple things can be vectorized. For another, what I noticed that it varies quite a bit accross compilers. Your example is great, because the new Clang does use a new command, the gcc (4.x and 5.x alike) does not! What is worse, older versions of Clang sometimes refused to vectorize even scalar products and similar simple things. So, basically, you shouldn’t rely on the compiler.

Third, it is obvious that when things became so messy, the Intel architecture is up for a huge redesign. They can’t pile one set of crutches over another one. SIMD should be implemented in a more generic fashion.
1. Arun Nair says:
  
  December 30, 2016 at 4:52 pm
  
  @Leonid:
  I’m not a compiler expert, but in general, a requirement for autovectorization is the absence of loop-carried dependencies across adjacent loops (i.e. one iteration of the loop should be independent of the other). In this case, the compiler will unroll the loop and group instructions together in a SIMD fashion. One has to do the same thing to get one’s code accelerated on a GPGPU (that, and more).
  
  Intel’s icc compiler generally does a far better job of autovectorization than clang or gcc. Perhaps it identifies vectorization opportunities better than clang/gcc. It does vary greatly between versions of compilers presumably because these capabilities are getting added over time.
  
  Finally, stating the obvious, but vectorization will only really help you if your code is bottlenecked by compute, and not waiting on memory.
  1. Daniel Lemire says:
    
    December 30, 2016 at 5:45 pm
    
    Intel’s icc compiler generally does a far better job of autovectorization than clang or gcc.
    
    Though I do not report my results with Intel’s icc, I tested it and it did no better than GCC. They are both bested by clang.
    1. Arun Nair says:
      
      December 30, 2016 at 7:16 pm
      
      Ah, yes, I should’ve compared the disassembly in the compiler explorer itself. Since sum is a loop-carried dependence, the common loop-unrolling trick doesn’t work. From glancing at the disassembly, clang’s output does a vectorized reduction until the iteration count is less than 32, and then switches to scalar for the final reduction.
      
      As a very basic experiment, if you replace “sum” in your code with another array that is indexed by variable “k” in an outer for loop, or pass sum as a pointer into the function, the vectorization goes away. It may be due to the conservatism around pointer aliasing, or that clang is looking for a very specific pattern that looks like a reduction. Even so, very impressive.
      
      In general, however, my experience has been that icc does (or at least did a year ago) vectorize much better than gcc and clang in the general case.
      1. Arun Nair says:
        
        December 30, 2016 at 7:28 pm
        
        PS: If you don’t pass the outer array (indexed on k, as noted above) as a parameter (malloc internally and return a pointer), it vectorizes fine. So, I think it’s the passing a pointer part that trips it up.
        
        Thanks, I learned something today 🙂
Arun Nair says:

December 30, 2016 at 5:02 pm

Daniel: Here’s a link to ffast-math. https://gcc.gnu.org/wiki/FloatingPointMath. You may get a different result from the unvectorized version of this code due to rounding issues in floating point (a*b+a*c != a*(b+c) in FP). Vectorization may result in arithmetic order being changed (mathematically correct if you had infinite precision FP). It also treats denormals (very small fractions) as zero. Most scientific code is OK with this, but it’s something to be aware of.
me says:

January 3, 2017 at 10:04 pm

Many data sets will be low-dimensional (say, less than 8 dimensions), or will use spare representations.
1. Daniel Lemire says:
  
  January 4, 2017 at 2:12 am
  
  This is a microbenchmark but some applications will definitively benefit from vectorized scalar products.
2. Leonid Boytsov says:
  
  January 12, 2017 at 1:51 am
  
  For sparse representation, vectorization is also possible (see, e.g., https://github.com/searchivarius/nmslib/blob/master/similarity_search/src/distcomp_sparse_scalar_fast.cc), but it currently won’t be autovectorized.