Stanley Lee says:
Wow, there are a lot of intricate details to know here that could be useful. Thanks for sharing!
John Regehr says:
Nice to see measurement-based blogging :).
When compiling any code of this sort, you should try the Intel compiler as well.
10,000 reps may be enough for your simple rdtsc to give reliable results, but I wouldn’t count on it. It’s easy to do better:
http://blog.regehr.org/archives/330
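For a concrete illustration, here is a minimal sketch of one common way to get steadier timings — repeat the measurement and keep the minimum. It is only an illustration, not necessarily the approach described in the linked post, and the helper name min_cycles is made up for this example.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on gcc/icc */

/* Illustration only: time a routine several times and keep the minimum,
   which is much less sensitive to interrupts and other system noise than
   a single long run. */
static uint64_t min_cycles(void (*f)(void), int trials) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t start = __rdtsc();
        f();
        uint64_t elapsed = __rdtsc() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}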
Aleksey Bader says:
It’s probably worth enabling the -mavx optimization option on a recent Core i7.
I got a speedup for 32-bit integers: from 0.991442 to 0.438392.
BTW, for 64-bit integers with SSE2 enabled, my best result is 1.073550 (manually unrolled loop).
The rest is pretty similar.
gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Even though my CPU has AVX, gcc uses only the 4-element registers (xmm), even with the -mavx option enabled.
Pretty strange…
I would expect to see the ymm registers being used.
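For context, here is a hedged sketch of what a hand-written ymm (8-wide float) dot product looks like with AVX intrinsics. It assumes n is a multiple of 8, uses plain AVX without FMA, and is not code from the post or the comments.

#include <immintrin.h>
#include <stddef.h>

/* Sketch of an 8-wide (ymm) float dot product with AVX intrinsics.
   Assumes n is a multiple of 8; compile with -mavx. */
float dot_avx(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    /* Reduce the 8 partial sums to a single float. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}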
John Regehr says:
Here’s what I get on an i7-920, which does not have AVX. “current-gcc” is from a few days ago, “icc” is Intel 12.1.0.
$ current-gcc -funroll-loops -Ofast scalar.c ; ./a.out
uint32 = 1.041528
uint64 = 4.266846
uint32 2×2 = 1.347658
uint64 2×2 = 4.050428
float = 0.786304
float 2×2 = 0.785491
double = 10.993460
double 2×2 = 5.679850
$ current-gcc -funroll-loops -Ofast -mno-sse2 scalar.c ; ./a.out
uint32 = 2.007275
uint64 = 2.011980
uint32 2×2 = 2.006285
uint64 2×2 = 2.013779
float = 0.942085
float 2×2 = 2.471436
double = 279.546434
double 2×2 = 142.930305
$ icc -fast scalar.c ; ./a.out
uint32 = 0.766303
uint64 = 3.629643
uint32 2×2 = 1.520190
uint64 2×2 = 3.605145
float = 0.512532
float 2×2 = 0.827342
double = 2.104804
double 2×2 = 2.390149
Aleksey Bader says:
I played a bit with the code and it seems gcc is not good enough to optimize the scalar product with SIMD.
“In theory, the SSE instructions are ideally suited to the computation of scalar products.”
That’s true only for an SoA data layout with no data dependency between loop iterations.
I separated the multiplication of the two arrays from the accumulation: 3.583192 (compare with 10). More than a 2x speedup because of full utilization of the AVX registers.
I’m surprised that gcc failed to optimize such a common problem.
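A minimal sketch of the restructuring Aleksey describes, as I read it: multiply into a temporary buffer in one pass, then accumulate in a second pass, so the multiplication loop has no dependency between iterations. The function name dot_split and the caller-provided tmp buffer are assumptions for the example, not the code actually benchmarked.

#include <stddef.h>

/* Two-pass dot product: an element-wise multiply with no loop-carried
   dependency, followed by a separate accumulation pass. */
float dot_split(const float *a, const float *b, float *tmp, size_t n) {
    for (size_t i = 0; i < n; i++)    /* multiplication pass */
        tmp[i] = a[i] * b[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)    /* accumulation pass */
        sum += tmp[i];
    return sum;
}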
Aleksey Bader says:
Oh…
Just saw the latest update – you’re updating your post too quickly for me.
I seem to need a GCC upgrade because the -Ofast option doesn’t help for 4.6.1.
Thanks.
John Regehr says:
Another thing to ask is, does -ffast-math affect results in a way that matters for this application? I believe -Ofast enables that flag, and perhaps others that may have observable behavior. I’m not sure if icc’s -fast flag has similar consequences.
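A small illustration of why this matters: floating-point addition is not associative, so a flag such as -ffast-math, which lets the compiler reorder sums, can change the computed value.

#include <stdio.h>

/* On a typical x86-64 build, the two groupings below print different
   values, because -1.0e8f + 1.0f rounds back to -1.0e8f in single
   precision. */
int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("%f\n", (a + b) + c);  /* 1.000000 */
    printf("%f\n", a + (b + c));  /* 0.000000 */
    return 0;
}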
Anonymous says:
“The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. ”
In high performance computing, normally you do not compute matrix products directly through scalar products, but rely on blocking as much as you can. The reason is that loading data to the cache is expensive, so you’d rather perform as much computation as you can while they’re in, and for this blocking is more effective than successive scalar products.
As for which is the best algorithm, the general answer for floating-point numbers is “don’t do it yourself, rely on a highly optimized BLAS implementation”. 🙂
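A minimal sketch of the blocking idea (not a real BLAS kernel): multiply small tiles so the data brought into cache is reused many times before being evicted. The tile size of 64, the row-major layout, and the function name are assumptions for the sketch; c must be zero-initialized by the caller.

#include <stddef.h>

/* Blocked (tiled) matrix multiplication, C += A * B, for n x n row-major
   matrices. Real BLAS implementations add packing, register blocking and
   SIMD kernels; this only shows the loop structure. */
void matmul_blocked(const double *a, const double *b, double *c, size_t n) {
    const size_t T = 64;  /* tile size */
    for (size_t i0 = 0; i0 < n; i0 += T)
        for (size_t k0 = 0; k0 < n; k0 += T)
            for (size_t j0 = 0; j0 < n; j0 += T)
                for (size_t i = i0; i < i0 + T && i < n; i++)
                    for (size_t k = k0; k < k0 + T && k < n; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = j0; j < j0 + T && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}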
John Regehr says:
Daniel, I don’t know the implications of -ffast-math. I’m basically an integer person…
Re. BLAS, sure — but it’s still fun (for some of us at least) to understand the interactions between, and limitations of, our compilers and SIMD units.
Daniel Lemire says:
@John
Thanks!
Interesting. Thank you for sharing your results!
@Aleksey
Thanks. I confirm your good results, I will update the blog post.
@John
I’m trying to update my results the best I can.
Who knew that a technical blog post could be so time consuming?
@Aleksey
I know, I don’t usually update posts so fast. Sorry.
@John
Good question. I wonder why it would make a difference for integer arithmetic. Something to do with overflows maybe?
@Anonymous
(1) Scalar products are not limited to linear algebra. Almost all software uses something like a scalar product. Something like this:
sales = price * quantity
is a scalar product. Also, it is often a bottleneck.
Image processing is filled with scalar products. Computer graphics use scalar products all the time.
(2) BLAS implementations almost certainly use something that looks like a scalar product underneath. You may not want to tinker with it, but it will still be impacted by the features of your processors and of the compiler.
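To make point (1) concrete: total revenue over n line items is exactly the scalar product of the price vector with the quantity vector. A minimal sketch (the name total_sales is made up for this example):

#include <stddef.h>

/* Total revenue = sum over items of price[i] * quantity[i],
   i.e., the scalar product of the two vectors. */
double total_sales(const double *price, const double *quantity, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += price[i] * quantity[i];
    return total;
}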