28th November 2011, 2 min read

3 surprising facts about the computation of scalar products

The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. Given two arrays such as (1,2) and (5,3), the scalar product is the sum of products 1 Ã— 5 + 2 Ã— 3. We have strong incentives to compute the scalar product as quickly as possible. Here are a few facts that I find surprising:

Recent processors (e.g., Intel i7) can multiply and add a 32-bit number in a single CPU cycle or less. For each component in the arrays, we need to execute one multiplication and one addition. The latency of the multiplication is at least 3 cycles, meaning that we require 3 cycles to complete a multiplication. Similarly, additions require at least one cycle. Yet processors make aggressive use of pipe-lining: they execute several multiplications simultaneously so that they can produce one result every cycle.
If you work with 64-bit integers and use some recent GNU GCC compilers (e.g., 4.5), you should disable Streaming SIMD Extensions (SSE) for better speed. In theory, the SSE instructions are ideally suited to the computation of scalar products. Yet the throughput with 64-bit integers goes from 1.3 cycle per multiplication with SSE2 disabled to 3.4 cycles per multiplication with SSE2. There is an optimization bug in the otherwise excellent GNU GCC compiler. Update: According to the numbers provided by John Regehr, this problem also affects some Intel compilers.
When using SSE, 64-bit floating point numbers may be faster than 32-bit floating numbers. Years ago I was told to avoid 64-bit floating point numbers for performance reasons. It is not automatically good advice on all compilers especially if you require standards compliance.

For my tests, I initially used the flags “-funroll-loops -O3” on a recent Intel i7-2600 with the GNU GCC compiler version 4.5. In each instance, I have tested with and without manual loop-unrolling and I only report the best score (in cycles per multiplication). The C code is available.

computation	with SSE	SSE2 disabled (-mno-sse2)	with AVX (-mavx)
32-bit integers	1.0	1.1	0.5
64-bit integers	3.4	1.3	3.4
float	10	10	10
double	7.0	170	7.0

Upgrading to GCC 4.6.2 and replacing -O3 by the new -Ofast flag changes the results quite a bit at the expense of reliability (-Ofast disregards standards compliance).

computation	with SSE	SSE2 disabled (-mno-sse2)	with AVX (-mavx)
32-bit integers	1.0	1.1	0.5
64-bit integers	1.0	1.1	1.0
float	0.7	0.9	0.3
double	5.1	2.5	5.0

Further reading: See my previous post on this topic Fast computation of scalar products, and some lessons in optimization

Code: Source code posted on my blog is available from a github repository.