Fast computation of scalar products, and some lessons in optimization
10 thoughts on “Fast computation of scalar products, and some lessons in optimization”
The usual advice I’ve read is to use a linear algebra library, like LAPACK, for this type of operation. Do you know what type of “trick” LAPACK et al. use?
@Weinstein
I have no idea what LAPACK uses. For various reasons, I rarely use numerical libraries.
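For reference, dot products actually live in BLAS, the layer that LAPACK is built on. Here is a minimal sketch of calling the standard CBLAS routine cblas_ddot; it says nothing about what any particular implementation does internally, and whether it beats a plain loop depends entirely on that implementation.

```cpp
#include <cstdio>

// Standard CBLAS declaration for the double-precision dot product.
// Link against any BLAS implementation (e.g. -lblas or -lopenblas).
extern "C" double cblas_ddot(const int N, const double *X, const int incX,
                             const double *Y, const int incY);

int main() {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {4.0, 3.0, 2.0, 1.0};
    // 4 elements, unit stride in both vectors
    double result = cblas_ddot(4, x, 1, y, 1);
    std::printf("x . y = %f\n", result);  // 1*4 + 2*3 + 3*2 + 4*1 = 20
    return 0;
}
```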
@John
Indeed. Sometimes, it is almost as if the computations themselves were free, but the bandwidth is very tight.
@cb
In any case, trying to count the clocks of arithmetic operations is not remotely the way to write fast code these days.
I agree. This was the main point of my post.
This is not really how most modern CPUs work.
We can play semantic games, but that is not much fun. Effectively, while one multiplication is being computed, another multiplication begins so that even though 3 cycles are required to complete an integer multiplication, you can execute two multiplications in 4 cycles. I call this parallelism.
On modern Intel chips multiplication simply takes only 1 clock, no parallelism required, multiplication is just fast.
The latency for multiplication on Intel processors is at least 3 cycles (for integers), and 1 cycle for additions. (Reference: http://www.intel.com/products/processor/manuals/, check “Intel 64 and IA-32 Architectures Optimization Reference Manual”). Floating point multiplications are more expensive (at least 5 cycles). We could get technical and dig into SSE instructions and so on, but that wasn’t my intent. The throughput is often 1 cycle, of course.
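To make the latency-versus-throughput point concrete, here is a minimal sketch (mine, not from the post) of a dot product with two independent accumulators, so that one multiplication does not have to wait for the previous one to finish.

```cpp
#include <cstddef>

// Sketch: two independent accumulators. The two products in each iteration
// do not depend on each other, so the second multiplication can start while
// the first is still in flight. Assumes n is even; names are illustrative.
int dot_two_accumulators(const int *x, const int *y, std::size_t n) {
    int acc0 = 0;
    int acc1 = 0;
    for (std::size_t i = 0; i < n; i += 2) {
        acc0 += x[i] * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    return acc0 + acc1;
}
```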
Another factor is that in a number crunching application, most of the time is spent moving data around, not crunching numbers. Clever software rearranges calculations to do as much with data as it can while the data is in cache.
Once again we are reminded of Knuth:
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”
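As an illustration of the cache point above, here is a sketch (illustrative only) of the classic loop-interchange trick for a matrix product: the i-k-j order streams through rows that are already in cache instead of striding down columns.

```cpp
#include <cstddef>
#include <vector>

// Sketch: C += A * B for row-major n x n matrices.
// In the i-k-j order, A[i*n + k] is reused across the whole inner loop,
// and both B and C are accessed sequentially, so the arithmetic happens
// while the data is still in cache. Assumes C starts out zeroed.
void matmul_ikj(const std::vector<double> &A, const std::vector<double> &B,
                std::vector<double> &C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```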
“It is fundamentally flawed because modern processors are superscalar. It may take 3 cycles to complete a multiplication, but while the multiplication is being computed, other operations can be started. The scalar product is highly parallelizable: while one multiplication is computed, other multiplications can be computed (in parallel).”
This is not really how most modern CPUs work. They almost always have only one multiplier unit, because multipliers are somewhat large on the die.
On some chips multiplication has long latency (7-9 clocks) but still has throughput of 1 mul per clock because of pipelining.
Pipelining is in some ways similar to parallelism but is not quite the same thing.
On modern Intel chips multiplication simply takes only 1 clock, no parallelism required, multiplication is just fast. (division is another story – that’s still slow)
In any case, trying to count the clocks of arithmetic operations is not remotely the way to write fast code these days. Writing fast code is something like:
1. use a better algorithm (use threads)
2. rearrange memory layout for better cache use
3. make data more compact
4. remove branches
5. remove data dependency chains
6. use SIMD (see the sketch below)
7. use CUDA or SPUs or what have you
8. remove divisions, float-to-int conversions, and other very expensive ops like that
We just never worry about things like multiplies; they are a tiny fraction of a rounding error in overall speed.
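To illustrate point 6, here is a minimal SSE sketch of a dot product over floats (my own illustration, not code from the commenter or the post). It assumes the length is a multiple of four and that the compiler targets SSE, which is the default on x86-64.

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of point 6: a dot product using SSE intrinsics,
// four multiply-adds per loop iteration. Assumes n is a multiple of 4.
float dot_sse(const float *x, const float *y, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 xv = _mm_loadu_ps(x + i);  // load 4 floats from x
        __m128 yv = _mm_loadu_ps(y + i);  // load 4 floats from y
        acc = _mm_add_ps(acc, _mm_mul_ps(xv, yv));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);            // horizontal sum of the 4 lanes
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```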
I think a better justification for algorithms that focus on reducing multiplications (as so often show up in the literature on algebraic complexity theory) is: what if your “numbers” aren’t just single-precision ints? For example, what if the entries in your vector are polynomials? I believe (though I have not tested) that multiplying polynomials is significantly more time-consuming in practice than adding them. I believe this is also true for “big ints” (multi-precision integers, as used e.g. in cryptography), but again I have not tested this.
The point of your post is still well-taken, I just wanted to add that there may be other situations that arise where the multiplication-minimizing algorithm actually does perform better.
@Josh: yes, multiplying two big-ints is more expensive than adding them, exactly like polynomials. In fact, the two problems are pretty similar – bigints are “more or less” polynomials in the variable x=10…
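A tiny sketch of why that is, with polynomials stored as dense coefficient vectors (illustrative code only): addition touches each coefficient once, while the schoolbook product pairs every coefficient of one polynomial with every coefficient of the other.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Polynomials as dense coefficient vectors (index = power of x).
// Addition touches each coefficient once: O(n).
std::vector<long long> poly_add(const std::vector<long long> &a,
                                const std::vector<long long> &b) {
    std::vector<long long> c(std::max(a.size(), b.size()), 0);
    for (std::size_t i = 0; i < a.size(); ++i) c[i] += a[i];
    for (std::size_t i = 0; i < b.size(); ++i) c[i] += b[i];
    return c;
}

// Schoolbook multiplication pairs every coefficient with every other: O(n^2).
// Assumes both inputs are non-empty.
std::vector<long long> poly_mul(const std::vector<long long> &a,
                                const std::vector<long long> &b) {
    std::vector<long long> c(a.size() + b.size() - 1, 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i + j] += a[i] * b[j];
    return c;
}
```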
Is that the 1968 Winograd algorithm? By the way, your math symbols are totally broken, which makes following the formulas really hard.
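For what it’s worth, if this is Winograd’s 1968 inner-product trick, the idea (as I understand it, not necessarily the exact formulation in the post) is to trade multiplications for additions by pairing up coordinates. A sketch for even-length vectors:

```cpp
#include <cstddef>

// Winograd (1968), as I understand it: for even n,
//   sum_i x[i]*y[i] = sum_j (x[2j] + y[2j+1]) * (x[2j+1] + y[2j]) - xi - eta
// with xi = sum_j x[2j]*x[2j+1] and eta = sum_j y[2j]*y[2j+1].
// xi and eta each depend on a single vector, so when many dot products share
// the same vectors (as in matrix multiplication) they can be precomputed,
// roughly halving the number of multiplications per dot product.
double winograd_dot(const double *x, const double *y, std::size_t n) {
    double xi = 0.0, eta = 0.0, cross = 0.0;
    for (std::size_t j = 0; j + 1 < n; j += 2) {
        xi    += x[j] * x[j + 1];
        eta   += y[j] * y[j + 1];
        cross += (x[j] + y[j + 1]) * (x[j + 1] + y[j]);
    }
    return cross - xi - eta;
}
```

Computed in isolation like this it actually uses more multiplications; the savings only appear when xi and eta are precomputed once per vector and reused across many dot products.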