Fast computation of scalar products, and some lessons in optimization
10 thoughts on “Fast computation of scalar products, and some lessons in optimization”
The usual advice I’ve read is to use a linear algebra library, like LAPACK, for this type of operation. Do you know what type of “trick” LAPACK et al. use?
@Weinstein
I have no idea what LAPACK uses. For various reasons, I rarely use numerical libraries.
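For reference, dot products actually live in BLAS, the layer that LAPACK is built on. Here is a minimal sketch of calling the standard CBLAS routine cblas_ddot; it says nothing about what any particular implementation does internally, and whether it beats a plain loop depends entirely on that implementation.

```cpp
#include <cstdio>

// Standard CBLAS declaration for the double-precision dot product.
// Link against any BLAS implementation (e.g. -lblas or -lopenblas).
extern "C" double cblas_ddot(const int N, const double *X, const int incX,
                             const double *Y, const int incY);

int main() {
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {4.0, 3.0, 2.0, 1.0};
    // 4 elements, unit stride in both vectors
    double result = cblas_ddot(4, x, 1, y, 1);
    std::printf("x . y = %f\n", result);  // 1*4 + 2*3 + 3*2 + 4*1 = 20
    return 0;
}
```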
@John
Indeed. Sometimes, it is almost as if the computations themselves were free, but the bandwidth is very tight.
@cb
In any case, trying to count the clocks of arithmetic operations is not remotely the way to write fast code these days.
I agree. This was the main point of my post.
This is not really how most modern CPUs work.
We can play semantic games, but that is not much fun. Effectively, while one multiplication is being computed, another multiplication begins so that even though 3 cycles are required to complete an integer multiplication, you can execute two multiplications in 4 cycles. I call this parallelism.
On modern Intel chips multiplication simply takes only 1 clock, no parallelism required, multiplication is just fast.
The latency for multiplication on Intel processors is at least 3 cycles (for integers), and 1 cycle for additions. (Reference: http://www.intel.com/products/processor/manuals/, check “Intel 64 and IA-32 Architectures Optimization Reference Manual”). Floating point multiplications are more expensive (at least 5 cycles). We could get technical and dig into SSE instructions and so on, but that wasn’t my intent. The throughput is often 1 cycle, of course.
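To make the latency-versus-throughput point concrete, here is a minimal sketch (mine, not from the post) of a dot product with two independent accumulators, so that one multiplication does not have to wait for the previous one to finish.

```cpp
#include <cstddef>

// Sketch: two independent accumulators. The two products in each iteration
// do not depend on each other, so the second multiplication can start while
// the first is still in flight. Assumes n is even; names are illustrative.
int dot_two_accumulators(const int *x, const int *y, std::size_t n) {
    int acc0 = 0;
    int acc1 = 0;
    for (std::size_t i = 0; i < n; i += 2) {
        acc0 += x[i] * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    return acc0 + acc1;
}
```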
Another factor is that in a number crunching application, most of the time is spent moving data around, not crunching numbers. Clever software rearranges calculations to do as much with data as it can while the data is in cache.
Once again we are reminded of Knuth:
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”
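As an illustration of the cache point above, here is a sketch (illustrative only) of the classic loop-interchange trick for a matrix product: the i-k-j order streams through rows that are already in cache instead of striding down columns.

```cpp
#include <cstddef>
#include <vector>

// Sketch: C += A * B for row-major n x n matrices.
// In the i-k-j order, A[i*n + k] is reused across the whole inner loop,
// and both B and C are accessed sequentially, so the arithmetic happens
// while the data is still in cache. Assumes C starts out zeroed.
void matmul_ikj(const std::vector<double> &A, const std::vector<double> &B,
                std::vector<double> &C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```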
“It is fundamentally flawed because modern processors are superscalar. It may take 3 cycles to complete a multiplication, but while the multiplication is being computed, other operations can be started. The scalar product is highly parallelizable: while one multiplication is computed, other multiplications can be computed (in parallel).”
This is not really how most modern CPUs work. They almost always have only one multiplier unit, because multipliers are somewhat large on the die.
On some chips multiplication has long latency (7-9 clocks) but still has throughput of 1 mul per clock because of pipelining.
Pipelining is in some ways similar to parallelism but is not quite the same thing.
On modern Intel chips multiplication simply takes only 1 clock, no parallelism required, multiplication is just fast. (division is another story – that’s still slow)
In any case, trying to count the clocks of arithmetic operations is not remotely the way to write fast code these days. Writing fast code is something like:
1. use a better algorithm (use threads)
2. rearrange memory layout for better cache use
3. make data more compact
4. remove branches
5. remove data dependency chains
6. use SIMD (see the sketch below)
7. use CUDA or SPUs or what have you
8. remove divisions, float-to-int conversions, and other very expensive ops like that
We just never worry about things like multiplies; they are a tiny fraction of a rounding error in overall speed.
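To illustrate point 6, here is a minimal SSE sketch of a dot product over floats (my own illustration, not code from the commenter or the post). It assumes the length is a multiple of four and that the compiler targets SSE, which is the default on x86-64.

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of point 6: a dot product using SSE intrinsics,
// four multiply-adds per loop iteration. Assumes n is a multiple of 4.
float dot_sse(const float *x, const float *y, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 xv = _mm_loadu_ps(x + i);  // load 4 floats from x
        __m128 yv = _mm_loadu_ps(y + i);  // load 4 floats from y
        acc = _mm_add_ps(acc, _mm_mul_ps(xv, yv));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);            // horizontal sum of the 4 lanes
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```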
I think a better justification for algorithms that focus on reducing multiplications (as so often show up in the literature on algebraic complexity theory) is: what if your “numbers” aren’t just single-precision ints? For example, what if the entries in your vector are polynomials? I believe (though I have not tested) that multiplying polynomials is significantly more time-consuming in practice than adding them. I believe this is also true for “big ints” (multi-precision integers, as used e.g. in cryptography), but again I have not tested this.
The point of your post is still well-taken, I just wanted to add that there may be other situations that arise where the multiplication-minimizing algorithm actually does perform better.
@Josh: yes, multiplying two big-ints is more expensive than adding them, exactly like polynomials. In fact, the two problems are pretty similar – bigints are “more or less” polynomials in the variable x=10…
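A tiny sketch of why that is, with polynomials stored as dense coefficient vectors (illustrative code only): addition touches each coefficient once, while the schoolbook product pairs every coefficient of one polynomial with every coefficient of the other.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Polynomials as dense coefficient vectors (index = power of x).
// Addition touches each coefficient once: O(n).
std::vector<long long> poly_add(const std::vector<long long> &a,
                                const std::vector<long long> &b) {
    std::vector<long long> c(std::max(a.size(), b.size()), 0);
    for (std::size_t i = 0; i < a.size(); ++i) c[i] += a[i];
    for (std::size_t i = 0; i < b.size(); ++i) c[i] += b[i];
    return c;
}

// Schoolbook multiplication pairs every coefficient with every other: O(n^2).
// Assumes both inputs are non-empty.
std::vector<long long> poly_mul(const std::vector<long long> &a,
                                const std::vector<long long> &b) {
    std::vector<long long> c(a.size() + b.size() - 1, 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i + j] += a[i] * b[j];
    return c;
}
```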
Is that the 1968 Winograd algorithm? By the way, your math symbols are totally broken, which makes following the formulas really hard.
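For what it’s worth, if this is Winograd’s 1968 inner-product trick, the idea (as I understand it, not necessarily the exact formulation in the post) is to trade multiplications for additions by pairing up coordinates. A sketch for even-length vectors:

```cpp
#include <cstddef>

// Winograd (1968), as I understand it: for even n,
//   sum_i x[i]*y[i] = sum_j (x[2j] + y[2j+1]) * (x[2j+1] + y[2j]) - xi - eta
// with xi = sum_j x[2j]*x[2j+1] and eta = sum_j y[2j]*y[2j+1].
// xi and eta each depend on a single vector, so when many dot products share
// the same vectors (as in matrix multiplication) they can be precomputed,
// roughly halving the number of multiplications per dot product.
double winograd_dot(const double *x, const double *y, std::size_t n) {
    double xi = 0.0, eta = 0.0, cross = 0.0;
    for (std::size_t j = 0; j + 1 < n; j += 2) {
        xi    += x[j] * x[j + 1];
        eta   += y[j] * y[j + 1];
        cross += (x[j] + y[j + 1]) * (x[j + 1] + y[j]);
    }
    return cross - xi - eta;
}
```

Computed in isolation like this it actually uses more multiplications; the savings only appear when xi and eta are precomputed once per vector and reused across many dot products.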