Daniel Lemire's blog

, 12 min read

3 surprising facts about the computation of scalar products

18 thoughts on “3 surprising facts about the computation of scalar products”

  1. @John


  2. Zeno says:

    Interesting. Thank you for sharing your results!

  3. @Aleksey

    Thanks. I confirm your good results, I will update the blog post.

  4. @John

    I’m trying to update my results the best I can.

    Who knew that a technical blog post could be so time consuming?

  5. @Aleksey

    I know, I don’t usually update posts so fast. Sorry.

  6. @John

    Good question. I wonder why it would make a difference for integer arithmetic. Something to do with overflows maybe?

  7. @Anonymous

    (1) Scalar products are not limited to linear algebra. Almost all software uses something like a scalar product. Something like this:

    sales = price * quantity

    is a scalar product. Also, it is often a bottleneck.

    Image processing is filled with scalar products. Computer graphics use scalar products all the time.

    (2) BLAS implementations almost certainly use something that looks like a scalar product underneath. You may not want to tinker with it, but it will still be impacted by the features of your processors and of the compiler.

  8. Stanley Lee says:

    Wow, there are a lot of intricate details to know here that could be useful. Thanks for sharing!

  9. John Regehr says:

    Nice to see measurement-based blogging :).

    When compiling any code of this sort, you should try the Intel compiler as well.

    10,000 reps may be enough for your simple rdtsc to give reliable results, but I wouldn’t count on it. It’s easy to do better:


  10. Aleksey Bader says:

    It’s probably worth to enable -mavx optimization option on recent core i7.
    I’ve got speed up for 32-bit integers: from 0.991442 to 0.438392.

  11. Aleksey Bader says:

    BTW, for 64-bit integer with SSE2 enabled I have best result – 1.073550 (manually unrolled loop).
    The rest is pretty similar.

    gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)
    Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

  12. John Regehr says:

    Here’s what I get on an i7-920, which does not have AVX. “current-gcc” is from a few days ago, “icc” is Intel 12.1.0.

    $ current-gcc -funroll-loops -Ofast scalar.c ; ./a.out
    uint32 = 1.041528
    uint64 = 4.266846
    uint32 2×2 = 1.347658
    uint64 2×2 = 4.050428
    float = 0.786304
    float 2×2 = 0.785491
    double = 10.993460
    double 2×2 = 5.679850

    $ current-gcc -funroll-loops -Ofast -mno-sse2 scalar.c ; ./a.out
    uint32 = 2.007275
    uint64 = 2.011980
    uint32 2×2 = 2.006285
    uint64 2×2 = 2.013779
    float = 0.942085
    float 2×2 = 2.471436
    double = 279.546434
    double 2×2 = 142.930305

    $ icc -fast scalar.c ; ./a.out
    uint32 = 0.766303
    uint64 = 3.629643
    uint32 2×2 = 1.520190
    uint64 2×2 = 3.605145
    float = 0.512532
    float 2×2 = 0.827342
    double = 2.104804
    double 2×2 = 2.390149

  13. Aleksey Bader says:

    Even though my CPU has AVX, gcc utilizes only 4-element registers (xmm) even with -mavx option enabled.
    Pretty strange…
    I would expect to see ymm registers utilization.

  14. Aleksey Bader says:

    I played a bit with the code and it seems gcc is not good enough to optimize scalar product with SIMD.
    “In theory, the SSE instructions are ideally suited to the computation of scalar products.”
    It’s true only for SOA data layout and there is no data dependency between loop iterations.
    I separated multiplication of two arrays and accumulation: 3.583192 (compare with 10). More then 2x speed up because of full utilization of AVX registers.
    I’m surprised that gcc failed to optimize so common problem.

  15. Aleksey Bader says:

    Did see the latests update – you updating your post too quick for me.
    I seem to need GCC upgrade because -Ofast option doesn’t help for 4.6.1.

  16. John Regehr says:

    Another thing to ask is, does -ffast-math affect results in a way that matters for this application? I believe -Ofast enables that flag, and perhaps others that may have observable behavior. I’m not sure if icc’s -fast flag has similar consequences.

  17. Anonymous says:

    “The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. ”

    In high performance computing, normally you do not compute matrix products directly through scalar products, but rely on blocking as much as you can. The reason is that loading data to the cache is expensive, so you’d rather perform as much computation as you can while they’re in, and for this blocking is more effective than successive scalar products.

    As for which is the best algorithm, the general answer for floating point numbers is “don’t do it yourself, rely a highly optimized BLAS implementation”. 🙂

  18. John Regehr says:

    Daniel, I don’t know the implications of -ffast-math. I’m basically an integer person…

    Re. BLAS, sure — but it’s still fun (for some of us at least) to understand the interactions between, and limitations of, our compilers and SIMD units.