Stanley Lee says:
Wow, there are a lot of intricate details to know here that could be useful. Thanks for sharing!
John Regehr says:
Nice to see measurement-based blogging :).
When compiling any code of this sort, you should try the Intel compiler as well.
10,000 reps may be enough for your simple rdtsc to give reliable results, but I wouldn’t count on it. It’s easy to do better:
http://blog.regehr.org/archives/330
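For a concrete illustration, here is a minimal sketch of one common way to get steadier timings — repeat the measurement and keep the minimum. It is only an illustration, not necessarily the approach described in the linked post, and the helper name min_cycles is made up for this example.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on gcc/icc */

/* Illustration only: time a routine several times and keep the minimum,
   which is much less sensitive to interrupts and other system noise than
   a single long run. */
static uint64_t min_cycles(void (*f)(void), int trials) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t start = __rdtsc();
        f();
        uint64_t elapsed = __rdtsc() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}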
Aleksey Bader says:
It’s probably worth enabling the -mavx optimization option on a recent Core i7.
I got a speedup for 32-bit integers: from 0.991442 to 0.438392.
BTW, for 64-bit integers with SSE2 enabled, my best result is 1.073550 (manually unrolled loop).
The rest is pretty similar.
gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Even though my CPU has AVX, gcc uses only the 4-element registers (xmm), even with the -mavx option enabled.
Pretty strange…
I would expect to see the ymm registers being used.
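For context, here is a hedged sketch of what a hand-written ymm (8-wide float) dot product looks like with AVX intrinsics. It assumes n is a multiple of 8, uses plain AVX without FMA, and is not code from the post or the comments.

#include <immintrin.h>
#include <stddef.h>

/* Sketch of an 8-wide (ymm) float dot product with AVX intrinsics.
   Assumes n is a multiple of 8; compile with -mavx. */
float dot_avx(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    /* Reduce the 8 partial sums to a single float. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}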
John Regehr says:
Here’s what I get on an i7-920, which does not have AVX. “current-gcc” is from a few days ago, “icc” is Intel 12.1.0.
$ current-gcc -funroll-loops -Ofast scalar.c ; ./a.out
uint32 = 1.041528
uint64 = 4.266846
uint32 2×2 = 1.347658
uint64 2×2 = 4.050428
float = 0.786304
float 2×2 = 0.785491
double = 10.993460
double 2×2 = 5.679850
$ current-gcc -funroll-loops -Ofast -mno-sse2 scalar.c ; ./a.out
uint32 = 2.007275
uint64 = 2.011980
uint32 2×2 = 2.006285
uint64 2×2 = 2.013779
float = 0.942085
float 2×2 = 2.471436
double = 279.546434
double 2×2 = 142.930305
$ icc -fast scalar.c ; ./a.out
uint32 = 0.766303
uint64 = 3.629643
uint32 2×2 = 1.520190
uint64 2×2 = 3.605145
float = 0.512532
float 2×2 = 0.827342
double = 2.104804
double 2×2 = 2.390149
Aleksey Bader says:
I played a bit with the code and it seems gcc is not good enough to optimize the scalar product with SIMD.
“In theory, the SSE instructions are ideally suited to the computation of scalar products.”
That’s true only for an SoA data layout with no data dependency between loop iterations.
I separated the multiplication of the two arrays from the accumulation: 3.583192 (compare with 10). More than a 2x speedup because of full utilization of the AVX registers.
I’m surprised that gcc failed to optimize such a common problem.
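A minimal sketch of the restructuring Aleksey describes, as I read it: multiply into a temporary buffer in one pass, then accumulate in a second pass, so the multiplication loop has no dependency between iterations. The function name dot_split and the caller-provided tmp buffer are assumptions for the example, not the code actually benchmarked.

#include <stddef.h>

/* Two-pass dot product: an element-wise multiply with no loop-carried
   dependency, followed by a separate accumulation pass. */
float dot_split(const float *a, const float *b, float *tmp, size_t n) {
    for (size_t i = 0; i < n; i++)    /* multiplication pass */
        tmp[i] = a[i] * b[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)    /* accumulation pass */
        sum += tmp[i];
    return sum;
}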
Aleksey Bader says:
Oh…
Just saw the latest update – you’re updating your post too quickly for me.
I seem to need a GCC upgrade because the -Ofast option doesn’t help for 4.6.1.
Thanks.
John Regehr says:
Another thing to ask is, does -ffast-math affect results in a way that matters for this application? I believe -Ofast enables that flag, and perhaps others that may have observable behavior. I’m not sure if icc’s -fast flag has similar consequences.
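A small illustration of why this matters: floating-point addition is not associative, so a flag such as -ffast-math, which lets the compiler reorder sums, can change the computed value.

#include <stdio.h>

/* On a typical x86-64 build, the two groupings below print different
   values, because -1.0e8f + 1.0f rounds back to -1.0e8f in single
   precision. */
int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("%f\n", (a + b) + c);  /* 1.000000 */
    printf("%f\n", a + (b + c));  /* 0.000000 */
    return 0;
}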
Anonymous says:
“The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. ”
In high performance computing, normally you do not compute matrix products directly through scalar products, but rely on blocking as much as you can. The reason is that loading data to the cache is expensive, so you’d rather perform as much computation as you can while they’re in, and for this blocking is more effective than successive scalar products.
As for which is the best algorithm, the general answer for floating-point numbers is “don’t do it yourself, rely on a highly optimized BLAS implementation”. 🙂
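A minimal sketch of the blocking idea (not a real BLAS kernel): multiply small tiles so the data brought into cache is reused many times before being evicted. The tile size of 64, the row-major layout, and the function name are assumptions for the sketch; c must be zero-initialized by the caller.

#include <stddef.h>

/* Blocked (tiled) matrix multiplication, C += A * B, for n x n row-major
   matrices. Real BLAS implementations add packing, register blocking and
   SIMD kernels; this only shows the loop structure. */
void matmul_blocked(const double *a, const double *b, double *c, size_t n) {
    const size_t T = 64;  /* tile size */
    for (size_t i0 = 0; i0 < n; i0 += T)
        for (size_t k0 = 0; k0 < n; k0 += T)
            for (size_t j0 = 0; j0 < n; j0 += T)
                for (size_t i = i0; i < i0 + T && i < n; i++)
                    for (size_t k = k0; k < k0 + T && k < n; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = j0; j < j0 + T && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}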
John Regehr says:
Daniel, I don’t know the implications of -ffast-math. I’m basically an integer person…
Re. BLAS, sure — but it’s still fun (for some of us at least) to understand the interactions between, and limitations of, our compilers and SIMD units.
Daniel Lemire says:
@John
Thanks!
Interesting. Thank you for sharing your results!
@Aleksey
Thanks. I confirm your good results, I will update the blog post.
@John
I’m trying to update my results the best I can.
Who knew that a technical blog post could be so time consuming?
@Aleksey
I know, I don’t usually update posts so fast. Sorry.
@John
Good question. I wonder why it would make a difference for integer arithmetic. Something to do with overflows maybe?
@Anonymous
(1) Scalar products are not limited to linear algebra. Almost all software uses something like a scalar product. Something like this:
sales = price * quantity
is a scalar product. Also, it is often a bottleneck.
Image processing is filled with scalar products. Computer graphics use scalar products all the time.
(2) BLAS implementations almost certainly use something that looks like a scalar product underneath. You may not want to tinker with it, but it will still be impacted by the features of your processors and of the compiler.
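To make point (1) concrete: total revenue over n line items is exactly the scalar product of the price vector with the quantity vector. A minimal sketch (the name total_sales is made up for this example):

#include <stddef.h>

/* Total revenue = sum over items of price[i] * quantity[i],
   i.e., the scalar product of the two vectors. */
double total_sales(const double *price, const double *quantity, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += price[i] * quantity[i];
    return total;
}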