Superscalar architectures are significantly underexploited in practice, both by compilers and by the microarchitectures themselves. You gain some benefit automatically, but automatic exploitation still misses a large fraction of the internal parallelism available when a core has several ALUs.
There is an informal discipline around writing C/C++ so that it exposes the ALU parallelism to the CPU while remaining a trivial, semantically null reorganization of the code. Few people actually write code this way; the idioms sit outside how most programmers have learned to write code, and most do not even know the technique exists. Nonetheless, coding for ALU saturation reliably delivers 1.5-2x the performance of what would otherwise be considered optimized C. The uplift is particularly noticeable on recent Intel cores such as Haswell.
Our exploitation of superscalar microarchitectures has only gotten worse as the number and complexity of the ALUs have increased.
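To make the idiom concrete, here is a minimal sketch (an illustration added here, not code from the thread): summing an array with several independent accumulators, so the adds in each iteration form no dependency chain and can issue to separate ALUs in the same cycle.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the previous one, so the
   loop is limited by add latency, not by the number of ALUs. */
uint64_t sum_serial(const uint32_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Four independent accumulators: the four adds per iteration do not
   depend on each other, so a superscalar core can issue them to
   separate ALUs in parallel. Same result, same number of adds. */
uint64_t sum_ilp(const uint32_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];  /* leftover tail */
    return s0 + s1 + s2 + s3;
}
```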
@Andrew
Agreed.
I would add that you do not need to write in C to optimize for superscalar execution. I bet that you can measure the effects in JavaScript these days…
Sounds interesting 🙂 Where can I find more info on this way of coding?
@Ingo
It is a good question.
Great article. For anyone looking to learn techniques and examples of creative uses of SIMD code, I have had a lot of luck looking up literature on game engines. An example is this GDC talk from an engineer at Insomniac Games: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf
Game engines are kind of ideal for SIMD usage given that they do the same work on a large number of objects over and over. Personally, I have had pretty decent luck (more hits than misses) employing SIMD in general applications to speed things up. One of the really nice things SIMD code does for you is make you really think about your memory layout. In my experience, this kind of data-oriented design thinking leads to great speedups even when it turns out that you cannot use SIMD.
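A minimal sketch of the memory-layout point (an illustration added here, assuming SSE2 and hypothetical particle types): a struct-of-arrays layout lets four x coordinates load straight into one register, where an array-of-structs layout would interleave fields and force shuffles.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Array-of-structs: x, y, z interleaved; a SIMD load grabs mixed
   fields, so you must shuffle before you can compute. */
struct particle_aos { float x, y, z, pad; };

/* Struct-of-arrays: each field contiguous; four x's load directly
   into one 128-bit register. */
struct particles_soa { float *x, *y, *z; };

/* Advance all particles along x by dt * vx, four at a time. */
void advance_x(struct particles_soa *p, const float *vx,
               float dt, size_t n) {
    __m128 vdt = _mm_set1_ps(dt);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(p->x + i);
        __m128 v = _mm_loadu_ps(vx + i);
        x = _mm_add_ps(x, _mm_mul_ps(v, vdt));
        _mm_storeu_ps(p->x + i, x);
    }
    for (; i < n; i++) p->x[i] += vx[i] * dt;  /* tail */
}
```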
Rajiv says:
I have also found SIMD to be of great use in encoding/decoding columnar data. Given that most applications (even ones without any disk/network IO) are memory-bandwidth bound, encoding columns of data to make them smaller is often a win. SIMD lends itself well to interesting ways of encoding and decoding such data on the fly without resorting to full-blown compression, which in many cases is far too expensive.
Daniel, I'd love to see more articles from you about creative uses of SIMD. You seem to have published a lot of papers where you've used SIMD instructions to great effect. Great article and hope for more.
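One common instance of such lightweight encoding (a sketch added here for illustration, not Rajiv's code): delta-encode a sorted 32-bit column, then decode with an in-register prefix sum, four deltas per step.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Decode a delta-encoded 32-bit column: out[i] = out[i-1] + delta[i],
   with an implicit out[-1] = 0. The shift-and-add prefix sum turns
   four deltas into four running sums in two steps. */
void delta_decode(uint32_t *out, const uint32_t *delta, size_t n) {
    __m128i running = _mm_setzero_si128();  /* last value, all lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i d = _mm_loadu_si128((const __m128i *)(delta + i));
        /* prefix sum within the register */
        d = _mm_add_epi32(d, _mm_slli_si128(d, 4));
        d = _mm_add_epi32(d, _mm_slli_si128(d, 8));
        d = _mm_add_epi32(d, running);  /* carry from previous block */
        _mm_storeu_si128((__m128i *)(out + i), d);
        /* broadcast the last lane as the new running value */
        running = _mm_shuffle_epi32(d, _MM_SHUFFLE(3, 3, 3, 3));
    }
    uint32_t last = i ? out[i - 1] : 0;
    for (; i < n; i++) { last += delta[i]; out[i] = last; }  /* tail */
}
```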
Philippecp says:
I've been interested in this paper and library for a while. Unfortunately, the code isn't portable: it doesn't compile in MSVC. This is mostly because of a few lines of inline assembly used instead of intrinsics, and some compiler specifics (__builtin_expect, etc.). Is there any plan to make this library portable so it can be more widely used?
@Andrew
Are you referring to the rules that allow for auto-vectorization by the compiler, like the ones mentioned in this blog series:
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/12/auto-vectorizer-in-visual-studio-11-overview.aspx
@Philippecp
As I do not program using MSVC, I rely on contributors (like you) to help with portability.
We have MSVC support for other projects, such as this one…
https://github.com/lemire/FastPFor
With cmake, it is reasonably easy to make portable builds.
I believe that someone could probably get the project to build under MSVC with a few hours of work. If you are interested in helping out, please get in touch!
@Philippecp
I have checked in a revision. The code no longer relies on inline assembly and __builtin_expect, so it should definitely be easier to use with MSVC.
We still lack the build routines that could be provided with cmake.
Bogdan Burlacu says:
Hi,
How about intersecting arrays of 64-bit integers? Any special trick, or just replace the _mm128 instructions with their _mm256 counterparts, the f suffix for float with the d suffix for double, and so on?
Thanks,
Bogdan
Using 64-bit integers reduces the benefits of SIMD instructions. I would first try to re-engineer the problem.
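For context on why the benefit shrinks (a sketch added here, not from the thread): an AVX2 register holds only four 64-bit lanes versus eight 32-bit lanes, so each comparison probes half as many candidates. A minimal membership probe over 64-bit keys:

```c
#include <immintrin.h>  /* AVX2 */
#include <stdint.h>

/* Test whether `key` appears among four 64-bit candidates at `block`.
   With 32-bit keys, the same 256-bit register would test eight. */
static int contains4_epi64(const uint64_t *block, uint64_t key) {
    __m256i candidates = _mm256_loadu_si256((const __m256i *)block);
    __m256i needle = _mm256_set1_epi64x((long long)key);
    __m256i eq = _mm256_cmpeq_epi64(candidates, needle);
    return _mm256_movemask_epi8(eq) != 0;  /* any lane matched? */
}
```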
Bogdan Burlacu says:
Thank you for your answer.
Unfortunately, that is not possible: the values come from hashing a very large number of objects, where the number of unique objects might exceed the range of a 32-bit integer. But I am still exploring possible ways to re-engineer the problem. I think that even without a large benefit, a SIMD implementation would still be faster than the scalar version.
If you have 64-bit hash values, what is the probability that they will collide on their least significant 32 bits? Is it 1/2^32? Or 2e-8 percent?
Bogdan Burlacu says:
The author of XXHash (the one I use) claims that “no bit can be used as a possible predictor for another bit.”
So my initial guess would be 1/2^32.
However, I did a practical test and I got:
1.161% collision rate on the lower 32-bits for 100M unique 64-bit hash values
2.305% collision rate for 200M
3.431% collision rate for 300M
Which leaves me somewhere in the 1e-8 ballpark.
I am considering whether this is an artifact of how I’m doing the hashing. But this means that the SIMD 32-bit intersection would be somewhat lossy.
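For what it's worth, those measured rates match a standard birthday-bound estimate (a back-of-the-envelope calculation added here, not part of the original exchange). With n values truncated to their lower 32 bits, i.e. hashed into d = 2^32 buckets:

```latex
% Expected number of distinct lower-32-bit values among n draws
% from d = 2^{32} equally likely buckets:
\[
\mathbb{E}[\mathrm{distinct}]
  = d\left(1 - \left(1 - \tfrac{1}{d}\right)^{n}\right)
  \approx n - \frac{n^{2}}{2d},
\]
% so the expected fraction of values lost to collisions is about
\[
\frac{n - \mathbb{E}[\mathrm{distinct}]}{n}
  \approx \frac{n}{2d} = \frac{n}{2^{33}}.
\]
```

Plugging in n = 10^8, 2×10^8, and 3×10^8 gives roughly 1.16%, 2.33%, and 3.49%, close to the measured 1.161%, 2.305%, and 3.431%. So the collision rates look like ordinary birthday-bound behavior rather than an artifact of the hashing.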
Please see my post: Are 64-bit random identifiers free from collision? https://lemire.me/blog/2019/12/12/are-64-bit-random-identifiers-free-from-collision/