25th March 2015, 12 min read

Accelerating intersections with SIMD instructions

16 thoughts on “Accelerating intersections with SIMD instructions”

J. Andrew Rogers says:

March 25, 2015 at 12:56 pm

Superscalar architectures are significantly underexploited in practice, both by compilers and the microarchitectures themselves. While you gain some benefit automatically, the automatic exploitation still misses a large percentage of the internal parallelism available when you have several ALUs per core.

There is an informal discipline around writing C/C++ such that it exposes the ALU parallelism to the CPU while being a trivial and null reorganization of the code. Few people actually write code this way and the idioms are a bit outside the way programmers have learned to write their code; most programmers do not even know how to do it. Nonetheless, coding for ALU saturation reliably delivers 1.5-2x performance relative to what would otherwise be considered to be optimized C algorithms. This kind of uplift is particularly noticeable on more recent Intel cores like Haswell.

Exploiting superscalar microarchitectures has become worse as the number and complexity of the ALUs has increased.
Daniel Lemire says:

March 25, 2015 at 1:24 pm

@Andrew

Agreed.

I would add that you do not need to write in C to optimize for superscalar execution. I bet that you can measure the effects in JavaScript these days…
Ingo LÃ¼tkebohle says:

March 26, 2015 at 4:04 am

Sounds interesting 🙂 where can i find more info on this way of coding?
Daniel Lemire says:

March 26, 2015 at 2:28 pm

@Ingo

It is a good question.
Rajiv says:

March 26, 2015 at 8:02 pm

Great article. For any one looking to find out about techniques/examples of creative uses of SIMD code, I have had a lot of luck looking up literature on gaming engines. An example here is a talk at GDC from an engineer at Insomniac games: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf

Game engines are kind of ideal for SIMD usage given they do the same work on a large number of objects over and over. Personally I have had pretty decent luck (more hits than misses) in employing SIMD code in general applications to speed things up. One of the really nice things SIMD code does for you is to make you really think about your memory layout. This kind of data oriented design thinking in my experience leads to great speedups even when it turns out that you cannot use SIMD.
Rajiv says:

March 26, 2015 at 8:08 pm

I have also found SIMD to be of great use in encoding/decoding columnar data . Given most applications (even the ones without any disk/network IO) are memory bandwidth bound, encoding columns of data to make such smaller is often a win. SIMD lends itself pretty well for interesting ways to encode and decode such data on the fly without having to resort to full blown compression, which in many cases is way too expensive.
Daniel I’d love to see more articles from you about creative uses of SIMD. You seem to have published a lot of papers where you’ve used SIMD instructions to great effect. Great article and hope for more.
Philippecp says:

March 26, 2015 at 10:58 pm

I’ve been interested in this paper and library for a while. Unfortunately the code isn’t portable as it doesn’t compile in MSVC. This is mostly because a few lines of inline assembly instead of intrinsics and some compiler specifics (__builtin_expect, etc.). Is there any plan to make this library portable so it can be more widely used?
Philippecp says:

March 26, 2015 at 11:02 pm

@Andrew
Are you referring to the rules that allow for auto vectorization by the compiler like the ones mentioned in this block series:
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/12/auto-vectorizer-in-visual-studio-11-overview.aspx
Daniel Lemire says:

March 27, 2015 at 8:49 am

@Philippecp

As I do not program using MSVC, I rely on contributors (like you) to help with portability.

We have MSVC support for other projects, such as this one…

https://github.com/lemire/FastPFor

With cmake, it is reasonably easy to make portable builds.

I believe that someone could probably get the project to build under MSVC with a few hours of work. If you are interested in helping out, please get in touch!
Daniel Lemire says:

March 30, 2015 at 11:34 am

@Philippecp

I have checked in a revision. The code no longer relies on inline assembly and __builtin_expect. So it should definitively be easier to use it with MSVC.

We still lack the build routines which could be provided with cmake.
Bogdan Burlacu says:

December 12, 2019 at 1:21 pm

Hi,

How about intersecting arrays of 64-bit integers? Any special trick or just replace _mm128 instructions with their _mm256 counterparts, f suffix for float with d suffix for double, and so on?

Thanks,
Bogdan
1. Daniel Lemire says:
  
  December 12, 2019 at 2:06 pm
  
  Using 64-bit integers reduce the benefits of SIMD instructions. I would first try to reengineer the problem.
  1. Bogdan Burlacu says:
    
    December 12, 2019 at 2:33 pm
    
    Thank you for your answer.
    Unfortunately that is not possible — the values come from hashing a very large number of objects where the number of unique objects might exceed the range of a 32-bit integer. But I am still exploring possible options for re-engineering. I think even without a large benefit a SIMD implementation would still be faster than the scalar version.
    1. Daniel Lemire says:
      
      December 12, 2019 at 2:42 pm
      
      If you have 64-bit hash values, what is the probability that they will collide on their least significant 32 bits? Is it 1/2^32? Or 2e-8 percent?
      1. Bogdan Burlacu says:
        
        December 12, 2019 at 8:50 pm
        
        The author of XXHash (the one I use) claims that “no bit can be used as a possible predictor for another bit.”
        So my initial guess would be 1/2^32.
        
        However, I did a practical test and I got:
        
        1.161% collision rate on the lower 32-bits for 100M unique 64-bit hash values
        2.305% collision rate for 200M
        3.431% collision rate for 300M
        
        Which leaves me somewhere in the 1e-8 ballpark.
        I am considering whether this is an artifact of how I’m doing the hashing. But this means that the SIMD 32-bit intersection would be somewhat lossy.
        
        Daniel Lemire says:
        
        December 12, 2019 at 11:06 pm
        
        Please see my post: Are 64-bit random identifiers free from collision? https://lemire.me/blog/2019/12/12/are-64-bit-random-identifiers-free-from-collision/