Daniel Lemire's blog


How fast can you bit-interleave 32-bit integers? (SIMD edition)

8 thoughts on “How fast can you bit-interleave 32-bit integers? (SIMD edition)”

  1. Nathan Kurz says:

    From here: *I am disappointed as I expected vector code to be about twice as fast.*

    From previous: *The GCC compiler seems to hate my deinterleaving code and produces very slow binaries.*

    I haven’t gone through to figure out the expected speeds, but if you think it should be twice as fast, it seems likely that it could be made twice as fast. Is there reason to believe that this isn’t another case of the compiler(s) creating suboptimal assembly for your otherwise fast algorithm?

    My quick guess would be that the compiler is reloading all the 256-bit constants on every call. Normally, this wouldn’t slow things down, but in a tightly tuned approach like yours it might be the limiting factor. Would 8 loads (and a possibly complex-addressed store competing for the same address units) account for the observed speed?

    1. Kendall’s comment pushed me to update the post with a table lookup approach, based on nibbles. My vector code is now clearly faster than pdep.

      Possibly someone (maybe Kendall) can do even better.

  2. KWillets says:

    I think the zero-interleaving can be around 6 AVX instructions per four 32-bit ints. The key is to shift/mask them into even and odd nybbles and look up their 8-bit expansions with vpshufb. After that it’s just a matter of getting the 32 bytes into the right order.
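    The nybble-lookup idea can be sketched in scalar C: each 4-bit nibble expands to an 8-bit value with zeroes interleaved, via a 16-entry table. In the vector version, this same table would serve as the vpshufb shuffle control, processing 16 or 32 nibbles at once; the scalar loop below is only meant to show what the table computes.

    ```c
    #include <stdint.h>

    /* Each 4-bit nibble n maps to an 8-bit value whose even bits are the
       bits of n, with zeroes in between. In the SIMD variant this table
       is the first operand of vpshufb and the nibbles are the indices. */
    static const uint8_t nibble_spread[16] = {
        0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
        0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55
    };

    /* Spread the 32 bits of x into the even bit positions of a 64-bit word. */
    uint64_t interleave_with_zeroes(uint32_t x) {
        uint64_t out = 0;
        for (int i = 0; i < 8; i++) {
            out |= (uint64_t)nibble_spread[(x >> (4 * i)) & 0xF] << (8 * i);
        }
        return out;
    }
    ```

    Full bit-interleaving of two 32-bit integers is then `interleave_with_zeroes(a) | (interleave_with_zeroes(b) << 1)`.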

  3. James Prichard says:

    Your nybble version is clever. Might it be suitable to contribute to Intel’s ray-tracing kernel framework?


    1. Hmmm… the same idea could be used, it looks like…

      1. KWillets says:

        I put my code in a small repo here: https://github.com/KWillets/simd_interleave . It’s slightly different from Daniel’s problem definition, so I didn’t try to integrate it with his repo or tests, unfortunately.

  4. Nils says:

    I’m late to the party, but I just want to point out that on architectures where no pdep instruction is available, you can often abuse the Galois-field arithmetic instructions that made it into the instruction sets to speed up cryptography.

    On ARM NEON, for example, if you square a number using vmul_p8, you’re bit-interleaving it with zeroes. On Intel, PCLMULQDQ will do the same trick.
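    The reason squaring works: in GF(2)[x], cross terms cancel, so the carry-less square of a polynomial is the sum of the squares of its terms, and bit i of the input lands at bit 2i of the output. A software emulation of carry-less multiplication makes this easy to check (the hardware instructions vmul_p8/PCLMULQDQ compute the same product without the loop):

    ```c
    #include <stdint.h>

    /* Carry-less (polynomial over GF(2)) multiplication, emulated in
       software. PCLMULQDQ and NEON's polynomial multiplies compute the
       same thing in hardware. */
    uint64_t clmul(uint32_t a, uint32_t b) {
        uint64_t result = 0;
        for (int i = 0; i < 32; i++) {
            if ((b >> i) & 1) {
                result ^= (uint64_t)a << i;
            }
        }
        return result;
    }

    /* Carry-less squaring moves bit i to bit 2i: interleaving with zeroes. */
    uint64_t spread_bits(uint32_t x) {
        return clmul(x, x);
    }
    ```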

  5. Stefano Trevisani says:

    Very nice! I ended up here while trying to compile my code (which uses pdep) on ARM. Unfortunately, yep, that’s what RISC means, I guess…
    Just a note on the tests: please, please use struct timespec/clock_gettime(). If you are as picky as I am about these little things, it will make a “big” difference in precision and in WHAT you are actually measuring (e.g. thread-time-only clocks) 😀
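    For reference, a minimal timing harness along those lines might look like this; CLOCK_MONOTONIC measures elapsed wall time, while CLOCK_THREAD_CPUTIME_ID would restrict the measurement to CPU time spent in the calling thread, which is presumably the “thread-time only” distinction meant above:

    ```c
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec start, end;
        /* CLOCK_MONOTONIC: elapsed wall time, unaffected by clock adjustments.
           Swap in CLOCK_THREAD_CPUTIME_ID to measure this thread's CPU time only. */
        clock_gettime(CLOCK_MONOTONIC, &start);

        volatile unsigned long long sum = 0; /* volatile: keep the work from being optimized away */
        for (int i = 0; i < 1000000; i++) {
            sum += (unsigned long long)i;
        }

        clock_gettime(CLOCK_MONOTONIC, &end);
        double ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                  + (double)(end.tv_nsec - start.tv_nsec);
        printf("elapsed: %.0f ns\n", ns);
        return 0;
    }
    ```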