Daniel Lemire's blog


Removing characters from strings faster with AVX-512

7 thoughts on “Removing characters from strings faster with AVX-512”

  1. mischa sandberg says:

    Seeing SSE2 used for string ops caused one colleague to remark, about that and AVX, “Why didn’t Intel just make REP SCASB (et al) fast?” Torvalds might be right; meanwhile, we can be opportunistic about odd cases for using odd instruction sets. Or use them for fast APL functions 🙂

    1. Adam Stylinski says:

      I mean… they did for a few things in that family. ERMS (Enhanced REP MOVSB) can be used for fast memsets and memmoves. Glibc is even aware of which CPUs have that capability and of when using that microcode trick is actually helpful. I think that family of “rep” semantics is limited in how much it can express. That, and the micro-op sequence it expands to is probably a bit more complicated.

      1. mischa sandberg says:

        Thanks for pointing out ERMS. A benchmark on Stack Overflow wasn’t flattering (vs. AVX), and the Intel Optimization Reference Manual says ERMS’s advantage is smaller code.
        https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy
        For SSE2 memcpy of blocks larger than 192 bytes, I saw an advantage from prefetchnta. Would that size threshold be the same for AVX?

  2. Zbynek says:

    I believe Linus Torvalds’s objection was more about the size of the AVX-512 state, which adds a large amount of data to save and restore on each context switch, and about overheating, which causes the CPU to downclock when AVX-512 is used heavily. The same could have been achieved if Intel had extended 32-byte AVX2 instead, and it would likely have had similar speed. I wrote similar code for matrix multiplication, and the performance gain is far from linear, even ignoring the downclocking.

    1. AVX-512 extends the ISA for 32-byte registers. In fact, the code I describe in my blog post is easily adapted to 32-byte registers.

  3. Ravi says:

    You can discover which specialized instructions an Intel CPU supports by reading /proc/cpuinfo on Linux.
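
    As a rough illustration of that suggestion, here is a minimal sketch in C that looks for a feature name on the `flags` line of /proc/cpuinfo. It assumes Linux on x86, where the kernel reports names such as `avx512f`, `avx512bw` and `avx512_vbmi2`. The helper names `line_has_flag` and `cpu_has_flag` are made up for this example and are not part of any library.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: true if `flag` appears in `line` as a whole
     * whitespace-delimited token (so "avx512" does not match "avx512f"). */
    bool line_has_flag(const char *line, const char *flag) {
        assert(line != NULL && flag != NULL);
        size_t n = strlen(flag);
        if (n == 0) return false;
        const char *p = line;
        while ((p = strstr(p, flag)) != NULL) {
            bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
            char end = p[n];
            bool ends = end == '\0' || end == ' ' || end == '\t' || end == '\n';
            if (starts && ends) return true;
            p += 1; /* keep scanning after a partial match */
        }
        return false;
    }

    /* Hypothetical helper: checks the first "flags" line of /proc/cpuinfo.
     * Returns false if the file cannot be opened (for example, not Linux). */
    bool cpu_has_flag(const char *flag) {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (f == NULL) return false;
        char line[16384]; /* the flags line can be several kilobytes long */
        bool found = false;
        while (fgets(line, sizeof line, f) != NULL) {
            if (strncmp(line, "flags", 5) == 0) {
                found = line_has_flag(line, flag);
                break; /* all cores normally report the same flags */
            }
        }
        fclose(f);
        return found;
    }
    ```

    Calling `cpu_has_flag("avx512_vbmi2")` would then tell you whether the kernel reports the extension that provides the byte-level compress instructions. This is only a sketch: production code would more likely query CPUID directly or use a compiler builtin, since /proc/cpuinfo is Linux-specific.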

  4. Matt Williams says:

    Great article! I’m confused why you’re using _mm512_mask_compressstoreu_epi16 rather than _mm512_mask_compressstoreu_epi8 – don’t you want to write/not write at an 8-bit granularity rather than a 16-bit one?
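
    To make the granularity question concrete, here is a portable scalar model in C of what a masked compress-store does, parameterized by lane width. It is not the intrinsic itself and does not show the code from the post. The names `compress_store_model` and `keep_mask_bytes` are hypothetical, and the model assumes at most 64 lanes, with mask bit i controlling lane i.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Scalar model of a masked compress-store: lanes whose mask bit is set
     * are written contiguously to dst. Returns the number of bytes written.
     * lane_bytes = 1 models 8-bit lanes, lane_bytes = 2 models 16-bit lanes. */
    size_t compress_store_model(uint8_t *dst, const uint8_t *src,
                                size_t lane_count, size_t lane_bytes,
                                uint64_t keep_mask) {
        assert(lane_count <= 64);
        size_t written = 0;
        for (size_t i = 0; i < lane_count; i++) {
            if ((keep_mask >> i) & 1) {
                memcpy(dst + written, src + i * lane_bytes, lane_bytes);
                written += lane_bytes;
            }
        }
        return written;
    }

    /* Builds a per-byte keep mask: bit i is set when src[i] != unwanted. */
    uint64_t keep_mask_bytes(const uint8_t *src, size_t n, uint8_t unwanted) {
        assert(n <= 64);
        uint64_t mask = 0;
        for (size_t i = 0; i < n; i++) {
            if (src[i] != unwanted) mask |= (uint64_t)1 << i;
        }
        return mask;
    }
    ```

    With 8-bit lanes, every byte has its own mask bit, so a single space can be dropped. With 16-bit lanes, each mask bit keeps or drops a pair of adjacent bytes, so a space that shares a lane with a wanted character cannot be removed on its own. That is why byte granularity is the natural fit for removing single-byte characters.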