Seeing SSE2 used for string ops caused one colleague to remark, about that and AVX, “Why didn’t Intel just make REP SCASB (et al) fast?” Torvalds might be right; meanwhile, we can be opportunistic about odd cases for using odd instruction sets. Or use them for fast APL functions 🙂
Adam Stylinski says:
I mean…they did for a few things in that family. ERMS, or enhanced REP MOVSB, can be used for fast memsets and memmoves. Glibc is even aware of which CPUs have that capability and of when using that microcode trick is helpful. I think that family of “rep” semantics has some limitations on how much it can express. That, and the micro-op sequence it decodes to is probably a bit more complicated.
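For context, the ERMS trick boils down to handing the whole copy to a single REP MOVSB instruction and letting microcode pick the wide path. A minimal sketch (x86-64, GCC/Clang inline asm; not glibc’s actual implementation, which dispatches via ifunc resolvers):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A memcpy built on REP MOVSB. On CPUs advertising ERMS
   (CPUID.07H:EBX bit 9), microcode fast-paths this into wide,
   cache-line-sized moves. RDI/RSI/RCX are the fixed operands
   of MOVSB, hence the "D"/"S"/"c" constraints. */
static void *rep_movsb_memcpy(void *dst, const void *src, size_t n) {
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```

Glibc makes the same choice at load time: if the CPU reports ERMS, its memmove resolver can select a REP MOVSB-based variant instead of the vector loop.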
mischa sandberg says:
Thanks for pointing out ERMS. A benchmark on Stack Overflow wasn’t flattering (vs. AVX), and the Intel Optimization Reference Manual says ERMS’s advantage is smaller code. https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy.
For SSE2 memcpy of blocks > 192 bytes, I saw an advantage from prefetchnta. Would that threshold be the same for AVX?
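The pattern the comment describes can be sketched as follows: prefetch the source with the NTA hint a few cache lines ahead while copying 64-byte chunks with SSE2 loads and stores. The 192-byte crossover and the 256-byte prefetch distance are tuning parameters from the comment, not universal constants, and the AVX crossover would need to be re-measured. Assumes n is a multiple of 64 and both pointers are 16-byte aligned:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Copy n bytes (n % 64 == 0, 16-byte-aligned pointers), prefetching
   the source with a non-temporal hint to avoid polluting the cache.
   Prefetch of an address past the end of the buffer is a harmless hint. */
static void sse2_copy_prefetchnta(void *dst, const void *src, size_t n) {
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t i = 0; i < n; i += 64) {
        _mm_prefetch(s + i + 256, _MM_HINT_NTA); /* 4 cache lines ahead */
        __m128i a = _mm_load_si128((const __m128i *)(s + i));
        __m128i b = _mm_load_si128((const __m128i *)(s + i + 16));
        __m128i c = _mm_load_si128((const __m128i *)(s + i + 32));
        __m128i e = _mm_load_si128((const __m128i *)(s + i + 48));
        _mm_store_si128((__m128i *)(d + i), a);
        _mm_store_si128((__m128i *)(d + i + 16), b);
        _mm_store_si128((__m128i *)(d + i + 32), c);
        _mm_store_si128((__m128i *)(d + i + 48), e);
    }
}
```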
I believe Linus Torvalds was more concerned about the size of the AVX-512 state, which adds a huge amount of data to save and restore on a context switch, and also about overheating, which causes the CPU to downclock if the instructions are used too much. The same could be achieved if Intel extended 32-byte AVX2, and it would likely have similar speed. I wrote similar code for matrix multiplication, and the performance gain is far from linear, even ignoring downclocking.
AVX-512 extends the ISA to 64-byte registers. In fact, the code I describe in my blog post is easily adapted to 32-byte registers.
Ravi says:
You can discover which specialized instructions an Intel CPU supports by inspecting the flags field of /proc/cpuinfo.
Matt Williams says:
Great article! I’m confused why you’re using _mm512_mask_compressstoreu_epi16 rather than _mm512_mask_compressstoreu_epi8 – don’t you want to write/not write at an 8-bit granularity rather than a 16-bit one?