Seeing SSE2 used for string ops caused one colleague to remark, about that and AVX, “Why didn’t Intel just make REP SCASB (et al) fast?” Torvalds might be right; meanwhile, we can be opportunistic about odd cases for using odd instruction sets. Or use them for fast APL functions 🙂
Adam Stylinski says:
I mean…they did for a few things in that family. ERMS, or enhanced REP MOVSB, can be used for fast memsets and memmoves. Glibc is even aware of which CPUs have that capability and of when using that microcode trick is helpful. I think that family of “rep” semantics has some limitations on how much it can express. That, and the micro-op sequence it decodes to is probably a bit more complicated.
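For context, the ERMS trick boils down to handing the whole copy to a single REP MOVSB instruction and letting microcode pick the wide path. A minimal sketch (x86-64, GCC/Clang inline asm; not glibc’s actual implementation, which dispatches via ifunc resolvers):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A memcpy built on REP MOVSB. On CPUs advertising ERMS
   (CPUID.07H:EBX bit 9), microcode fast-paths this into wide,
   cache-line-sized moves. RDI/RSI/RCX are the fixed operands
   of MOVSB, hence the "D"/"S"/"c" constraints. */
static void *rep_movsb_memcpy(void *dst, const void *src, size_t n) {
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```

Glibc makes the same choice at load time: if the CPU reports ERMS, its memmove resolver can select a REP MOVSB-based variant instead of the vector loop.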
mischa sandberg says:
Thanks for pointing out ERMS. A benchmark on Stack Overflow wasn’t flattering (vs. AVX), and the Intel Optimization Reference Manual says ERMS’s advantage is smaller code. https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy.
For SSE2 memcpy of blocks > 192 bytes, I saw an advantage from prefetchnta. Would that threshold be the same for AVX?
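The pattern the comment describes can be sketched as follows: prefetch the source with the NTA hint a few cache lines ahead while copying 64-byte chunks with SSE2 loads and stores. The 192-byte crossover and the 256-byte prefetch distance are tuning parameters from the comment, not universal constants, and the AVX crossover would need to be re-measured. Assumes n is a multiple of 64 and both pointers are 16-byte aligned:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Copy n bytes (n % 64 == 0, 16-byte-aligned pointers), prefetching
   the source with a non-temporal hint to avoid polluting the cache.
   Prefetch of an address past the end of the buffer is a harmless hint. */
static void sse2_copy_prefetchnta(void *dst, const void *src, size_t n) {
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t i = 0; i < n; i += 64) {
        _mm_prefetch(s + i + 256, _MM_HINT_NTA); /* 4 cache lines ahead */
        __m128i a = _mm_load_si128((const __m128i *)(s + i));
        __m128i b = _mm_load_si128((const __m128i *)(s + i + 16));
        __m128i c = _mm_load_si128((const __m128i *)(s + i + 32));
        __m128i e = _mm_load_si128((const __m128i *)(s + i + 48));
        _mm_store_si128((__m128i *)(d + i), a);
        _mm_store_si128((__m128i *)(d + i + 16), b);
        _mm_store_si128((__m128i *)(d + i + 32), c);
        _mm_store_si128((__m128i *)(d + i + 48), e);
    }
}
```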
I believe Linus Torvalds was more concerned about the size of the AVX-512 state, which adds a huge amount of data to save and restore on a context switch, and also about overheating, which causes the CPU to downclock if the instructions are used too much. The same could be achieved if Intel extended 32-byte AVX2, and it would likely have similar speed. I wrote similar code for matrix multiplication, and the performance gain is far from linear, even ignoring downclocking.
AVX-512 extends the ISA to 64-byte registers. In fact, the code I describe in my blog post is easily adapted to 32-byte registers.
Ravi says:
You can discover which specialized instructions an Intel CPU supports by inspecting the flags field of /proc/cpuinfo.
Matt Williams says:
Great article! I’m confused why you’re using _mm512_mask_compressstoreu_epi16 rather than _mm512_mask_compressstoreu_epi8 – don’t you want to write/not write at an 8-bit granularity rather than a 16-bit one?