It gets trickier, evidently. I might do it in a future blog post.
Bernard Bsays:
This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.
nosays:
By “removed” you mean willingly installed a BIOS update that removes the AVX512 support? I’ve go an Alder Lake that has AVX512, and it’ll stay that way, because I can read BIOS changelogs.
3yav1xusays:
If you expect Users to change bios settings in order to run your code on their systems, you will not have any users. Of course if you are writing software for private use this is not a concern.
Dennissays:
Hi Daniel,
I made a very similar discovery quite a few years ago in one of my university courses. It was a course on string search algorithms in which we all had to implement one of the algorithms. I decided to to a simple sliding window algorithm. Which one exactly I don’t remember but the catch was that I implemented it in x64 Assembler with the use of AVX instructions.
At the end of the course we did a comparision of the different implementations on one machine and needless to say mine blew everything away by far.
More modern implementations with dictionaries and other improvements stood no chance because of the overhead introduced by the programming languages used.
Your approach using C++ is for sure way easier to implement as I spent weeks learning and debugging assembler in NASM. After all that I wondered if there is any company that’s so much in need of performance-optimised code.
-.-says:
`VPEXPANDB` can alternatively be used to do this. You may need to use a PDEP/PEXT combo on the mask to get it to the right place though.
I wanted to use `VPEXPANDB initially but I found it easier to go in the other direction.
Attilasays:
Would it still be 5x faster? I am just worrying a bit that someone inexperienced goes and says “yay, five times faster!” and implements a quote without realizing that it does not quote everything. And things like that tend to turn into vulnerabilities.
-.-says:
I wouldn’t be surprised if it’s actually faster, due to reducing shuffle port pressure, but I have no benchmarks. And yes, it fully works. See the following for a bit of a writeup (aim is different, but concept is same): https://www.reddit.com/r/simd/comments/x10e3x/avx512vbmi2_doubling_space/
sasuke420says:
Does it go any faster if you use pshufb then cmpeq_mask instead of two cmpeq_mask and an or?
The blend __m512i escaped = _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus); can be an unconditional bitwise-OR escaped = _mm512_or_si512(shifted_input1, _mm512_set1_epi16(‘\\’)). Later on, the compress applies the mask.
You definitively need a compiler with full support for VBMI2.
Bernard Bsays:
This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.
Intel has not removed AVX-512 instructions. Some of its processors support AVX-512 instructions, others do not.
Nerijussays:
Nice article.
Wondering if similar approach can be used implementing KISS frame escapes. I KISS frame every 0xC0 must be replaced into 0xDB and 0xDC and similarly for 0xDB – to another two bytes .
How would this be extended to quote *all* characters that need quoting? Tabs, backslashes and whatnot.
It gets trickier, evidently. I might do it in a future blog post.
This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.
By “removed” you mean willingly installed a BIOS update that removes the AVX512 support? I’ve go an Alder Lake that has AVX512, and it’ll stay that way, because I can read BIOS changelogs.
If you expect Users to change bios settings in order to run your code on their systems, you will not have any users. Of course if you are writing software for private use this is not a concern.
Hi Daniel,
I made a very similar discovery quite a few years ago in one of my university courses. It was a course on string search algorithms in which we all had to implement one of the algorithms. I decided to to a simple sliding window algorithm. Which one exactly I don’t remember but the catch was that I implemented it in x64 Assembler with the use of AVX instructions.
At the end of the course we did a comparision of the different implementations on one machine and needless to say mine blew everything away by far.
More modern implementations with dictionaries and other improvements stood no chance because of the overhead introduced by the programming languages used.
Your approach using C++ is for sure way easier to implement as I spent weeks learning and debugging assembler in NASM. After all that I wondered if there is any company that’s so much in need of performance-optimised code.
`VPEXPANDB` can alternatively be used to do this. You may need to use a PDEP/PEXT combo on the mask to get it to the right place though.
I wanted to use `VPEXPANDB initially but I found it easier to go in the other direction.
Would it still be 5x faster? I am just worrying a bit that someone inexperienced goes and says “yay, five times faster!” and implements a quote without realizing that it does not quote everything. And things like that tend to turn into vulnerabilities.
I wouldn’t be surprised if it’s actually faster, due to reducing shuffle port pressure, but I have no benchmarks. And yes, it fully works. See the following for a bit of a writeup (aim is different, but concept is same):
https://www.reddit.com/r/simd/comments/x10e3x/avx512vbmi2_doubling_space/
Does it go any faster if you use pshufb then cmpeq_mask instead of two cmpeq_mask and an or?
How would that work? Can you provide a code sample?
Looks something like
is_quote_or_solidus = _mm512_cmpeq_epi8_mask(
input1,
_mm512_shuffle_epi8(input1, _mm512_broadcast_i32x4(_mm_set_epi8(
0, 0, 0, ‘\\’, 0, 0, 0, 0, 0, 0, 0, 0, 0, ‘”‘, 0, -1
)))
);
Works better if there’s more characters to match, as long as the bottom 4 bits are unique between them (if not, you can prod the data to make it so).
I think I step 4 you mean “compare”, not “copy these bytes with…’
Thanks.
The blend __m512i escaped = _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus); can be an unconditional bitwise-OR escaped = _mm512_or_si512(shifted_input1, _mm512_set1_epi16(‘\\’)). Later on, the compress applies the mask.
Correct.
Thanks for interesting insights! I (an mechatronics engineer working mainly with Python) spent whole evening reading about SIMD/AVX512 and NEON.
I tried to compile the espace.cpp – but onfortunately it does not build with g++ 9.4 (on my ubuntu20.04-wsl2 windows machine).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95483
I think you need min. gcc-11.1.0 for your benchmark.
You definitively need a compiler with full support for VBMI2.
This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.
Intel has not removed AVX-512 instructions. Some of its processors support AVX-512 instructions, others do not.
Nice article.
Wondering if similar approach can be used implementing KISS frame escapes. I KISS frame every 0xC0 must be replaced into 0xDB and 0xDC and similarly for 0xDB – to another two bytes .