14th September 2022, 14 min read

Escaping strings faster with AVX-512

22 thoughts on “Escaping strings faster with AVX-512”

Attila says:

September 15, 2022 at 7:07 am

How would this be extended to quote *all* characters that need quoting? Tabs, backslashes and whatnot.
1. Daniel Lemire says:
  
  September 15, 2022 at 3:21 pm
  
  It gets trickier, evidently. I might do it in a future blog post.
  1. Bernard B says:
    
    September 16, 2022 at 10:01 am
    
    This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.
    1. no says:
      
      September 17, 2022 at 5:57 am
      
      By “removed” you mean willingly installed a BIOS update that removes the AVX512 support? I’ve go an Alder Lake that has AVX512, and it’ll stay that way, because I can read BIOS changelogs.
      1. 3yav1xu says:
        
        January 1, 2023 at 1:39 pm
        
        If you expect Users to change bios settings in order to run your code on their systems, you will not have any users. Of course if you are writing software for private use this is not a concern.
Dennis says:

September 15, 2022 at 8:57 am

Hi Daniel,

I made a very similar discovery quite a few years ago in one of my university courses. It was a course on string search algorithms in which we all had to implement one of the algorithms. I decided to to a simple sliding window algorithm. Which one exactly I don’t remember but the catch was that I implemented it in x64 Assembler with the use of AVX instructions.

At the end of the course we did a comparision of the different implementations on one machine and needless to say mine blew everything away by far.
More modern implementations with dictionaries and other improvements stood no chance because of the overhead introduced by the programming languages used.

Your approach using C++ is for sure way easier to implement as I spent weeks learning and debugging assembler in NASM. After all that I wondered if there is any company that’s so much in need of performance-optimised code.
-.- says:

September 15, 2022 at 10:08 am

`VPEXPANDB` can alternatively be used to do this. You may need to use a PDEP/PEXT combo on the mask to get it to the right place though.
1. Daniel Lemire says:
  
  September 15, 2022 at 3:14 pm
  
  I wanted to use `VPEXPANDB initially but I found it easier to go in the other direction.
2. Attila says:
  
  September 15, 2022 at 5:37 pm
  
  Would it still be 5x faster? I am just worrying a bit that someone inexperienced goes and says “yay, five times faster!” and implements a quote without realizing that it does not quote everything. And things like that tend to turn into vulnerabilities.
  1. -.- says:
    
    September 15, 2022 at 10:43 pm
    
    I wouldn’t be surprised if it’s actually faster, due to reducing shuffle port pressure, but I have no benchmarks. And yes, it fully works. See the following for a bit of a writeup (aim is different, but concept is same):
    https://www.reddit.com/r/simd/comments/x10e3x/avx512vbmi2_doubling_space/
sasuke420 says:

September 15, 2022 at 10:38 am

Does it go any faster if you use pshufb then cmpeq_mask instead of two cmpeq_mask and an or?
1. Daniel Lemire says:
  
  September 15, 2022 at 3:14 pm
  
  How would that work? Can you provide a code sample?
  1. -.- says:
    
    September 15, 2022 at 10:50 pm
    
    Looks something like
    
    is_quote_or_solidus = _mm512_cmpeq_epi8_mask(
    input1,
    _mm512_shuffle_epi8(input1, _mm512_broadcast_i32x4(_mm_set_epi8(
    0, 0, 0, ‘\\’, 0, 0, 0, 0, 0, 0, 0, 0, 0, ‘”‘, 0, -1
    )))
    );
    
    Works better if there’s more characters to match, as long as the bottom 4 bits are unique between them (if not, you can prod the data to make it so).
Matt Godbolt says:

September 15, 2022 at 12:18 pm

I think I step 4 you mean “compare”, not “copy these bytes with…’
1. Daniel Lemire says:
  
  September 15, 2022 at 3:18 pm
  
  Thanks.
Wojciech Mu?a says:

September 15, 2022 at 8:40 pm

The blend __m512i escaped = _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus); can be an unconditional bitwise-OR escaped = _mm512_or_si512(shifted_input1, _mm512_set1_epi16(‘\\’)). Later on, the compress applies the mask.
1. Daniel Lemire says:
  
  September 16, 2022 at 2:55 pm
  
  Correct.
Marat Kopytjuk says:

September 16, 2022 at 7:52 am

Thanks for interesting insights! I (an mechatronics engineer working mainly with Python) spent whole evening reading about SIMD/AVX512 and NEON.

I tried to compile the espace.cpp – but onfortunately it does not build with g++ 9.4 (on my ubuntu20.04-wsl2 windows machine).

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95483

I think you need min. gcc-11.1.0 for your benchmark.
1. Daniel Lemire says:
  
  September 16, 2022 at 2:14 pm
  
  You definitively need a compiler with full support for VBMI2.
Bernard B says:

September 16, 2022 at 10:02 am

This whole post is outdated. Why do coders fail to keep up with hardware development? The two go hand in hand. Intel have removed avx512 instructions. But the good news is the upcoming 7000 series from AMD have introduced them.
1. Daniel Lemire says:
  
  September 16, 2022 at 2:15 pm
  
  Intel has not removed AVX-512 instructions. Some of its processors support AVX-512 instructions, others do not.
Nerijus says:

February 2, 2023 at 4:45 am

Nice article.

Wondering if similar approach can be used implementing KISS frame escapes. I KISS frame every 0xC0 must be replaced into 0xDB and 0xDC and similarly for 0xDB – to another two bytes .