Daniel Lemire's blog

, 13 min read

Recognizing string prefixes with SIMD instructions

16 thoughts on “Recognizing string prefixes with SIMD instructions”

  1. Wayne Scott says:

    You could also just build a DFA:

    const int num_tokens = ...;
    const int num_states = ...;
    int char2token[256] = { ... };
    int statetable[num_states][num_tokens] = {...};

    bool match(const char *str)
    {
    int s = 0;
    while (*str && s >= 0) {
    int tok = char2token(*str);
    s = statetable[s][tok];
    }
    return (s == -1);
    }

    A state of -2 might be the non-matching condition.

    The single branch here is way more predictable than your switch tables.

    1. Wayne Scott says:

      missing the ++ on this line:`

      int tok = char2token[*str++];

      1. That’s true, it is a viable approach worth testing.

        1. Note that you can’t just process the whole input string. In my model, the input string could be infinite for all we know. So you need more branching to terminate the processing.

          1. Wayne Scott says:

            No, my code sample will stop when it hits a null or an exit state that indicates the starting place couldn’t possibly be one of your N prefixes. The state machine indicates you are still matching one or more of the prefixes. We can even have N negative states to return which prefix was matched and an additional state for “no match”.

            1. I stand corrected, I misread your code.

              1. I have added my implementation and updated the blog post.

            2. foobar says:

              There are several ways to still reduce branch misprediction penalties on DFA code: one can stop advancing the input pointer using branchless code after seeing a specific terminator, or one can use “don’t care” DFA states on a longer buffer, which doesn’t even need to be zero-padded from the end. This way one can effectively run the loop a fixed number of iterations without unpredictable branches – which is practical if the length of possible input strings doesn’t vary much, but if it does this is not such a great idea.

              Of course the issue with general-purpose DFAs is that even the L1 load-to-load latency tends to be like three cycles, and you can’t really parallelise an individual DFA.

              There’s earlier code for a compact DFA with variations in the repo: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/b067aa68e3810de200c2b575e9130d54aeeb118d/2022/12/29/protocol.cpp#L98-L361 – it does only a match instead of separating between different correct matches but it could be easily extended to also return an index of recognised input…

    2. foobar says:

      There’s another method which doesn’t demand state table lookup dependency chains. One can create a bit vector where every bit represents a possible matching string position. On a table of masks those bits which correspond to a matching character at specific position for a specific string are set – single-character multiple choices/wildcards/don’t cares are possible by setting more bits. Now a match can be sought simply by looking up mask value for every character position in the input and ANDing these together, and looking for bits which are still set on the input.

      Since the benchmark on this blog post has over 64 candidate matches and compiles gracefully only on x86 I didn’t experiment with this approach in this case, but it should be possible to at least match typical DFA performance for small number of potential string matches.

      1. Because we have lots of strings, with various lengths, it would be hard to avoid doing a bunch of branching in the approach you propose?

        compiles gracefully only on x86

        Yes. It is specific to x86 at this time (old Westmere processors), but can be ported with relative ease to more recent (e.g., AVX-2, AVX-512, NEON) systems.

        1. foobar says:

          Well, it depends. One might have only couple unpredictable branches as a compromise. It might help, especially if there would be a practical cut-off for this purpose. I do doubt if there’s much of a benefit, though, as long as efficient hashing works.

          No problem with the fact this code is x86 only – just that formulating my code in a benchmarkable fashion is a bit tedious today. 😐

  2. KWillets says:

    I fiddled with this problem back then, and I came up with a method of building byte-sized running hashes. The property of the running hash is that it will emit the same bytes for identical prefixes, and it doesn’t need to know where the string ends, so it can be built for a fixed-size chunk of the input string and compared bytewise for matches.

    The matching step is to compare the prefix hash byte at the end of each string in the table (preprocessed) to the corresponding byte in the hashed input string, extracted with pshufb. Any hits then have to be compared against the original of course.

    An easy 8-byte running hash is (little-endian) multiplication by a constant.

    1. @KWillets Can you elaborate?

      1. KWillets says:

        Here’s the github, sorry I didn’t have it handy when posting earlier: https://github.com/KWillets/prefix_in_table/blob/master/prefix_set.cpp

        That code only indicates a hash match; a full string match is also needed.

  3. Jeff Plaisance says:

    I tested out doing this with vectorscan (hyperscan) 5.4.9 on an AMD 7950x3d with the following results:

    sse_type : 1.82 GB/s 481.4 Ma/s 2.08 ns/d 5.48 GHz 11.39 c/d 44.00 i/d 3.0 c/b 11.67 i/b 3.86 i/c
    sse_table : 1.61 GB/s 426.5 Ma/s 2.34 ns/d 5.46 GHz 12.80 c/d 50.00 i/d 3.4 c/b 13.26 i/b 3.91 i/c
    simple_trie : 0.37 GB/s 97.6 Ma/s 10.25 ns/d 5.66 GHz 57.99 c/d 44.58 i/d 15.4 c/b 11.82 i/b 0.77 i/c
    bsearch : 0.09 GB/s 24.8 Ma/s 40.33 ns/d 5.56 GHz 224.30 c/d 332.53 i/d 59.5 c/b 88.19 i/b 1.48 i/c
    sse_length : 2.70 GB/s 716.9 Ma/s 1.39 ns/d 5.23 GHz 7.30 c/d 19.00 i/d 1.9 c/b 5.04 i/b 2.60 i/c
    finite_match : 0.62 GB/s 165.4 Ma/s 6.05 ns/d 5.43 GHz 32.85 c/d 63.02 i/d 8.7 c/b 16.71 i/b 1.92 i/c
    std::lower_bound : 0.05 GB/s 13.6 Ma/s 73.45 ns/d 5.50 GHz 403.78 c/d 665.70 i/d 107.1 c/b 176.55 i/b 1.65 i/c
    vectorscan : 0.10 GB/s 27.6 Ma/s 36.26 ns/d 5.51 GHz 199.84 c/d 537.14 i/d 53.0 c/b 142.46 i/b 2.69 i/c

    I was expecting vectorscan to do better than that and found this result somewhat surprising. I think case insensitive matching is having a significant impact on vectorscan’s performance on this task.

  4. Markus Schaber says:

    I had a similar problem those days, but my code already knew the position of the separator char, and thus, the length of the string to match.
    So I could just skip when the length was smaller than the shortest keyword, or longer than the longest keyword.
    Also, I had only 1 or 2 keywords with the same length, so using the length as preselection criteria came naturally.