Daniel Lemire's blog


Quickly checking that a string belongs to a small set

37 thoughts on “Quickly checking that a string belongs to a small set”

  1. Volker Simonis says:

    Just out of interest, have you tried to measure a regex-based solution as well?

    1. In my tests, regex is 100x slower. Of course, results will vary based on the underlying implementation.

    2. Nathan Myers says:

      Seems like a good place for a bloom filter.

      Presumably the overwhelming majority of words seen don’t match, so you would win by rejecting non-matches quickly.

      1. Jens Alfke says:

        A Bloom filter will require hashing the string six or seven times with different hash functions. And it has false positives, so you still have to do a real set-membership test if the filter returns true.

        I didn’t see anything in the post stating that matches will be rare, and in the benchmark code 60% of the strings match, so this doesn’t seem like a good trade-off.

        1. A conventional Bloom filter would be overkill here, but a hashing based technique could work.
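
          For illustration, a minimal sketch of one such hashing-based prefilter, assuming the protocol strings from the post (the single multiplicative hash and the 64-bit bitmap are arbitrary choices, not the post's code):

          #include <array>
          #include <cstdint>
          #include <cstring>
          #include <string_view>

          // Load up to 8 bytes of the string into a zero-padded 64-bit integer.
          static uint64_t load8(std::string_view s) {
              uint64_t v = 0;
              std::memcpy(&v, s.data(), s.size() < 8 ? s.size() : 8);
              return v;
          }

          // One cheap multiplicative hash; the top 6 bits select one of 64 filter bits.
          static uint32_t bucket(uint64_t x) { return uint32_t((x * 0x9E3779B97F4A7C15ULL) >> 58); }

          static constexpr std::array<std::string_view, 6> protocols =
              {"https", "http", "file", "ftp", "wss", "ws"};

          static const uint64_t filter = [] {
              uint64_t bits = 0;
              for (auto p : protocols) bits |= uint64_t(1) << bucket(load8(p));
              return bits;
          }();

          bool prefiltered_is_special(std::string_view s) {
              if (s.size() < 2 || s.size() > 5) return false;       // key lengths are 2..5
              if (!((filter >> bucket(load8(s))) & 1)) return false; // most non-members rejected here
              for (auto p : protocols)                               // exact confirmation
                  if (s == p) return true;
              return false;
          }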

        2. Herb Alist says:

          There is no need for different hash functions with a bloom filter. You just mix in a different salt for each iteration.

          But if the list of strings is known in advance as in this example, then gperf would be even simpler.

          1. I have added gperf to the benchmark. It is competitive but it is not a match for the fast approaches I propose. This being said, it could maybe be made competitive with some extra engineering.

  2. Wouter Bijlsma says:

    The ‘direct’ example uses bitwise | instead of logical ||, so there is no short-circuit evaluation? I guess that would change the timing to about 0.5x of what you have now (amortised, if the needle is uniformly distributed).

    1. The code has both functions. Results are sensitive to the compiler.

      1. I have updated the blog post to include both versions.
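
        For reference, a sketch of the two variants being discussed (the key set is from the post; the size-clamped load is only there to keep the sketch self-contained):

        #include <cstdint>
        #include <cstring>
        #include <string_view>

        static uint64_t string_to_uint64(std::string_view s) {
            uint64_t v = 0;
            std::memcpy(&v, s.data(), s.size() < 8 ? s.size() : 8);
            return v;
        }

        // Bitwise |: every comparison is evaluated; the result is typically branch-free.
        bool is_special_bitwise(std::string_view s) {
            uint64_t x = string_to_uint64(s);
            return (x == string_to_uint64("https")) | (x == string_to_uint64("http"))
                 | (x == string_to_uint64("file"))  | (x == string_to_uint64("ftp"))
                 | (x == string_to_uint64("wss"))   | (x == string_to_uint64("ws"));
        }

        // Logical ||: short-circuits after the first hit, at the price of branches.
        bool is_special_logical(std::string_view s) {
            uint64_t x = string_to_uint64(s);
            return x == string_to_uint64("https") || x == string_to_uint64("http")
                || x == string_to_uint64("file")  || x == string_to_uint64("ftp")
                || x == string_to_uint64("wss")   || x == string_to_uint64("ws");
        }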

  3. JRL says:

    It is a bit unfair to expect the input strings to be padded to 8 or 4 bytes. In practice, the parameter to memcpy should be input.size() instead of sizeof(uint..), but that seems to tank the performance.
    I played a bit with the example (on godbolt: zd9c9YM8b), and the “fast” implementation was the most stable (similar duration over multiple runs). The branchless implementation is consistently the fastest, and the hash version is the sweet spot between clarity and performance.
    I also tried a lame implementation of a letter graph, and it seems to be somewhere in between “branchless” and “fast”.

    1. > It is a bit unfair to expect the input strings to be padded to 8 or 4 bytes.

      For a general-purpose function, I agree, but if it is part of your own system where you control where the strings come from, then I disagree. Requiring padding to your strings is quite doable if you have fine control over your data structures.
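
      To make the trade-off concrete, a sketch of the two loads (the names are mine): the size-clamped copy is safe for arbitrary callers, while the fixed-size read assumes the caller zero-pads its strings to at least 8 readable bytes:

      #include <cstdint>
      #include <cstring>
      #include <string_view>

      // Safe for any input: copy only the bytes that exist, zero-fill the rest.
      uint64_t load_unpadded(std::string_view s) {
          uint64_t v = 0;
          std::memcpy(&v, s.data(), s.size() < 8 ? s.size() : 8);
          return v;
      }

      // Assumes the buffer behind s is zero-padded to 8 readable bytes;
      // compiles to a single 8-byte load instead of a length-dependent copy.
      uint64_t load_padded(std::string_view s) {
          uint64_t v;
          std::memcpy(&v, s.data(), sizeof(v));
          return v;
      }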

  4. Mark Hahn says:

    I wonder when perfect hashing becomes the winner.

  5. Xoranth says:

    Wojciech Mula has written a note about a similar problem in the past. He found that using a perfect hash function was the fastest way, and that switching character by character was faster than SWAR techniques like the one you mention in this post.

    I.e. see
    http://0x80.pl/notesen/2022-01-29-http-verb-parse.html

    1. Damn it. Wojciech is always one step ahead of me.

      (For people reading this, Wojciech is a collaborator of mine.)

    2. Patrick says:

      In my functional programming language Fexl, I’m using an “index” technique which identifies maximal common prefixes among keys and determines the optimal sequence of individual three-way character comparisons to execute before doing a final definitive full key comparison.

      Here are the operations index_get, index_put, index_del, index_pairs:

      https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index/index.fxl

      Here is a test suite:

      https://github.com/chkoreff/Fexl/blob/master/src/test/b15.fxl

      My next step is to write a Fexl program which, when given a set of key-value pairs, automatically writes an optimal C function which looks up the value associated with a key. By “optimal” I mean using the index technique, not using a hash function. It would write the C function to be given the length of the key up front, not relying on trailing NUL.
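
      As a rough illustration of the kind of C such a generator could emit for the protocol set from the post (hand-written here, with made-up values; not Fexl output): split on a few discriminating characters, then confirm with one full comparison, with the key length passed in so no trailing NUL is required.

      #include <stddef.h>
      #include <string.h>

      const char *lookup(const char *key, size_t len) {
          if (len < 2 || len > 5) return NULL;
          switch (key[0]) {
          case 'h': /* "http" or "https" */
              if (len == 5 && memcmp(key, "https", 5) == 0) return "value for https";
              if (len == 4 && memcmp(key, "http", 4) == 0)  return "value for http";
              return NULL;
          case 'f': /* "file" or "ftp" */
              if (len == 4 && memcmp(key, "file", 4) == 0) return "value for file";
              if (len == 3 && memcmp(key, "ftp", 3) == 0)  return "value for ftp";
              return NULL;
          case 'w': /* "ws" or "wss" */
              if (len == 3 && memcmp(key, "wss", 3) == 0) return "value for wss";
              if (len == 2 && memcmp(key, "ws", 2) == 0)  return "value for ws";
              return NULL;
          default:
              return NULL;
          }
      }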

      1. Patrick says:

        I may also write the index_get, index_put, and other operations in native C instead of Fexl, but it’s not a bottleneck for me so there’s no hurry. I just think auto-generating C code for a particular index value will be interesting. I may even use the result to do the lookup for the set of standard Fexl functions already written directly in C.

      2. Patrick says:

        OK, I’ve written some code that “compiles” a list of key-value pairs into C code. The keys and values must be strings, and I don’t yet do anything to detect special characters which must be escaped, or multi-line strings.

        Here’s a test suite that tries a few different lists:

        https://github.com/chkoreff/Fexl/blob/master/src/test/index_C.fxl

        Here’s the reference output for that:

        https://github.com/chkoreff/Fexl/blob/master/src/out/index_C

        Here’s the Fexl code which does the actual code generation work:

        https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index/render_C.fxl

        This generates a “lookup” function which takes a pointer to the key chars and a separate length. If you have NUL terminated data you can call strlen before calling lookup.

        As far as I can tell the generated code is “optimal” in the sense that it does the fewest individual character comparisons needed to reach a point at which strncmp yields a definitive answer.

        1. Patrick says:

          By the way, I could probably use “switch” instead of successive comparisons of the same key character, but I figure the optimizer will already have that covered.

        2. Patrick says:

          I restructured the code a little, and here’s the new file where “compile_pairs” is defined:

          https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index_C/context.fxl

  6. hexgrid says:

    This seems like exactly the sort of thing a critbit trie is for.

    https://github.com/agl/critbit

    1. Did you run a benchmark?

      1. hexgrid says:

        Not recently, but I did when I built one into my game engine a few years back; once assembled, a critbit trie does (on average) about log2(n) bit comparisons for n interned strings (which can trivially be made branchless), plus a single string compare when the appropriate leaf node is found.
        It’s not faster for a set containing one or perhaps two strings, but for even five strings it was the fastest method I found, and I did compare against various other methods.
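
        A minimal sketch of the lookup half of such a trie, following the agl/critbit node layout linked below (tree construction omitted): each internal node tests one bit, and the leaf triggers exactly one full comparison.

        #include <cstdint>
        #include <cstring>
        #include <string_view>

        struct CritbitNode {
            bool internal;                 // internal node or leaf?
            uint32_t byte;                 // internal: which byte of the key to inspect
            uint8_t otherbits;             // internal: all bits set except the critical bit
            const CritbitNode *child[2];   // internal: children
            const char *key;               // leaf: the stored string
        };

        bool critbit_contains(const CritbitNode *root, std::string_view s) {
            const CritbitNode *n = root;
            while (n != nullptr && n->internal) {
                uint8_t c = n->byte < s.size() ? uint8_t(s[n->byte]) : 0;  // out-of-range bytes read as 0
                int dir = (1 + (n->otherbits | c)) >> 8;                   // branch-free 0/1 direction
                n = n->child[dir];
            }
            return n != nullptr && s == n->key;                            // one final string compare
        }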

  7. JY says:

    > You might be able to do slightly better if you can tell your compiler that the string you receive is ‘padded’ so that you can read eight bytes safely from it. I could not find a very elegant way to do it

    There’s the [[assume]] attribute in C++23 (https://en.cppreference.com/w/cpp/language/attributes/assume), but currently the only way is to use the compiler builtin functions, e.g.: https://godbolt.org/z/xxK939vca
    I’ve also taken some liberties in rewriting your approach, but the generated assembly should be similar.

    1. Transcribing your interesting code:

      #include <algorithm>
      #include <array>
      #include <bit>
      #include <cstdint>
      #include <ranges>
      #include <string_view>

      static constexpr std::array protocols = { "https", "http", "file", "ftp", "wss", "ws" };

      static constexpr auto string_to_uint64 = [](std::string_view s) constexpr {
          using CharT = decltype(s)::value_type;
          static_assert(sizeof(CharT) == 1);
          std::array<CharT, 8> bytes{};
          const auto copy_size = std::min(s.size(), 8ul);
          std::copy_n(s.cbegin(), copy_size, bytes.begin());
          return std::bit_cast<uint64_t>(bytes);
      };

      static constexpr auto protocol_set = [] {
          std::array<uint64_t, protocols.size()> ps{};
          std::ranges::transform(protocols, ps.begin(), string_to_uint64);
          return ps;
      }();

      bool is_special(std::string_view s) {
          __builtin_assume(s.size() == 8ul);
          const auto as_uint64{ string_to_uint64(s) };
          return std::ranges::count(protocol_set, as_uint64);
      }

  8. Steinar H. Gunderson says:

    I’ve worked on Chromium’s HTML and CSS parsers, and in general, this is an unsolved problem. I’ve seen all of these, depending on the number of strings in the set, whether the length is available, and other circumstances:

    Switch on first letter, test with strcmp/memcmp
    Switch on length, test with strcmp/memcmp
    Test first + last letter of word as a quick filter, test with strcmp/memcmp (a sketch follows this comment)
    Perfect hashing (through gperf)
    Hash table lookup
    Check four or eight bytes at a time (as in this post)

    Probably others I forgot as well. One notable thing I haven’t seen is special SIMD instructions. I know Intel wanted people to use PCMPESTRI and such in the past, but it’s a really slow instruction and SSE4.2 is a bit too new (alas!).
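
    A sketch combining two of the filters from the list above, switching on length and then checking the first + last letter before the full comparison (using the protocol set from the post):

    #include <string_view>

    bool is_special(std::string_view s) {
        if (s.size() < 2 || s.size() > 5) return false;
        const char first = s.front(), last = s.back();
        switch (s.size()) {
        case 2: return first == 'w' && last == 's' && s == "ws";
        case 3: return (first == 'f' && last == 'p' && s == "ftp")
                    || (first == 'w' && last == 's' && s == "wss");
        case 4: return (first == 'h' && last == 'p' && s == "http")
                    || (first == 'f' && last == 'e' && s == "file");
        case 5: return first == 'h' && last == 's' && s == "https";
        default: return false;
        }
    }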

  9. Chris O'Hara says:

    We took a similar approach in our JSON decoder. We found that 8 bytes was too small, but 16 bytes was just right.

    See:
    https://github.com/segmentio/asm/pull/57 (AMD64)
    https://github.com/segmentio/asm/pull/65 (ARM64)
    https://github.com/segmentio/encoding/pull/101
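
    A portable sketch of the same idea widened to 16 bytes (the linked code is Go assembly; the names and the key set here are placeholders):

    #include <cstdint>
    #include <cstring>
    #include <string_view>

    struct Bytes16 { uint64_t lo, hi; };

    // Zero-padded 16-byte load; compilers typically turn the comparison into two
    // 64-bit compares or one SIMD compare.
    static Bytes16 load16(std::string_view s) {
        unsigned char buf[16] = {0};
        std::memcpy(buf, s.data(), s.size() < 16 ? s.size() : 16);
        Bytes16 b;
        std::memcpy(&b, buf, sizeof b);
        return b;
    }

    static bool same(Bytes16 a, Bytes16 b) { return a.lo == b.lo && a.hi == b.hi; }

    bool is_known_key(std::string_view s) {
        static const std::string_view keys[] = {"https", "http", "file", "ftp", "wss", "ws"};
        const Bytes16 x = load16(s);
        bool found = false;
        for (auto k : keys) found |= same(x, load16(k));   // branch-free accumulation
        return found;
    }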

  10. foobar says:

    Here’s a simple perfect hashing solution which seems to beat all variants mentioned in the blog post on my M1. If one had vector 64-bit multiply-high and variable vector shuffle instructions, this could even be massaged to work on vectors of padded strings:

    const uint8_t table[64] = {
        'w', 's', 0, 0, 0, 0, 0, 0,
        'f', 't', 'p', 0, 0, 0, 0, 0,
        'w', 's', 's', 0, 0, 0, 0, 0,
        'f', 'i', 'l', 'e', 0, 0, 0, 0,
        'h', 't', 't', 'p', 0, 0, 0, 0,
        'h', 't', 't', 'p', 's', 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, /* comparison always fails */
        0, 0, 0, 0, 0, 0, 0, 0  /* comparison always fails */
    };

    bool mulhi_is_special(std::string_view input) {
        uint64_t inputu = string_to_uint64(input);

        return *(uint64_t *)(table +
                   (((inputu * (unsigned __int128)39114211812342) >> 64) & 0x38)) ==
               inputu;
    }

    1. foobar says:

      The constant above was found by random brute-force search, and is actually not very easy to construct in ARM assembly. I tried finding constants which would be nicer to load and found them, but interestingly enough this made practically no difference, at least on the M1.
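
      A sketch of what that brute-force search could look like (my reconstruction, not foobar's code), assuming the same string_to_uint64 padding and the same hash (high bits of a 128-bit product, masked to one of eight 8-byte slots):

      #include <array>
      #include <cstdint>
      #include <cstdio>
      #include <cstring>
      #include <random>
      #include <string_view>

      static uint64_t string_to_uint64(std::string_view s) {
          uint64_t v = 0;
          std::memcpy(&v, s.data(), s.size() < 8 ? s.size() : 8);
          return v;
      }

      int main() {
          const std::array<std::string_view, 6> keys = {"https", "http", "file", "ftp", "wss", "ws"};
          std::mt19937_64 rng(12345);
          for (;;) {
              const uint64_t c = rng();
              unsigned used = 0;   // bitmap of the eight table slots already taken
              bool ok = true;
              for (auto k : keys) {
                  const unsigned slot =
                      unsigned((((unsigned __int128)string_to_uint64(k) * c) >> 64) & 0x38) >> 3;
                  if (used & (1u << slot)) { ok = false; break; }
                  used |= 1u << slot;
              }
              if (ok) {   // every key landed in its own slot: c is a usable multiplier
                  std::printf("multiplier: %llu\n", (unsigned long long)c);
                  return 0;
              }
          }
      }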

    2. foobar says:

      By the way, one could also use Intel PEXT (parallel bit extract) to construct a comparison table lookup address in this specific case. There are 285 four-bit bit position subsets for these strings which uniquely identify each string. On those microarchitectures where PEXT is implemented efficiently this might provide even faster checking, but I suspect applicability of the approach doesn’t scale as well as multiplication, not to mention that ARM lacks the corresponding instruction.
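
      A sketch of that idea (x86 with BMI2 only; compile with -mbmi2): search once for a 4-bit mask whose extracted bits separate the keys, then use PEXT to index a 16-entry table. The search and the table construction are my additions, not a measured implementation.

      #include <array>
      #include <cstdint>
      #include <cstring>
      #include <string_view>
      #include <immintrin.h>

      static uint64_t load8(std::string_view s) {
          uint64_t v = 0;
          std::memcpy(&v, s.data(), s.size() < 8 ? s.size() : 8);
          return v;
      }

      static const std::array<std::string_view, 6> keys = {"https", "http", "file", "ftp", "wss", "ws"};

      // Find any mask with exactly four set bits whose extracted bits differ for all keys.
      static uint64_t find_mask() {
          for (int a = 0; a < 64; a++)
              for (int b = a + 1; b < 64; b++)
                  for (int c = b + 1; c < 64; c++)
                      for (int d = c + 1; d < 64; d++) {
                          const uint64_t m = (1ULL << a) | (1ULL << b) | (1ULL << c) | (1ULL << d);
                          unsigned seen = 0;
                          bool ok = true;
                          for (auto k : keys) {
                              const unsigned slot = unsigned(_pext_u64(load8(k), m));
                              if (seen & (1u << slot)) { ok = false; break; }
                              seen |= 1u << slot;
                          }
                          if (ok) return m;
                      }
          return 0;
      }

      static const uint64_t mask = find_mask();

      static const std::array<uint64_t, 16> table = [] {
          std::array<uint64_t, 16> t{};
          t.fill(~0ULL);                                   // sentinel: no zero-padded key equals all ones
          for (auto k : keys) t[_pext_u64(load8(k), mask)] = load8(k);
          return t;
      }();

      bool pext_is_special(std::string_view s) {
          if (s.size() < 2 || s.size() > 5) return false;  // key lengths are 2..5
          const uint64_t x = load8(s);
          return table[_pext_u64(x, mask)] == x;
      }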

    3. foobar says:

      A perfect hash index for a (16-entry) lookup table can also be constructed from a masked xor of two right shifts of inputu, but strangely enough it is slower than the mulhi variant on M1 when inlined (and equivalent in speed when not). It is curious how efficient the wide multipliers have become for these sort of tricks; even replacing a wide multiplication with a two-instruction sequence of constant shifts and a xor may be counterproductive.

  11. Reini Urban says:

    When I encountered this very same problem, I generated code to do a binary search on the padded strings treated as numbers, and compared it to various perfect hash generators.
    https://blogs.perl.org/users/rurban/2014/08/perfect-hashes-and-faster-than-memcmp.html

  12. foobar says:

    The problem of converting an eight-byte NUL-terminated or partial string into a uint64_t where data after the NUL is zeroed should be pretty straightforward with modern vectorisation. The code below vectorises pretty well, at least as a lone function, on Clang 14 on my M1 Mac and with the Intel ICX compiler (for Skylake, for instance):


    #include <stdint.h>
    #include <string.h>

    uint64_t string_to_uint64(const char *str)
    {
        const uint64_t ps = 4096; /* minimum common page size */

        uint8_t outputv[8];
        uint64_t v;
        uint64_t shift;

        /* Slow path for strings that might cross a page boundary: copy at most
           eight bytes, stopping at the NUL. */
        if (__builtin_expect(((intptr_t)str & (ps - 1)) > ps - 8, 0))
        {
            uint64_t buf = 0;

            strncpy((char *)&buf, str, sizeof(buf));
            return buf;
        }

        /* Mark which of the first eight bytes are NUL. */
        for (size_t i = 0; i < 8; i++)
        {
            outputv[i] = str[i] != 0 ? 0 : ~0;
        }

        v = *(uint64_t *)&outputv;
        if (v == 0)
            return *(uint64_t *)str; /* no NUL in the first eight bytes: keep them all
                                        (avoids an undefined shift by 64) */
        shift = __builtin_ctzll(v);  /* 8 * index of the first NUL byte (little-endian) */

        return *(uint64_t *)str & ~(~0ULL << shift);
    }

    On both compilers it generates a fast path of about a dozen instructions, of which about four are just checking for the unlikely need to take the slow path, which handles strings that might cross a page boundary. (That part could still be optimised, but it’s not really the low-hanging fruit.)

    Problems arise when one attempts to use other compilers or to inline this into existing code. Suddenly autovectorisation becomes an effectively impossible task for the compilers to handle gracefully, which is really quite appalling.

    Nonetheless, even if a compiler worked like a charm, this sort of padding is going to take more time than, for instance, my multiply-mask-load check acting on its output, which takes effectively three instructions (with one multiply) to perform the check…

  13. Shawn says:

    The regular expression versions can be sped up a lot (though still a lot slower than the other approaches) by using the pattern https?|ftp|file|wss?, eliminating unused capture groups and reducing backtracking. Using a faster engine like RE2 gives an even bigger speedup. I tested a state machine compiled by re2c, too:

    regex 214.626912 ns/string, matches = 6823
    my regex 146.426463 ns/string, matches = 6823
    boost regex 149.997056 ns/string, matches = 6823
    my boost regex 123.283810 ns/string, matches = 6823
    RE2 regex 75.450312 ns/string, matches = 6823
    my RE2 regex 49.995401 ns/string, matches = 6823
    re2c 9.409947 ns/string, matches = 6823
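
    For example, the tightened pattern with RE2 looks something like this (a sketch; it assumes the RE2 library is available, and RE2::FullMatch anchors the pattern at both ends so no ^…$ is needed):

    #include <re2/re2.h>
    #include <string>

    bool is_special(const std::string &s) {
        static const RE2 pattern("https?|ftp|file|wss?");  // no capture groups to populate
        return RE2::FullMatch(s, pattern);
    }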

    1. foobar says:

      As long as state-space explosion is not an issue, one can translate regular expressions to DFAs. (It is also very hard to do better than a DFA on regular CPUs for general-purpose pattern matching that can’t easily be reduced to hashing.)

      On long strings the bottleneck will be the load latency (though one can theoretically interleave the processing of multiple strings). On short ones with unpredictable length, branch misprediction is likely to be an issue.

      It might actually be beneficial to write automata which always run a fixed number of rounds over the buffer, looping back to the same state on every byte after seeing the NUL terminator. This would allow many more instructions to be in flight.

    2. foobar says:

      I implemented a simple hand-written DFA walker with 14 states. Branchless versions of it run at 1.85 to 2.42 ns per string on my M1 Mac (whereas regex takes around 400 ns, similar to the original blog post).

      The branchy version is predictably slower if the branch predictor doesn’t learn the benchmark pattern, but modern branch predictors can remember surprisingly long sequences…
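
      A table-driven sketch of the same idea (not foobar's 14-state walker): build the transition table from the key set once, then classification is one table load per input byte, with the dead state absorbing mismatches.

      #include <array>
      #include <cstdint>
      #include <initializer_list>
      #include <string_view>
      #include <vector>

      struct Dfa {
          std::vector<std::array<uint16_t, 256>> next;  // next[state][byte]; state 0 is the dead state
          std::vector<uint8_t> accept;

          explicit Dfa(std::initializer_list<std::string_view> keys) {
              add_state();                // state 0: dead state (all transitions loop back to 0)
              add_state();                // state 1: start state
              for (auto k : keys) {
                  uint16_t s = 1;
                  for (unsigned char c : k) {
                      uint16_t t = next[s][c];
                      if (t == 0) { t = add_state(); next[s][c] = t; }
                      s = t;
                  }
                  accept[s] = 1;
              }
          }

          uint16_t add_state() {
              next.emplace_back();
              next.back().fill(0);
              accept.push_back(0);
              return uint16_t(next.size() - 1);
          }

          bool match(std::string_view in) const {
              uint16_t s = 1;
              for (unsigned char c : in) s = next[s][c];  // no data-dependent branches in the loop body
              return accept[s] != 0;
          }
      };

      static const Dfa protocol_dfa{"https", "http", "file", "ftp", "wss", "ws"};

      bool dfa_is_special(std::string_view s) { return protocol_dfa.match(s); }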

  14. foobar says:

    I think the “fast” and “faster” test results are highly dependent on the benchmark working-set size, because the branch predictor can learn the test set, at least on my M1 Mac. When the “simulation” size is increased from 8192 to 65536, their runtimes triple, unlike some other variants in the current repo. See a comparison chart: https://i.imgur.com/KFhuI27.png

    Truly branchless solutions scale much better in this regard, but of course if the input is highly skewed and these functions are sufficiently hot in the predictor, “branchy” alternatives are quite competitive. It’s just that even a single branch can make a function “branchy” in this regard.