Just out of interest, have you tried to measure a regex-based solution as well?
In my tests, regex is 100x slower. Of course, results will vary based on the underlying implementation.
Seems like a good place for a Bloom filter.
Presumably the overwhelming majority of words seen don’t match, so you would win by rejecting non-matches quickly.
A Bloom filter will require hashing the string six or seven times with different hash functions. And it has false positives, so you still have to do a real set-membership test if the filter returns true.
I didn’t see anything in the post stating that matches will be rare, and in the benchmark code 60% of the strings match, so this doesn’t seem like a good trade-off.
A conventional Bloom filter would be overkill here, but a hashing-based technique could work.
There is no need for different hash functions with a Bloom filter. You just mix in a different salt for each iteration.
But if the list of strings is known in advance, as in this example, then gperf would be even simpler.
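For concreteness, a minimal sketch of the salted single-hash idea from the comment above; the mixer, the FNV-1a base hash and the 256-bit filter size are my own choices rather than anything from the post, and the filter is assumed to have been populated with the same probes:

#include <cstdint>
#include <string_view>

// 64-bit mixer (Murmur3 finalizer) used to derive several probes from one hash.
static inline uint64_t mix(uint64_t h) {
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
    return h ^ (h >> 33);
}

// Single base hash of the string (FNV-1a, purely for illustration).
static inline uint64_t base_hash(std::string_view s) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char c : s) h = (h ^ c) * 0x100000001b3ULL;
    return h;
}

// Three probes into a 256-bit filter, each derived by mixing in a different salt.
static bool maybe_in_set(std::string_view s, const uint64_t (&bits)[4]) {
    uint64_t h = base_hash(s);
    for (uint64_t salt = 1; salt <= 3; salt++) {
        uint64_t bit = mix(h ^ salt) & 255;
        if (!(bits[bit >> 6] & (1ULL << (bit & 63)))) return false;
    }
    return true;   // "maybe": Bloom filters can report false positives
}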
I have added gperf to the benchmark. It is competitive but it is not a match for the fast approaches I propose. This being said, it could maybe be made competitive with some extra engineering.
The ‘direct’ example uses bitwise | instead of logical ||, so no short-circuit evaluation? I guess that short-circuiting would bring the timing down to roughly 0.5x of what you have now (amortised, if the needle is uniformly distributed).
The code has both functions. Results are sensitive to the compiler.
I have updated the blog post to include both versions.
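For readers following along, the two variants under discussion look roughly like this (a sketch, not the exact code from the post):

#include <string_view>

// Bitwise |: every comparison is evaluated, no branch per individual test.
bool is_special_bitwise(std::string_view v) {
    return (v == "https") | (v == "http") | (v == "ftp") |
           (v == "file") | (v == "wss") | (v == "ws");
}

// Logical ||: short-circuits, so fewer comparisons on average but more branches.
bool is_special_logical(std::string_view v) {
    return v == "https" || v == "http" || v == "ftp" ||
           v == "file" || v == "wss" || v == "ws";
}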
It is a bit unfair to expect the input strings to be padded to 8 or 4 bytes. In practice, the parameter to memcpy should be input.size() instead of sizeof(uint..), but that seems to tank the performance.
I played a bit with the example (on godbolt: zd9c9YM8b), and the “fast” implementation was the most stable (similar duration over multiple runs). The branchless implementation is consistently the fastest, and the hash version is the sweet spot between clarity and performance.
I also tried a lame implementation of a letter graph, and it seems to be somewhere in between “branchless” and “fast”.
> It is a bit unfair to expect the input strings to be padded to 8 or 4 bytes.
For a general-purpose function, I agree, but if it is part of your own system where you control where the strings come from, then I disagree. Requiring padding to your strings is quite doable if you have fine control over your data structures.
I wonder when perfect hashing becomes the winner.
Wojciech Mula has written a note about a similar problem in the past. He found that using a perfect hash function was the fastest way, and that switching character by character was faster than SWAR techniques like the one you mention in this post.
I.e. see
http://0x80.pl/notesen/2022-01-29-http-verb-parse.html
Damn it. Wojciech is always one step ahead of me.
(For people reading this, Wojciech is a collaborator of mine.)
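The character-by-character switching mentioned above looks roughly like this for the set in this post (a condensed sketch; Mula’s note dispatches on more positions):

#include <string_view>

// Dispatch on the first letter; the compiler can turn this into a jump table,
// and only two short comparisons remain per branch.
bool is_special_switch(std::string_view v) {
    if (v.empty()) return false;
    switch (v[0]) {
        case 'h': return v == "http" || v == "https";
        case 'f': return v == "ftp" || v == "file";
        case 'w': return v == "ws" || v == "wss";
        default:  return false;
    }
}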
In my functional programming language Fexl I’m using an “index” technique which identifies maximal common prefixes among keys, yielding the optimal sequence of individual three-way character comparisons to execute before doing a final definitive full key comparison.
Here are the operations index_get, index_put, index_del, index_pairs:
https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index/index.fxl
Here is a test suite:
https://github.com/chkoreff/Fexl/blob/master/src/test/b15.fxl
My next step is to write a Fexl program which, when given a set of key-value pairs, automatically writes an optimal C function which looks up the value associated with a key. By “optimal” I mean using the index technique, not using a hash function. It would write the C function to be given the length of the key up front, not relying on trailing NUL.
I may also write the index_get, index_put, and other operations in native C instead of Fexl, but it’s not a bottleneck for me so there’s no hurry. I just think auto-generating C code for a particular index value will be interesting. I may even use the result to do the lookup for the set of standard Fexl functions already written directly in C.
OK, I’ve written some code that “compiles” a list of key-value pairs into C code. The keys and values must be strings, and I don’t yet do anything to detect special characters which must be escaped, or multi-line strings.
Here’s a test suite that tries a few different lists:
https://github.com/chkoreff/Fexl/blob/master/src/test/index_C.fxl
Here’s the reference output for that:
https://github.com/chkoreff/Fexl/blob/master/src/out/index_C
Here’s the Fexl code which does the actual code generation work:
https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index/render_C.fxl
This generates a “lookup” function which takes a pointer to the key chars and a separate length. If you have NUL terminated data you can call strlen before calling lookup.
As far as I can tell the generated code is “optimal” in the sense that it does the fewest number of individual character comparisons needed to reach a point at which strncmp yields a definitive answer.
By the way, I could probably use “switch” instead of successive comparisons of the same key character, but I figure the optimizer will already have that covered.
I restructured the code a little, and here’s the new file where “compile_pairs” is defined:
https://github.com/chkoreff/Fexl/blob/master/src/test/lib/index_C/context.fxl
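To give a feel for the shape of the output, here is a hand-written approximation of what such a generated lookup might look like for the protocol strings in this post; the keys, values and structure are hypothetical, not the actual render_C output:

#include <cstddef>
#include <cstring>

// Compare a few distinguishing characters and the length, then let one
// strncmp give the definitive answer. The values are made up for the example.
const char *lookup(const char *key, size_t len) {
    if (len < 2 || len > 5) return nullptr;
    if (key[0] == 'f') {
        if (len == 3) return std::strncmp(key, "ftp", 3) == 0 ? "value_ftp" : nullptr;
        if (len == 4) return std::strncmp(key, "file", 4) == 0 ? "value_file" : nullptr;
        return nullptr;
    }
    if (key[0] == 'h') {
        if (len == 4) return std::strncmp(key, "http", 4) == 0 ? "value_http" : nullptr;
        if (len == 5) return std::strncmp(key, "https", 5) == 0 ? "value_https" : nullptr;
        return nullptr;
    }
    if (key[0] == 'w') {
        if (len == 2) return std::strncmp(key, "ws", 2) == 0 ? "value_ws" : nullptr;
        if (len == 3) return std::strncmp(key, "wss", 3) == 0 ? "value_wss" : nullptr;
        return nullptr;
    }
    return nullptr;
}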
This seems like exactly the sort of thing a critbit trie is for.
https://github.com/agl/critbit
Did you run a benchmark?
Not recently, but I did when I built one into my game engine a few years back; once assembled, a critbit trie does on average about log2(N) single-bit comparisons for N interned strings (each of which can trivially be made branchless), plus a single string compare once the appropriate leaf node is found.
It’s not faster for a set containing one or perhaps two strings, but for even five strings it was the fastest method I found, and I did compare against various other methods.
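For readers unfamiliar with the structure, the lookup is roughly as follows (a simplified sketch after agl’s critbit; the real implementation tags pointers instead of storing a flag):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified critbit node: internal nodes record which byte and which bit
// decide the branch; leaves hold the interned string.
struct CBNode {
    bool internal;
    uint32_t byte;        // byte position examined
    uint8_t otherbits;    // all bits set except the critical one
    CBNode *child[2];
    const char *key;      // valid when !internal
};

// One bit test per internal node (about log2(N) of them), then a single
// full string comparison at the leaf.
bool critbit_contains(const CBNode *n, const char *u, size_t ulen) {
    if (!n) return false;
    while (n->internal) {
        uint8_t c = n->byte < ulen ? (uint8_t)u[n->byte] : 0;
        int direction = (1 + (n->otherbits | c)) >> 8;
        n = n->child[direction];
    }
    return std::strcmp(u, n->key) == 0;
}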
> You might be able to do slightly better if you can tell your compiler that the string you receive is ‘padded’ so that you can read eight bytes safely from it. I could not find a very elegant way to do it
There’s the [[assume]] attribute in C++23 (https://en.cppreference.com/w/cpp/language/attributes/assume) but currently the only way is to use the compiler builtin functions, e.g.: https://godbolt.org/z/xxK939vca
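A minimal sketch of that route; which condition is worth asserting depends on how the caller guarantees the padding, and the helper below is only an illustration:

#include <cstdint>
#include <cstring>
#include <string_view>

// Portable-ish "assume": in C++23 one can write [[assume(cond)]]; directly.
inline void assume_that(bool cond) {
#if defined(__clang__)
    __builtin_assume(cond);
#elif defined(__GNUC__)
    if (!cond) __builtin_unreachable();
#else
    (void)cond;
#endif
}

// Hypothetical use: promise the compiler the key is at most one word long so
// it can drop the bounds branch from the copy below.
uint64_t first_word(std::string_view input) {
    assume_that(input.size() <= 8);
    uint64_t x = 0;
    std::memcpy(&x, input.data(), input.size() <= 8 ? input.size() : 8);
    return x;
}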
I’ve also taken some liberties in rewriting your approach, but the generated assembly should be similar.
Transcribing your interesting code:
I’ve worked on Chromium’s HTML and CSS parsers, and in general, this is an unsolved problem. I’ve seen all of these, depending on the number of strings in the set, whether the length is available, and other circumstances:
Switch on first letter, test with strcmp/memcmp
Switch on length, test with strcmp/memcmp
Test first + last letter of word as a quick filter, test with strcmp/memcmp
Perfect hashing (through gperf)
Hash table lookup
Check four or eight bytes at a time (as in this post)
Probably others I forgot as well. One notable thing I haven’t seen is special SIMD instructions. I know Intel wanted people to use PCMPESTRI and such in the past, but it’s a really slow instruction and SSE4.2 is a bit too new (alas!).
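As an illustration of the “switch on length” item in the list above, a sketch for the protocol set from this post (not Chromium’s actual code):

#include <cstring>
#include <string_view>

// Switch on the length, then one memcmp per candidate of that length.
bool is_special_by_length(std::string_view v) {
    switch (v.size()) {
        case 2: return std::memcmp(v.data(), "ws", 2) == 0;
        case 3: return std::memcmp(v.data(), "ftp", 3) == 0 ||
                       std::memcmp(v.data(), "wss", 3) == 0;
        case 4: return std::memcmp(v.data(), "http", 4) == 0 ||
                       std::memcmp(v.data(), "file", 4) == 0;
        case 5: return std::memcmp(v.data(), "https", 5) == 0;
        default: return false;
    }
}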
We took a similar approach in our JSON decoder. We found that 8 bytes was too small, but 16 bytes was just right.
See:
– https://github.com/segmentio/asm/pull/57 (AMD64)
– https://github.com/segmentio/asm/pull/65 (ARM64)
– https://github.com/segmentio/encoding/pull/101
Here’s a simple perfect hashing solution which seems to beat all variants mentioned in the blog post on my M1. If vectored 64-bit multiply-high and variable vector shuffle instructions were available, this could even be massaged to work on vectors of padded strings:
const uint8_t table[64] =
{
    'w', 's', 0, 0, 0, 0, 0, 0,
    'f', 't', 'p', 0, 0, 0, 0, 0,
    'w', 's', 's', 0, 0, 0, 0, 0,
    'f', 'i', 'l', 'e', 0, 0, 0, 0,
    'h', 't', 't', 'p', 0, 0, 0, 0,
    'h', 't', 't', 'p', 's', 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, /* comparison always fails */
    0, 0, 0, 0, 0, 0, 0, 0  /* comparison always fails */
};

bool mulhi_is_special(std::string_view input) {
    uint64_t inputu = string_to_uint64(input);
    /* bits 3..5 of the high half of the 128-bit product pick one of the
       eight 8-byte rows above */
    return *(uint64_t *)(table +
        (((inputu * (unsigned __int128)39114211812342) >> 64) & 0x38)) == inputu;
}
The constant above was found by random brute-force search, and is actually not very easy to materialise in ARM assembly. I tried finding constants which would be nicer to load, and found them, but interestingly enough this made practically no difference, at least on the M1.
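The search itself can be as simple as the sketch below (my reconstruction, not the author’s actual search code): draw random constants until each of the six padded strings lands in its own 8-byte slot. Any constant it prints would work in place of the one above, with the table rows rearranged to match.

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <random>

static uint64_t pad8(const char *s) {   /* padded string as a little-endian word */
    uint64_t x = 0;
    std::memcpy(&x, s, std::strlen(s));
    return x;
}

/* Slot = bits 3..5 of the high half of a 64x64 -> 128-bit multiply. */
static unsigned slot(uint64_t key, uint64_t c) {
    return (unsigned)((((unsigned __int128)key * c) >> 64) & 0x38) >> 3;
}

int main() {
    const uint64_t keys[6] = {pad8("ws"),   pad8("ftp"),  pad8("wss"),
                              pad8("file"), pad8("http"), pad8("https")};
    std::mt19937_64 rng(12345);
    for (;;) {
        uint64_t c = rng();
        unsigned used = 0;
        bool ok = true;
        for (uint64_t k : keys) {
            unsigned s = slot(k, c);
            if (used & (1u << s)) { ok = false; break; }
            used |= 1u << s;
        }
        if (ok) {
            std::printf("found constant: %llu\n", (unsigned long long)c);
            return 0;
        }
    }
}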
By the way, one could also use Intel PEXT (parallel bit extract) to construct a comparison table lookup address in this specific case. There are 285 four-bit bit position subsets for these strings which uniquely identify each string. On those microarchitectures where PEXT is implemented efficiently this might provide even faster checking, but I suspect applicability of the approach doesn’t scale as well as multiplication, not to mention that ARM lacks the corresponding instruction.
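A rough sketch of that idea; the four bit positions below (byte 0 bits 0 and 3, byte 2 bit 4, byte 4 bit 0) are one choice I worked out by hand for this comment, not necessarily one of the 285 mentioned above:

#include <array>
#include <cstdint>
#include <cstring>
#include <initializer_list>
#include <immintrin.h>   // _pext_u64, needs BMI2 (compile with -mbmi2)

static const uint64_t kMask = 0x0000000100100009ULL;   // the four chosen bits

static uint64_t pad8(const char *s) {                  // padded little-endian word
    uint64_t x = 0;
    std::memcpy(&x, s, std::strlen(s));
    return x;
}

// 16-entry table indexed by the extracted bits; empty slots always fail.
static const std::array<uint64_t, 16> kTable = [] {
    std::array<uint64_t, 16> t{};
    for (const char *s : {"ws", "ftp", "wss", "file", "http", "https"})
        t[_pext_u64(pad8(s), kMask)] = pad8(s);
    return t;
}();

bool pext_is_special(uint64_t inputu) {
    return kTable[_pext_u64(inputu, kMask)] == inputu;
}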
A perfect hash index for a (16-entry) lookup table can also be constructed from a masked xor of two right shifts of inputu, but strangely enough it is slower than the mulhi variant on the M1 when inlined (and equivalent in speed when not). It is curious how efficient the wide multipliers have become for this sort of trick; even replacing a wide multiplication with a two-instruction sequence of constant shifts and a xor may be counterproductive.
When I encountered this very same problem, I generated code to do a binary search on the padded strings treated as numbers, and compared it to various perfect hash generators.
https://blogs.perl.org/users/rurban/2014/08/perfect-hashes-and-faster-than-memcmp.html
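The binary-search-on-numbers idea looks roughly like this for the set in this post (a sketch, not the generated code from the link):

#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>

static uint64_t pad8(const char *s) {   // padded string viewed as a number
    uint64_t x = 0;
    std::memcpy(&x, s, std::strlen(s));
    return x;
}

static const std::array<uint64_t, 6> kSorted = [] {
    std::array<uint64_t, 6> k = {pad8("ws"),   pad8("ftp"),  pad8("wss"),
                                 pad8("file"), pad8("http"), pad8("https")};
    std::sort(k.begin(), k.end());
    return k;
}();

// About three comparisons instead of a hash; a code generator would unroll this.
bool is_special_binary_search(uint64_t inputu) {
    return std::binary_search(kSorted.begin(), kSorted.end(), inputu);
}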
The problem of converting an eight-byte NUL-terminated or partial string into a uint64_t where the data after the NUL is zeroed should be pretty straightforward with modern vectorisation. The code below vectorises pretty well, at least as a lone function, on Clang 14 on my M1 Mac and with the Intel ICX compiler (for Skylake, for instance):
#include <stdint.h>
#include <string.h>

uint64_t string_to_uint64(const char *str)
{
    const uint64_t ps = 4096; /* minimum common page size */
    uint8_t outputv[8];
    uint64_t v;
    if (__builtin_expect(((intptr_t)str & (ps - 1)) > ps - 8, 0))
    {
        /* slow path: an eight-byte load here might cross a page boundary */
        uint64_t buf = 0;
        strncpy((char *)&buf, str, sizeof(buf));
        return buf;
    }
    /* mark each NUL byte with 0xff so the terminator can be located */
    for (size_t i = 0; i < 8; i++)
    {
        outputv[i] = str[i] != 0 ? 0 : ~0;
    }
    v = *(uint64_t *)&outputv;
    /* keep only the bytes before the first NUL (all of them if there is none);
       this form avoids an undefined shift by 64 */
    uint64_t mask = v ? ~(~0ULL << __builtin_ctzll(v)) : ~0ULL;
    return *(uint64_t *)str & mask;
}
On both compilers it generates a fast path of about a dozen instructions, of which roughly four just check for the unlikely case where the slow path is needed, namely strings that might cross a page boundary. (That part could still be optimized, but it’s not really the low-hanging fruit.)
Problems arise when one attempts to use other compilers or to inline this into existing code. Suddenly autovectorisation becomes an effectively impossible task for the compilers to handle gracefully, which is really quite appalling.
Nonetheless, even if a compiler worked like a charm, this sort of padding is going to take more time than, for instance, my multiply-mask-load check acting on its output, which takes effectively three instructions (with one multiply) to perform the check…
The regular expression versions can be sped up a lot (though still a lot slower than the other approaches) by using the pattern https?|ftp|file|wss?, eliminating unused capture groups and reducing backtracking. And using a faster engine like RE2 gives an even bigger speedup. I tested a state machine compiled by re2c, too.
As long as state space explosion is not an issue, one can translate regular expressions to DFAs. (Also, on regular CPUs it is very hard to be more performant than this for general-purpose pattern matching that cannot easily be reduced to hashing.)
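For reference, the tightened pattern from the comment above in use; a sketch with std::regex purely for illustration (RE2 or a re2c-compiled matcher would be much faster, as noted):

#include <regex>
#include <string>

// Full-match against an alternation with no capture groups.
bool is_special_regex(const std::string &s) {
    static const std::regex re("https?|ftp|file|wss?", std::regex::optimize);
    return std::regex_match(s, re);
}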
On long strings the bottleneck will be the load latency (though one can theoretically interleave the processing of multiple strings). On short ones with unpredictable lengths, branch misprediction is likely to be an issue.
It might actually be beneficial to write automata which always run a fixed number of rounds over the buffer, looping back to the same state on every byte after seeing the NUL terminator. This would allow many more instructions to be in flight.
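A sketch of that shape of automaton for the set in this post, assuming 8-byte zero-padded input; the state numbering and table construction are my own, not the hand-written 14-state walker mentioned below:

#include <array>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// State 0 is dead and loops on everything; state 1 accepts and loops on the
// zero padding bytes; state 2 is the start; the rest form a trie.
static std::vector<std::array<uint16_t, 256>> build_dfa() {
    std::vector<std::array<uint16_t, 256>> t(3);
    t[1][0] = 1;                              // accept state loops on padding
    for (const char *s : {"ws", "ftp", "wss", "file", "http", "https"}) {
        uint16_t state = 2;
        for (; *s; s++) {
            uint8_t c = (uint8_t)*s;
            if (t[state][c] == 0) {
                t.push_back({});
                t[state][c] = (uint16_t)(t.size() - 1);
            }
            state = t[state][c];
        }
        t[state][0] = 1;                      // NUL terminator -> accept
    }
    return t;
}

// Always run exactly eight rounds: no data-dependent branches, so several
// strings can be processed in flight at once.
bool dfa_is_special(const char *padded8) {
    static const auto table = build_dfa();
    uint16_t state = 2;
    for (size_t i = 0; i < 8; i++)
        state = table[state][(uint8_t)padded8[i]];
    return state == 1;
}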
I implemented a simple hand-written DFA walker with 14 states. Branchless versions of it run at 1.85 to 2.42 ns per string on my M1 Mac (where regex takes around 400 ns, similar to the original blog post).
The branchy version is predictably slower if the branch predictor doesn’t learn the benchmark pattern, but modern branch predictors can remember surprisingly long sequences…
I think the “fast” and “faster” test results are highly dependent on the benchmark’s working-set size, because the branch predictor can learn the test set, at least on my M1 Mac. When the “simulation” size is increased from 8192 to 65536, their runtimes triple, unlike some other variants in the current repo. See a comparison chart: https://i.imgur.com/KFhuI27.png
Truly branchless solutions scale much better in this regard, but of course if the input is highly skewed and these functions are sufficiently hot in the predictor, “branchy” alternatives are quite competitive. It’s just that even a single branch can make a function “branchy” in this regard.