The pragmatic approach is to look for the first 2 “distinct” characters in the string (i.e. if the string is “aabb” use “ab” for a contiguous 2-character search, not “aa” or “bb”) and follow up with a simple AND/CMP type check on a hit. This runs ‘fast enough’ – single-string search is rarely a bottleneck in Hyperscan.
Branches are expensive, so comparing two bytes instead of one before the first branch makes sense. Selecting the most uncommon bytes from the needle reduces the branches taken even further.
Mischa Sandbergsays:
Agreed on looking for two chars. Tried various schemes of choosing them (including the above) and performance varied little; memory bandwidth and cache line boundaries seemed to rule. Settled on the two chars whose first occurrences in the pattern string were the closest to the end.
Matt Palmersays:
See the EPSM search algorithm for a very fast SIMD based search.
In my functional programming language Fexl I’ve been using the following naive version. Note that I use a “string” structure which is a length followed by the actual bytes, so I could have a string of 1000 NULs for example.
/* Search x for the first occurrence of y, starting at offset. Return the
position within x where y was found. If the position returned is past the end
of x, then y was not found. */
unsigned long str_search(string x, string y, unsigned long offset)
{
unsigned long xn = x->len;
unsigned long yn = y->len;
/* Avoid unnecessary work if a match is impossible based on lengths. */
if (xn < yn || offset > xn – yn) return xn;
{
char *xs = x->data;
char *ys = y->data;
unsigned long xi = offset;
unsigned long yi = 0;
while (1)
{
if (yi >= yn) return xi – yi; /* found */
if (xi >= xn) return xn; /* not found */
if (xs[xi] == ys[yi])
yi++;
else
{
xi -= yi;
yi = 0;
}
xi++;
}
}
}
(Tip of the hat for tohtml.com by the way.)
I perused the source code for strstr and it’s pretty amazing. I was surprised to see actual calls to memcmp in it, though I suppose they’re leveraging on some known efficiencies in there.
weinengsays:
your avx2/avx256 implementation requires the CPU to implement avx-512, which my machine did not have. Hacked around with the following solution for avx2. Note that masking don’t play well with avx2 and chars. What I did was to fall back to memcmp for the remaining few comparisons. If anyone could figure out a way to not fall back to memcmp using only avx2 instructions, lmk!
inline const char *find_avx256(const char *in, size_t len, const char *needle,
size_t needle_len) {
size_t i = 0;
for (; i + 32 + needle_len – 1 < len; i += 32) {
__m256i comparator = _mm256_set1_epi8(needle[0]);
__m256i input = _mm256_load_si256(reinterpret_cast(in + i));
int matches = _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
for (size_t char_index = 1; matches && char_index < needle_len; char_index++) {
comparator = _mm256_set1_epi8(needle[char_index]);
input = _mm256_load_si256(reinterpret_cast(in + i + char_index));
matches &= _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
}
if(matches) {
return in + i + __builtin_ctz(matches);
}
}
while (i + needle_len – 1 < len) {
if (memcmp(in + i, needle, needle_len) == 0) {
return in + i;
}
++i;
}
return nullptr;
}
Ben Bridgwatersays:
Not all implementations of strstr() are going to be the same, but certainly a naieve “advance one character if match fails at this position” is far from optimal.
A much smarter string search algorithm is Boyer-Moore which skips past positions where a match cannot exist.
We do something similar in Hyperscan – entry point for the most heavily used single string search is at:
https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hwlm/noodle_engine.c#L221
The pragmatic approach is to look for the first 2 “distinct” characters in the string (i.e. if the string is “aabb” use “ab” for a contiguous 2-character search, not “aa” or “bb”) and follow up with a simple AND/CMP type check on a hit. This runs ‘fast enough’ – single-string search is rarely a bottleneck in Hyperscan.
It has been a while since I read it, but https://blog.burntsushi.net/ripgrep/ had a few neat tricks.
Branches are expensive, so comparing two bytes instead of one before the first branch makes sense. Selecting the most uncommon bytes from the needle reduces the branches taken even further.
Agreed on looking for two chars. Tried various schemes of choosing them (including the above) and performance varied little; memory bandwidth and cache line boundaries seemed to rule. Settled on the two chars whose first occurrences in the pattern string were the closest to the end.
See the EPSM search algorithm for a very fast SIMD based search.
https://www.sciencedirect.com/science/article/pii/S1570866714000471
In my functional programming language Fexl I’ve been using the following naive version. Note that I use a “string” structure which is a length followed by the actual bytes, so I could have a string of 1000 NULs for example.
/* Search x for the first occurrence of y, starting at offset. Return the
position within x where y was found. If the position returned is past the end
of x, then y was not found. */
unsigned long str_search(string x, string y, unsigned long offset)
{
unsigned long xn = x->len;
unsigned long yn = y->len;
/* Avoid unnecessary work if a match is impossible based on lengths. */
if (xn < yn || offset > xn – yn) return xn;
{
char *xs = x->data;
char *ys = y->data;
unsigned long xi = offset;
unsigned long yi = 0;
while (1)
{
if (yi >= yn) return xi – yi; /* found */
if (xi >= xn) return xn; /* not found */
if (xs[xi] == ys[yi])
yi++;
else
{
xi -= yi;
yi = 0;
}
xi++;
}
}
}
(Tip of the hat for tohtml.com by the way.)
I perused the source code for strstr and it’s pretty amazing. I was surprised to see actual calls to memcmp in it, though I suppose they’re leveraging on some known efficiencies in there.
your avx2/avx256 implementation requires the CPU to implement avx-512, which my machine did not have. Hacked around with the following solution for avx2. Note that masking don’t play well with avx2 and chars. What I did was to fall back to memcmp for the remaining few comparisons. If anyone could figure out a way to not fall back to memcmp using only avx2 instructions, lmk!
inline const char *find_avx256(const char *in, size_t len, const char *needle,
size_t needle_len) {
size_t i = 0;
for (; i + 32 + needle_len – 1 < len; i += 32) {
__m256i comparator = _mm256_set1_epi8(needle[0]);
__m256i input = _mm256_load_si256(reinterpret_cast(in + i));
int matches = _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
for (size_t char_index = 1; matches && char_index < needle_len; char_index++) {
comparator = _mm256_set1_epi8(needle[char_index]);
input = _mm256_load_si256(reinterpret_cast(in + i + char_index));
matches &= _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
}
if(matches) {
return in + i + __builtin_ctz(matches);
}
}
while (i + needle_len – 1 < len) {
if (memcmp(in + i, needle, needle_len) == 0) {
return in + i;
}
++i;
}
return nullptr;
}
Not all implementations of strstr() are going to be the same, but certainly a naieve “advance one character if match fails at this position” is far from optimal.
A much smarter string search algorithm is Boyer-Moore which skips past positions where a match cannot exist.
https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/
The strstr function in glibc uses an optimized Boyer-Moore-Horspool algorithm.