15th December 2022, 7 min read

Checking for the absence of a string, naive AVX-512 edition

8 thoughts on “Checking for the absence of a string, naive AVX-512 edition”

Geoff Langdale says:

December 16, 2022 at 2:19 am

We do something similar in Hyperscan – entry point for the most heavily used single string search is at:

https://github.com/intel/hyperscan/blob/64a995bf445d86b74eb0f375624ffc85682eadfe/src/hwlm/noodle_engine.c#L221

The pragmatic approach is to look for the first 2 “distinct” characters in the string (i.e. if the string is “aabb” use “ab” for a contiguous 2-character search, not “aa” or “bb”) and follow up with a simple AND/CMP type check on a hit. This runs ‘fast enough’ – single-string search is rarely a bottleneck in Hyperscan.
1. Joern Engel says:
  
  December 17, 2022 at 8:07 am
  
  It has been a while since I read it, but https://blog.burntsushi.net/ripgrep/ had a few neat tricks.
  
  Branches are expensive, so comparing two bytes instead of one before the first branch makes sense. Selecting the most uncommon bytes from the needle reduces the branches taken even further.
2. Mischa Sandberg says:
  
  December 19, 2022 at 9:37 pm
  
  Agreed on looking for two chars. Tried various schemes of choosing them (including the above) and performance varied little; memory bandwidth and cache line boundaries seemed to rule. Settled on the two chars whose first occurrences in the pattern string were the closest to the end.
Matt Palmer says:

December 16, 2022 at 12:52 pm

See the EPSM search algorithm for a very fast SIMD based search.

https://www.sciencedirect.com/science/article/pii/S1570866714000471
Patrick Chkoreff says:

December 16, 2022 at 3:47 pm

In my functional programming language Fexl I’ve been using the following naive version. Note that I use a “string” structure which is a length followed by the actual bytes, so I could have a string of 1000 NULs for example.

/* Search x for the first occurrence of y, starting at offset. Return the
position within x where y was found. If the position returned is past the end
of x, then y was not found. */
unsigned long str_search(string x, string y, unsigned long offset)
{
unsigned long xn = x->len;
unsigned long yn = y->len;

/* Avoid unnecessary work if a match is impossible based on lengths. */
if (xn < yn || offset > xn – yn) return xn;

{
char *xs = x->data;
char *ys = y->data;
unsigned long xi = offset;
unsigned long yi = 0;
while (1)
{
if (yi >= yn) return xi – yi; /* found */
if (xi >= xn) return xn; /* not found */

if (xs[xi] == ys[yi])
yi++;
else
{
xi -= yi;
yi = 0;
}

xi++;
}
}
}

(Tip of the hat for tohtml.com by the way.)

I perused the source code for strstr and it’s pretty amazing. I was surprised to see actual calls to memcmp in it, though I suppose they’re leveraging on some known efficiencies in there.
weineng says:

December 17, 2022 at 6:12 am

your avx2/avx256 implementation requires the CPU to implement avx-512, which my machine did not have. Hacked around with the following solution for avx2. Note that masking don’t play well with avx2 and chars. What I did was to fall back to memcmp for the remaining few comparisons. If anyone could figure out a way to not fall back to memcmp using only avx2 instructions, lmk!

inline const char *find_avx256(const char *in, size_t len, const char *needle,
size_t needle_len) {
size_t i = 0;
for (; i + 32 + needle_len – 1 < len; i += 32) {
__m256i comparator = _mm256_set1_epi8(needle[0]);
__m256i input = _mm256_load_si256(reinterpret_cast(in + i));
int matches = _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
for (size_t char_index = 1; matches && char_index < needle_len; char_index++) {
comparator = _mm256_set1_epi8(needle[char_index]);
input = _mm256_load_si256(reinterpret_cast(in + i + char_index));
matches &= _mm256_movemask_epi8(_mm256_cmpeq_epi8(comparator, input));
}
if(matches) {
return in + i + __builtin_ctz(matches);
}
}
while (i + needle_len – 1 < len) {
if (memcmp(in + i, needle, needle_len) == 0) {
return in + i;
}
++i;
}
return nullptr;
}
Ben Bridgwater says:

December 20, 2022 at 6:12 pm

Not all implementations of strstr() are going to be the same, but certainly a naieve “advance one character if match fails at this position” is far from optimal.

A much smarter string search algorithm is Boyer-Moore which skips past positions where a match cannot exist.

https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/
1. Daniel Lemire says:
  
  December 20, 2022 at 7:49 pm
  
  The strstr function in glibc uses an optimized Boyer-Moore-Horspool algorithm.