Daniel Lemire's blog

For case-insensitive string comparisons, avoid char-by-char functions

13 thoughts on “For case-insensitive string comparisons, avoid char-by-char functions”

  1. seebs says:

    Long ago, in the hot path of a performance-critical application, I found a case-insensitive string compare that worked by calling strdup on both strings, then case-smashing them, then comparing the results, then freeing the strings. The typical use case was iterating through a large list of strings, comparing a reference string to each of them, as a table lookup.

    At the time, I thought per-character comparison would be better. (There wasn’t, yet, a library function for this in that environment.) I think it might have been, given the allocation overhead. These days, I’d have had the objects store a smashed-case version of their strings and compare those.
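
    For illustration, the “store a smashed-case copy” idea might look like the ASCII-only sketch below (my own illustration with made-up names, not seebs’ actual code): the key is lowercased once when the object is created, so every later lookup is a plain strcmp with no allocation.

    ```c
    #include <ctype.h>
    #include <string.h>

    /* Hypothetical record that stores a pre-lowercased ("smashed") copy of
     * its key next to the original, so lookups never re-fold the key. */
    struct entry {
        char *name;        /* original spelling */
        char *name_folded; /* lowercased once, at insertion time */
    };

    /* Lowercase a copy of s (ASCII only); the caller frees the result. */
    static char *ascii_tolower_dup(const char *s) {
        char *copy = strdup(s);
        if (copy != NULL) {
            for (char *p = copy; *p != '\0'; p++) {
                *p = (char)tolower((unsigned char)*p);
            }
        }
        return copy;
    }

    /* Lookup: fold the query string once, then compare exactly. */
    static int entry_matches(const struct entry *e, const char *folded_query) {
        return strcmp(e->name_folded, folded_query) == 0;
    }
    ```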

    1. Note that my post starts with the ASCII context. None of these approaches will work with UTF-8, I think.

      If you assume ASCII, I am not too sure why you’d need to do memory allocations.

      If you need to support Unicode, you have other problems that this blog post does not cover. Obviously.

      1. degski says:

        Indeed, what someone once explained to me is that this task (with Unicode) is impossible as a matter of principle. There are situations where the only way to know what the lowercase form is, is to have more context. Let’s just stick to ASCII and English.

      2. KWillets says:

        The Unicode Collation Algorithm is anything but an algorithm: it seems to be a list of problems that an algorithm should solve, not a prescriptive method.

        One approach that seems relevant here is to bet that characters will be binary equal more often than they are case-insensitive equal; you use a strcmp-type loop to find the common prefix and then do the expensive collation only on the first code points that mismatch. If you’ve bet correctly, the first mismatch will collate into inequality; otherwise you continue (with binary, not collated comparisons).

        That’s a performance optimization that ICU only belatedly applied to strcoll, and the UCA unfortunately says almost nothing about it.

        With strcasecmp, you could try a loop or vector comparison to find the longest common binary prefix and then apply case conversion only to the next character. Performance is entirely dependent on how often you get mixed-case equality; e.g., “Daniel” vs. “DANIEL” would have one fast and five slow comparisons, while “Daniel” vs. “daniel” would have one slow and five fast.

        In this instance I doubt it would be faster (it’s certainly branchy), but with more expensive collations it seems like the way to go.
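
        As a concrete sketch of that idea for the ASCII-only case (my illustration, not code from any library), the fast binary loop and the per-character fold would interleave roughly like this:

        ```c
        #include <ctype.h>

        /* "Binary prefix first" comparison: skip over identical bytes, and
         * only case-fold the pair of bytes that breaks the run. Returns 0
         * when the strings are equal ignoring ASCII case. */
        int prefix_casecmp(const char *a, const char *b) {
            for (;;) {
                /* fast path: longest common binary prefix */
                while (*a != '\0' && *a == *b) {
                    a++;
                    b++;
                }
                /* slow path: fold only the mismatching pair */
                int ca = tolower((unsigned char)*a);
                int cb = tolower((unsigned char)*b);
                if (ca != cb) {
                    return ca - cb; /* genuine inequality */
                }
                if (ca == '\0') {
                    return 0; /* both strings ended */
                }
                a++;
                b++;
            }
        }
        ```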

  2. Pete says:

    strncasecmp is character-by-character as well under the hood, except that it cheats with a precomputed case-folding table:

    https://github.com/openbsd/src/blob/master/lib/libc/string/strcasecmp.c
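
    For reference, the table-driven character-by-character scheme looks roughly like this (a simplified sketch, not the OpenBSD code itself, which uses a static compile-time table):

    ```c
    #include <stddef.h>

    /* 256-entry case-folding table: fold[c] is the lowercase form of c
     * (ASCII only; every other byte value maps to itself). */
    static unsigned char fold[256];

    static void init_fold(void) {
        for (int c = 0; c < 256; c++) {
            fold[c] = (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 32)
                                             : (unsigned char)c;
        }
    }

    int table_strncasecmp(const char *s1, const char *s2, size_t n) {
        const unsigned char *a = (const unsigned char *)s1;
        const unsigned char *b = (const unsigned char *)s2;
        while (n-- != 0) {
            unsigned char ca = fold[*a++];
            unsigned char cb = fold[*b++];
            if (ca != cb) {
                return ca - cb;
            }
            if (ca == '\0') {
                return 0; /* both strings ended within n characters */
            }
        }
        return 0;
    }
    ```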

    1. I think that glibc must be doing something smarter.

  3. Travis Downs says:

    glibc on my system uses an AVX specialization if the locale is "compatible with ASCII for single bytes" (basically, that seems to mean it can do something reasonable by just using the ASCII A-Z:a-z mapping and leaving all other byte values uncased).

    It uses a 128-bit wide vector algorithm:

    https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/multiarch/strcmp-sse42.S#L201

    You could almost certainly get 2x as fast with 256-bit vectors. You don’t need the SSE4.2 stuff to detect the null character.
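
    For illustration, the core of such a vectorized ASCII case-fold-and-compare can be sketched with 128-bit SSE2 intrinsics. This is my own simplification: it compares buffers of known length and leaves out the null-terminator detection (and locale checks) that the real glibc routine handles.

    ```c
    #include <emmintrin.h> /* SSE2 */
    #include <stdbool.h>
    #include <stddef.h>

    /* Fold ASCII 'A'..'Z' to 'a'..'z' in each of the 16 bytes of v;
     * all other byte values pass through unchanged. */
    static __m128i ascii_fold16(__m128i v) {
        __m128i ge_A = _mm_cmpgt_epi8(v, _mm_set1_epi8('A' - 1));
        __m128i le_Z = _mm_cmpgt_epi8(_mm_set1_epi8('Z' + 1), v);
        __m128i is_upper = _mm_and_si128(ge_A, le_Z);
        return _mm_add_epi8(v, _mm_and_si128(is_upper, _mm_set1_epi8(0x20)));
    }

    /* Case-insensitive equality (ASCII folding only) over n known bytes,
     * 16 bytes at a time, with a scalar tail for the remainder. */
    bool ascii_caseeq(const char *a, const char *b, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i va = ascii_fold16(_mm_loadu_si128((const __m128i *)(a + i)));
            __m128i vb = ascii_fold16(_mm_loadu_si128((const __m128i *)(b + i)));
            if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF) {
                return false;
            }
        }
        for (; i < n; i++) { /* scalar tail */
            unsigned char ca = (unsigned char)a[i];
            unsigned char cb = (unsigned char)b[i];
            if (ca >= 'A' && ca <= 'Z') ca += 0x20;
            if (cb >= 'A' && cb <= 'Z') cb += 0x20;
            if (ca != cb) return false;
        }
        return true;
    }
    ```

    Widening the loads and the fold to 256-bit AVX2 registers is the obvious way to get the 2x speedup mentioned above.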

    1. I find it interesting that the same function under macOS seems much less optimized. Possibly a FreeBSD/macOS thing. I wonder how well Windows fares?

  4. Christopher Chang says:

    A major practical issue here is that case-insensitive comparisons involve short strings far more often than long ones, so low constant overhead is usually more important for this function than great asymptotic behavior as the length goes to infinity.

    1. Do you have a reason to believe that the conclusion would change if we were to benchmark on small strings?

      1. Travis Downs says:

        I do: the glibc versions won’t exhibit their large advantage for short strings because the vectorization advantage goes to zero (and possibly even becomes negative) as the string length goes to zero.

        1. I have updated the C program so that it benchmarks both small and long strings, and though the gap obviously closes on small strings, the strncasecmp function is still faster.

  5. David says:

    Any intelligent library will use a faster string-matching algorithm like Boyer-Moore or similar, doing far fewer than N compare operations for strings of length N.

    By lowercasing the strings first, you guarantee you do at least N*2 operations even if the strings differ right at the beginning!