Avoid lexicographical comparisons when testing for string equality?
37 thoughts on “Avoid lexicographical comparisons when testing for string equality?”
Yeah, I wrote a bunch of memequal_k() template specializations to get around some of the fixed-length small cases clang wasn’t handling so well, along with a replacement memequal() function; you can see my code at https://github.com/chrchang/plink-ng/blob/master/2.0/plink2_base.h#L2332 and https://github.com/chrchang/plink-ng/blob/master/2.0/plink2_base.cc#L1701 .
To clarify, the code that foobar linked is an assembly code (.S) file that is used to generate the library version of the memcmp call. It isn’t used by the compiler as a “builtin” or anything to inline memcmp.
I’ve chased it down in gcc to __builtin_memcmp_eq, which is ironically best described in a golang issue: https://github.com/golang/go/issues/21618 .
It’s a tight fit to the situation we see here: constant length, with equality or inequality to 0 instead of the ternary result.
So the way gcc (and maybe clang) handles this is specifically by recognizing memcmp and checking whether only a 2-way result is needed, and then essentially replacing it with a memcmp_eq call.
So it’s hardcoded with knowledge of the semantics of memcmp. This raises the interesting question of whether, if you wrote your own function like memcmp, the generic optimizer could do the same optimization. That is a harder problem, since it has to recognize that the output of your memcmp-like function is only ever compared to zero and eliminate the code that differentiates between < 0 and > 0.
There are three variants of memcmp{0,1,2} which vary only in the return statement when a difference is found.
memcmp0 is the normal way you’d write it: simply bswaps the mismatched words, and subtracts them.
memcmp1 is the same, but it adds a __builtin_unreachable “assertion” that ret is non-zero (we know this is true, since if a != b, bswap(a) != bswap(b), but maybe the compiler needs help).
memcmp2 returns (bswap(a) - bswap(b)) | 1, which doesn’t change the sign of the answer, but is another way to help the compiler know the answer is non-zero.
Then we call each of these functions in my_bcmp{0,1,2} which returns a bool rather than int, and see what happens.
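Roughly this shape, for concreteness (just a sketch with my own names; the length is assumed to be a multiple of 8, tail handling is omitted, and I use an explicit comparison instead of the raw 64-bit subtraction, since a full-width difference doesn’t fit in an int):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* word-at-a-time 3-way compare in the spirit of memcmp{0,1,2} */
static inline int cmp_words(const unsigned char *a, const unsigned char *b, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, 8);  /* the compiler turns these into plain 64-bit loads */
        memcpy(&wb, b + i, 8);
        if (wa != wb) {
            uint64_t ba = __builtin_bswap64(wa);  /* byte 0 becomes most significant */
            uint64_t bb = __builtin_bswap64(wb);
            int ret = (ba > bb) ? 1 : -1;
            /* memcmp1 adds here:  if (ret == 0) __builtin_unreachable();            */
            /* memcmp2 instead returns the bswapped difference with the low bit set  */
            return ret;
        }
    }
    return 0;
}

/* 2-way wrapper in the style of my_bcmp{0,1,2} */
static inline bool my_bcmp(const unsigned char *a, const unsigned char *b, size_t n) {
    return cmp_words(a, b, n) == 0;
}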
The score was gcc 2, clang 1, MSVC 0 (it is hopeless here: it doesn’t even inline bswap). Nobody could optimize the memcmp0 case: they all left the bswap calls in and compared the result. gcc was able to optimize either of the other two cases, and clang only the memcmp2 (| 1) variant. When the optimization worked, it could generate exactly or almost exactly the same code as the explicit bcmp code. Cool.
Note that the | 1 version isn’t free: when you actually use the 3-way result, you get a useless extra or instruction in the result path.
The optimization that allows this is called “range analysis” or something like that: the compiler keeps track of what it knows about a variable, such as what bits are set, whether it is definitely zero or non-zero, or the possible range of values.
It’s the bswap here that “breaks” range analysis from working in the plain memcmp0 case: the compiler knows that left != right, and it knows that this implies left - right != 0, but it doesn’t know that it equally implies bswap(left) - bswap(right) != 0. If you take out the bswap, both gcc and clang are able to optimize my_bcmp0.
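A toy illustration of the difference (not from the benchmark, and the exact behaviour depends on the compiler version):

#include <stdint.h>

int no_bswap(uint64_t left, uint64_t right) {
    if (left == right) return 0;
    /* range analysis knows left - right != 0 here, so this folds to "return 1" */
    return (left - right) != 0;
}

int with_bswap(uint64_t left, uint64_t right) {
    if (left == right) return 0;
    /* the same fact after bswap is apparently not tracked, so the bswaps and the test survive */
    return (__builtin_bswap64(left) - __builtin_bswap64(right)) != 0;
}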
Evidently neither compiler has a rule encoding that bswap preserves (in)equality.
An equality-only memcmp is nearly exactly the same.
I think that my tests, and yours, show that there is a difference, at least as far as some compilers and data distributions are concerned. An equality comparison is an easier problem (for the compiler, for the programmer, for the hardware).
The difference might be small, for some definition of small which goes down to zero in some cases. But a small difference multiplied by enough cases ends up making a difference.
To prove me wrong, someone simply has to propose a piece of lexicographical comparison C code that consistently beats an optimized equality comparison, across a wide range of compilers and data distribution. If someone has that, then I will gladly retract my blog post.
I think that my tests, … show that there is a difference, at least as far as some compilers and data distributions are concerned.
Not at all. They show that gcc is deciding to call the library memcmp through the PLT boundary, and that a totally generic library call that doesn’t know the fixed length of the comparison does terribly against an inlined function written for a hardcoded length.
If you flip the situation around, and write an inlined lexicographic compare and call a library generic memeq function (bcmp could do), the results will approximately reverse themselves.
Yes, lexicographic comparison is somewhat more difficult than plain equality comparison, but this test doesn’t show that at all except by coincidence.
To prove me wrong, someone simply has to propose a piece of lexicographical comparison C code that consistently beats an optimized equality comparison, across a wide range of compilers and data distribution. If someone has that, then I will gladly retract my blog post.
This is also wrong.
No one that I can see is claiming that lexicographic comparison is faster than plain equality comparison (outside of implementation quality), and it is fairly obvious that if you had a super fast lexicographic comparison you could turn it into a super fast equality comparison with approximately zero work, so lexicographic comparison is never really going to be faster.
Your claim here is that equality search is faster than lexicographic search when only equality is needed. The negation of that is that equality search is slower or equal to lexicographic search. We can toss out the slower part, but all someone has to do to prove you wrong is hit the bar for “equals”.
You might not get to exactly equals, and the data distribution really matters here, but it will be close and much closer than your misleading test suggests.
There is also the little issue of the interface between the 3-way result and the ultimate use as a 2-way result: clearly if you needed a 3-way result in the first place, there is no comparison: you need to use a 3-way function like memcmp. The motivating example, however, only needed a 2-way result and that was expressed in the source as !memcmp(), so that information is available to the compiler, and indeed better compilers than the one git was using at the time can use this info to produce better code.
Your test stores all the results into an array and actually uses the 3-wayness of the result, so that will never happen in this case, but it isn’t clear whether that is a more likely scenario than memcmp() == 0 type scenarios.
IMO when an error is pointed out in the test methodology, you should correct it rather than demanding that someone else write the code that perhaps you should have written in the first place!
This stackoverflow shows memcmp is inlined to word comparisons for the expression memcmp() == 0. For length 20 it does two 8-byte and one 4-byte xor: https://godbolt.org/z/GuqNxJ .
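In other words, something along these lines (my own toy function, not taken from the linked answer):

#include <stdbool.h>
#include <string.h>

/* constant length, result only tested against zero: compilers that inline
   memcmp can reduce this to a couple of wide loads and xors, no library call */
bool same20(const char *a, const char *b) {
    return memcmp(a, b, 20) == 0;
}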
Why does your memeq20 function copy the data and do the comparison on the copies? You aren’t modifying the data, so why not just compare it in place?
…
uint64_t * w1 = (uint64_t *)s1;
uint64_t * w2 = (uint64_t *)s2;
bool answer1 = (w1[0] == w2[0]);
bool answer2 = (w1[1] == w2[1]);
uint32_t * ww1 = (uint32_t *)(s1 + sizeof(uint64_t) * 2);
uint32_t * ww2 = (uint32_t *)(s2 + sizeof(uint64_t) * 2);
bool answer3 = (*ww1 == *ww2);
…
Or something like that.
The code you are proposing is probably relying on undefined behaviours. It is unsafe.
I was wondering why you were using memcpy instead of simple pointer arithmetic. With pointer arithmetic, you get the benefit of simpler code and possibly a small speedup if checking the previous answer.
To make the code legal C. Although it works in practice, it is not legal to access an array of one type (char *) with a pointer to another type (uint64_t *), except in special circumstances.
Yes, this is fine in assembly, but C is not assembly (although in practice current compilers often generate the assembly you expect, but who knows about tomorrow!).
It doesn’t cause a problem here because the memcpy is transformed back into a plain load by most compilers.
Your compiler optimizes away the memcpy. Without it, you risk undefined behaviours.
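For reference, here is roughly how the memcpy-based, strict-aliasing-safe version of the same 20-byte comparison looks (a sketch; the helper names are mine):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* memcpy into a local integer is the portable way to reinterpret bytes;
   optimizing compilers emit a single load for each of these */
static inline uint64_t load_u64(const char *p) { uint64_t w; memcpy(&w, p, sizeof(w)); return w; }
static inline uint32_t load_u32(const char *p) { uint32_t w; memcpy(&w, p, sizeof(w)); return w; }

bool memeq20(const char *s1, const char *s2) {
    bool answer1 = load_u64(s1) == load_u64(s2);
    bool answer2 = load_u64(s1 + 8) == load_u64(s2 + 8);
    bool answer3 = load_u32(s1 + 16) == load_u32(s2 + 16);
    return answer1 && answer2 && answer3;
}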
Shouldn’t ww1 and ww2 be uint32_t?
Yes, it was a typographical error.
Interesting. I’d think a naive copy of memcmp would take two operations per word, testing each for (a .lt. b) and then, if that failed, (a .eq. b), which would make it twice as slow as your memeq in the worst case of identical strings. And indeed, your version appears to be twice as fast.
But a better memcmp would be to walk down the string and just test each word for equality, and then only test (a .lt. b) when the equality test failed. In that case, you’d think both would be more or less as fast, as memcmp would only have one more operation than memeq. So I’m kind of surprised by this result.
That is true for some definition of “more or less”.
You are right … “in the long run”. Certainly most memcmp implementations will sequentially check one large word from each input and when they differ do “a bit more” to return the lexicographic result. They do not do the naive thing of comparing < and > for each pair.
It’s not necessarily just “one more operation” though, especially on little-endian machines: if you know that the two inputs differ for some eight byte word, how do you return the right 3-way result? The easy way is to subtract, but this gives the right answer only on big-endian. On little endian you can swap the byte order and then subtract, but 64-bit bswap is not super fast even on recent Intel.
There are perhaps other tricks possible, but current compilers prefer bswap.
Vectorization of memcmp is another challenge: if you compare a 256-bit “word” and find it unequal, you can’t subtract them since there are no 256-bit wide subtractions. So again it’s more complex. Asymptotically they are the same though.
Finally, memeq gives you another advantage: you could check the bytes in any order you want. One could imagine an adaptive or application-specific memeq that knows strings usually differ at the start or the end so checks them in an order that takes advantage of that. memcmp can’t.
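For instance, one could imagine something like this for 20-byte keys that tend to differ near the end (purely illustrative; the names and layout are mine):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* equality check that looks at the last 8 bytes first, then the middle 8,
   then the first 4; any order is fine because only equality is needed */
bool memeq20_tail_first(const char *a, const char *b) {
    uint64_t wa, wb;
    memcpy(&wa, a + 12, 8); memcpy(&wb, b + 12, 8);
    if (wa != wb) return false;
    memcpy(&wa, a + 4, 8); memcpy(&wb, b + 4, 8);
    if (wa != wb) return false;
    uint32_t ua, ub;
    memcpy(&ua, a, 4); memcpy(&ub, b, 4);
    return ua == ub;
}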
That said, the tests here don’t really show these effects: the memcmp slowdown shown in the numbers above is more or less entirely because gcc decided to call into the libc memcmp, which is much slower.
Finding the distinguishing bit in little-endian ints is about 3 instructions:
z = x^y
bool x_is_greater = x & (z ^ z - 1)
I don’t think it works.
z ^ (z - 1) sets all bits up to the lowest set bit, so the whole thing returns “does x have any bits set from bit 0 through the lowest bit that differs with y?”, which is clearly wrong. As an example, x = 1, y = 2:
z = 3
z ^ (z-1) = 1
x & (z ^ (z-1)) = 1
So you get the wrong answer (x is not greater in the only byte x and y differ).
Maybe you wanted something like (z & -z) which isolates the lowest set bit, then you can do your & to see which of x or y had the differing bit set. It still doesn’t work: you want to look at the top bit in the differing byte.
That’s the problem with all the bit-tricks: there are two totally different directions you need to look: first you need to scan from the LSB(yte) to the MSB to find the lowest differing byte, but then you need to scan in the opposite direction within that byte from MSB(bit) to LSB(it) to find the distinguishing bit.
Since most bit tricks don’t do anything special wrt bytes, this is hard, and why bswap is used. You can probably still do it, maybe along the line of the strlen bithack that finds a zero byte, but I don’t think it will be 3 instructions.
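(For reference, the zero-byte bithack I have in mind is the classic one:)

#include <stdint.h>

/* classic zero-byte test: nonzero iff some byte of v is zero */
static inline int has_zero_byte(uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}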
It’s a shame bswap is 2 uops for 64-bit values on Intel, otherwise the bswap approach would be just 3 uops (2 bswap, subtract). On Zen, it’s one op and you can do 4 per cycle! Even better on Zen you can do MOVBE from memory in a single op, versus 3 on Intel.
With significance reversed, 1 > 2.
I see – bytewise, significance is the same.
Yes that’s the problem. Note that even ignoring that issue, the original code didn’t return a meaningful result: it returns true for almost any two values. It returns true for 1,3 and it also returns true for 3,1!
Yup, I didn’t see a better alternative to bswap when I looked into this a few months ago, either. Just make sure to only bswap after you’ve found the unequal words; the standard library implementation I saw made the mistake of bswapping before the equality check.
Yeah I saw that too. On modern Intel it’s definitely better to do the bswap in the finishing-up code, but one could argue that on AMD Zen you could do two MOVBE every iteration for free in place of regular mov, since it is only one uop, so about the same cost as a plain load “in principle” (I haven’t tested it). Hopefully Intel will up their BSWAP game in the future.
A big limitation of bswap is that it doesn’t obviously play nice with vectorised comparison loops, since the differing “word” is wider than 64 bits. A movmaskb and a trailing zero count can give you the position of the differing byte, which you can then load from each input and subtract. Reasonable throughput but bad latency. I guess you could do the comparison in the vector domain, too.
Coincidentally I corresponded with Derrick on another possible optimization at the beginning of the year which ran into similar issues with memcmp.
I tried doing git index search with Manber and Myers’s String Binary Search algorithm, but it turned out that whatever byte comparisons we saved by skipping the longest common prefix were eaten up by using bytewise comparison, since we couldn’t use memcmp to get LCP.
My recollection is that memcmp is heavily tuned, with versions for each length if known at compile time.
The poor performance of memcmp here has more or less nothing to do with “lexicographic comparison” and almost everything to do with the fact that memcmp doesn’t get inlined, so it (a) has to do a function call through the PLT and (b) has to implement a totally generic algorithm without the benefit of knowing that the string has a fixed length.
That’s apples and oranges!
Lexicographic comparison does have some cost, but it is quite small and, importantly, is a fixed cost per call: the guts of a typical memcmp do a search for non-equal bytes and then subtract the non-equal bytes to get the required ternary result. An equality-only memcmp is nearly exactly the same.
Also in this case, a fast-path optimization of comparing the first word of the hash values (?) and going to the slow path only if they match would skip a lot of the generic logic inside memcmp.
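Something like the following, say (a sketch only; the 20-byte length and the names are mine, picked to match the hashes discussed above):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* fast path: compare the first 8 bytes inline and fall back to memcmp
   only when they match, which for hash values is the rare case */
bool hash20_equal(const char *a, const char *b) {
    uint64_t wa, wb;
    memcpy(&wa, a, sizeof(wa));
    memcpy(&wb, b, sizeof(wb));
    if (wa != wb) return false;
    return memcmp(a + 8, b + 8, 12) == 0;
}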
I think reasonably up-to-date glibc source for the amd64 generic memcmp can be found at the Sourceware repo. By looking at that code it is pretty obvious that the generic memcmp is pretty slow for many inputs. For instance, deciding that buffers shorter than 32 bytes differ on the first byte takes 19 assembler instructions to execute, and more than 32 bytes requires more even in the best case! A good statically generated fast path (which compilers at godbolt.org don’t really seem to generate that eagerly) would be a well-behaved three-instruction sequence…
The trick is to ask for memcmp(..,20) == 0 — it will generate wordwise compares inline instead of looking for the distinguishing prefix.
Unfortunately I don’t know how this relates to the code you linked — does that hidden builtin macro help with the inlining, or is there some other code generation going on (in gcc)?
At least on gcc, this doesn’t seem to be true. All the gcc versions from about 4.6 onwards I checked on x86 seem to always generate an actual call to memcmp (and before 4.6 they did inline code but did a bad job of it, even with static lengths and when comparing to zero – they always used rep cmp type stuff rather than just a few explicit loads).
Perhaps this is a consequence of this bug where someone claims “library memcmp is really fast, just always call the function”, where this is plainly not true for small fixed lengths. I’m kind of surprised gcc fails badly here: it does much better on inlining fixed-length memcpy, for example (and Daniel’s code relies on that by doing several 8-byte memcpy calls that just get transformed into single 64-bit loads by the compiler).
clang does better: if I compile the benchmark with clang 5.0, the results reverse: I get that memcmp is faster than memeq20. Note that this test is a bit specific in that the compared strings are randomly generated, so 99.6%+ of the time they will differ at the first byte, so it is really testing the fast path how the first byte or word or whatever is handled.
The clang generated code isn’t all that great either, but it gets some things right: if you compare against 0, it removes both the special final logic to calculate the ternary result, and also bswap calls in the loop relating to the ternary result (but the bswap calls arguably shouldn’t have been in the loop in the first place).
Two nicely aligned fixed-size buffers in a struct and a fixed comparison length of 64 bytes resulted in rather interesting code: for gcc it started with two 64-bit reads from both buffers, XORing corresponding pairs and testing the results with OR. Well, that’s a correct and reasonably efficient start (unless you use AVX, which clang would use if the architecture is specified correctly), but in the case of hashes that usually differ, just comparing leading words (or probably even bytes, to avoid misalignment penalties) would have been faster.
So, for specific use cases, a hand-written comparison routine beats those reasonably good compiler-generated variants. It’s hard to tell the compiler that the bits in the buffers are random and differ 99% of the time, but 1% of the time probably all of them match! (If it were the opposite, it would probably be beneficial to write a XOR-OR chain without any conditional branches, at least for reasonably short inputs; see the sketch below.)
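By a branchless XOR-OR chain I mean something along these lines (a sketch; a fixed 64-byte length is assumed):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* branchless 64-byte equality: xor corresponding words, or the results
   together, and test once at the end; good when the buffers usually match */
bool eq64(const char *a, const char *b) {
    uint64_t acc = 0;
    for (int i = 0; i < 64; i += 8) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof(wa));
        memcpy(&wb, b + i, sizeof(wb));
        acc |= wa ^ wb;
    }
    return acc == 0;
}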
Is the code you wrote for memcmp or bcmp though? That is, does it return a 3-way or 2-way (boolean) result? How did you do the 3-way part if three way?
I agree that in this test the first word will almost always differ (always, if it’s 64 bits), so the key is to make the fastest possible code for that case. For clang, this means that comparing 24 bytes is actually faster than 16, since doing the “odd” word before vectorizing happens to line up with the fast path you’d want. See my comment here.
This is a bit artificial, but nonetheless, it seems to show different approaches (and failures) by different compilers: godbolt.org code.
That’s really interesting. It seems like I was (somewhat) incorrectly slandering gcc’s ability to inline constant length memcmp() as your example shows.
Luckily a struct isn’t necessary: it works for plain pointers too. The key seems to be that the result has to be used non-lexicographically: if the calling code differentiates between the < 0 and > 0 cases, it will call memcmp. So this reinforces the idea of this blog post in a way, but partly due to a quirk of gcc: it also implies that the original code that saw the slowdown would probably have been fine on a more recent version of gcc.
clang behaves somewhat differently: it can handle either style, but with quite different code. It vectorizes the boolean versions, and uses scalar code with movbe when the full ternary result is used. The clang code seems suboptimal: it could just use plain moves and do the bswap in the tail part, and it should just subtract the differing words rather than the conditional stuff it is doing.
Note that this is exactly what KWillets said originally, so my objection was out of line. It does work when you compare memcmp against zero, at least for most recent gcc versions (seems like the inlining started around 7.0 or 7.1).
I should credit this stackoverflow: https://stackoverflow.com/questions/45052282/why-is-memcmpa-b-4-only-sometimes-optimized-to-a-uint32-comparison (I thought I posted a more coherent comment last night with this and the above points, but apparently not).