21st May 2021, 11 min read

Counting the number of matching characters in two ASCII strings

14 thoughts on “Counting the number of matching characters in two ASCII strings”

Marc Bommert says:

May 21, 2021 at 3:45 pm

An optimizing compiler will turn these function calls into a single
load instruction.

Mmmh, I would not rely on that. The strinsg have to be aligned to a multiple of 8 byte to do so. How should the compiler know? Maybe in cases where this function is inlined (which would require it to be qualified with static visibility) but not on a general basis I guess.
1. Daniel Lemire says:
  
  May 21, 2021 at 5:33 pm
  
  The strinsg have to be aligned to a multiple of 8 byte to do so.
  
  Compilers are hardware-aware are most hardware you might care about (x64, 64-bit ARM) can load data from any address, aligned or not. You should expect these memcpy function call to be compiled to a single instruction. And it is a fast instruction. See my post Data alignment for speed: myth or reality? for more details.
  1. Marc Bommert says:
    
    May 21, 2021 at 8:01 pm
    
    You are right. I’m in a different domain with mostly 32-Bit ARMs, PPC, MIPS, AVR, where the compiler would generated particular instructons for byte, word and dword access or you would at least get a significant perfomance penalty in the silicon from unaligned access in case the chip is capable of it. Has to be considered and measured for a specific hardware to be really sure here. Thanks for the article. Good to know modern CPUs seem to be so much more merciful here.
    1. Daniel Lemire says:
      
      May 21, 2021 at 8:34 pm
      
      You are correct, on some old or small processors, there might be a significant performance penalty to unaligned loads, but even then, you’d expect the compiler to basically get the job done. If the platform lacks an unaligned load instruction, it will have to do some computations to figure out the alignment, do two loads, and fuse the result. It will be harmful.
      
      But most people do not program for such small processors.
      
      It used to be that “mobile” computing meant “small and limited”, but even an old iPhone can do unaligned loads and more.
      1. Marc Bommert says:
        
        May 21, 2021 at 9:39 pm
        
        it will have to do some computations to figure out the alignment
        
        Well, it can’t. That is what I was referring to initially.
        The memcpy implementation will have to respect the aligment dynamically during runtime.
        
        The addresses of s1 and s2 are not known since the function is exported. LTO may catch this, though. Unsure.
        
        Daniel Lemire says:
        
        May 21, 2021 at 10:07 pm
        
        The memcpy implementation will have to respect the aligment dynamically during runtime.
        
        Well, it cannot know the alignment at compile-time, but it could figure it out at runtime.
        
        However, thinking twice about the problem, I think that what an optimizing compiler might do is load 8 bytes (using 8 instructions) and then combine them using shifts. It is going to be expensive at least in terms of instruction count.
        
        What I described at first would not work because it would entails reading more bytes than necessary.
Daniel deLamare says:

May 21, 2021 at 5:06 pm

Pardon my ignorance, but couldn’t the memcpy be avoided if we instead type x and y and uint64_t* and assign the address instead of copying the bytes? For example: x = (uint64_t*)c1 + i; Or would this be the same or worse after being put through an optimizing compiler?
1. Daniel Lemire says:
  
  May 21, 2021 at 5:37 pm
  
  couldn’t the memcpy be avoided if we instead type x and y and uint64_t* and assign the address instead of copying the bytes?
  
  You have to load the data into a register one way or another. A load cannot avoided. Some x64 instructions can take an address as a parameter, but there is effectively a load happening in any case…
  
  I use memcpy to avoid undefined behaviours. Doing a cast to a wider type is not good C code. You have to use a memcpy or the equivalent (in C, C++).
  
  Or would this be the same or worse after being put through an optimizing compiler?
  
  If you are using undefined behaviour, the risk is that the compiler could do something wrong.
  
  Otherwise, I have no way to think that avoid a memcpy could ever speed things up.
  1. Vincenzo Romano says:
    
    May 21, 2021 at 7:22 pm
    
    ASCII only
    1. Daniel Lemire says:
      
      May 21, 2021 at 9:06 pm
      
      I do not understand the comment.
      
      Yes, my code above effectively compares bytes so as-is it assumes ASCII. You can rather easily extend it to UTF-16 if you check for surrogate pairs. Matching UTF-8 characters would be a fun challenge. But then matching UTF-8 would be tricky no matter what.
      
      If you decode to UTF-32 first, then matching the characters fast is the last of your concerns.
Steve Fink says:

May 21, 2021 at 9:48 pm

Note that you specified ASCII, so the high bit will always be clear, so t1 will always be 0x8080808080808080LL.

But (1) your version works for Latin1, and (2) it may be too hard to come up with a new name for matching_bytes_in_word. 😉

(I also don’t know if compilers can convert the last line of that function into a popcnt instruction on their own?)
1. Daniel Lemire says:
  
  May 21, 2021 at 10:00 pm
  
  You are correct. My code is more general than just ASCII. I actually did not try to optimize for ASCII.
  
  I also don’t know if compilers can convert the last line of that function into a popcnt instruction on their own?
  
  I am skeptical.
  
  I only wanted to rely on standard C for this post. You could indeed just call popcnt which, on recent x64 hardware, is probably much cheaper than a shift followed by a multiplication followed by a shift. But I’d be impressed if a compiler could figure that out.
  
  The trouble is that if I can assume popcnt, I may as well assume SIMD instructions and the whole problem takes a different turn.
Dru Nelson says:

May 21, 2021 at 10:14 pm

return ((zeros >> 7) * 0x0101010101010101ULL) >> 56;

That one line is quite a gem. I haven’t seen that trick before.

Thanks for this post Daniel.
Falk Hüffner says:

May 22, 2021 at 8:53 am

zeros in matching_bytes_in_word can be calculated with one instruction less (although it’s unlikely to affect speed much):

uint64_t matching_bytes_in_word(uint64_t x, uint64_t y) { uint64_t xor_xy = x ^ y; uint64_t zeros = (((xor_xy >> 1) | 0x8080808080808080) - xor_xy) & 0x8080808080808080; return ((zeros >> 7) * 0x0101010101010101ULL) >> 56; }