
Sentinels can be faster

22 thoughts on “Sentinels can be faster”

  1. Mark says:

    Some of this sounds like Facebook’s std::string:

    https://www.youtube.com/watch?v=kPR8h4-qZdk

    Does the sentinel in C beat their null termination?

  2. -.- says:

    NULL-terminated strings are potentially also a security vulnerability if not handled properly. Besides making it easier to accidentally write buffer overflows, they’re subject to NULL byte injection attacks. They also can’t properly support binary data in general (since the data will contain NULL bytes), so you need to support a length-specified variant anyway.

    Speed-wise, they can be problematic, particularly if you’re looking to parallelise code (e.g. with SIMD). They can also effectively turn character iteration into a pointer-chasing problem: the CPU only knows it can check the next character if the current character isn’t NULL (and a compiler can’t help you). For a lot of string handling, though, it’s probably the same, unless you have fixed-length fields or known-length processing (e.g. strcmp), or are willing to put in effort to work around it (e.g. SIMD).
    Of course, you can always trivially append a value at the end if it would benefit your code, as in the sketch below.
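
    For instance, a minimal sketch of that idea in C++ (read_input() and the function name are made up for illustration):

        #include <string>

        std::string read_input(); // hypothetical source of data

        const char *skip_spaces_with_sentinel(std::string &buf) {
          buf.push_back('\0');        // '\0' is not a space: a guaranteed stop
          const char *p = buf.data();
          while (*p == ' ') p++;      // the inner loop needs no end-of-buffer check
          return p;
        }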

    1. Travis Downs says:

      I agree with most of this comment, but not the “can also effectively turn character iteration into a pointer-chasing problem”. It is true that it prevents easy vectorization because you can’t be sure if you can access byte n+1 until you’ve checked byte n – but it is materially different from pointer chasing, since there is no data dependency there, only a well-predicted control dependency. That’s the difference between, say, 1 cycle per byte and 5 cycles per byte (see the sketch below).
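
      A sketch of that contrast, with nothing assumed beyond a byte array versus a linked list:

          #include <cstddef>

          // Array scan: the loads are independent; the only dependency is a
          // well-predicted branch, so iterations overlap (~1 cycle per byte).
          size_t scan(const char *s) {
            size_t n = 0;
            while (s[n] != '\0') n++;
            return n;
          }

          // Linked list: each load needs the previous load's result, a true
          // data dependency, so every step pays the load latency (~5 cycles).
          struct Node { Node *next; };
          size_t walk(const Node *p) {
            size_t n = 0;
            while (p != nullptr) { p = p->next; n++; }
            return n;
          }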

      Vectorization is still possible, but you have to make some tough choices: e.g., aligning your reads and using the fact that you won’t cross a page boundary to do a read that is technically out of bounds but known safe. Library code like strlen() does exactly that – but it has to be whitelisted in tools like ASan and Valgrind, which would otherwise complain. For libraries other than the standard library, this is less palatable, since you’ll probably have to convince users to add the whitelist rules themselves. A sketch of the trick is below.
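
      A minimal sketch of that trick (not glibc’s actual code; assumes SSE2 and the GCC/Clang __builtin_ctz):

          #include <emmintrin.h> // SSE2
          #include <cstddef>
          #include <cstdint>

          size_t sse2_strlen(const char *s) {
            const __m128i zero = _mm_setzero_si128();
            const uintptr_t addr = reinterpret_cast<uintptr_t>(s);
            const unsigned off = addr & 15;
            // Align down: an aligned 16-byte load never crosses a page boundary,
            // so it cannot fault, even though ASan/Valgrind will flag it.
            const char *p = s - off;
            unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(
                _mm_load_si128(reinterpret_cast<const __m128i *>(p)), zero));
            mask &= 0xFFFFu << off; // ignore the bytes before s
            while (mask == 0) {
              p += 16;
              mask = _mm_movemask_epi8(_mm_cmpeq_epi8(
                  _mm_load_si128(reinterpret_cast<const __m128i *>(p)), zero));
            }
            return static_cast<size_t>(p + __builtin_ctz(mask) - s);
          }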

  3. Denis says:

    Perhaps this article is relevant.

    1. You are correct! It is the same idea expressed in similar terms.

  4. Raj Darman says:

    C newbie here, so apologies for my limited understanding.

    We pass const char * start to the skip_leading_spaces() function, and inside we increment the start pointer. But this is a const char *, so we shouldn’t be allowed to increment it, right?

    1. You are confusing const char * start with char * const start.

      1. Jonathan O'Connor says:

        To clarify for Raj, const char* means the char can’t change, whereas char* const means the pointer can’t change.

        You can even have const char* const where both the pointer and the pointed to char are const.

        Yes, at first this is confusing, but it’ll become natural after some more months writing C or C++.
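
        A tiny illustration (the names are arbitrary):

            void example() {
              const char *p = "abc"; // pointer to const char
              p++;                   // fine: the pointer itself may change
              // *p = 'x';           // error: the pointed-to char is const

              char buf[] = "abc";
              char *const q = buf;   // const pointer to (mutable) char
              *q = 'x';              // fine: the char may change
              // q++;                // error: the pointer itself is const

              const char *const r = "abc"; // neither r++ nor *r = 'x' compiles
              (void)r;
            }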

  5. This is a complicated tradeoff. Sentinels can be faster, but they can also be slower.

    In general I found that they are a great tool to micro-optimize parsers, but they make scanning using SIMD hard. Unless the CPU provides instructions to load data without triggering fault handlers, or you’re willing to cheat by detecting page boundaries (which doesn’t work with Address Sanitizer / Valgrind), null-terminated strings prevent you from scanning multiple characters at a time.

    E.g. scanning an integer out of a string can be done with a few SIMD ops (load 16 bytes, subtract ‘0’, range-check against 10, movemask, find the first 0 bit) if you know the 16-byte load is safe, but that’s hard to establish without an explicit size. A sketch follows.
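
    For illustration, a rough SSE2 sketch of that sequence, assuming the 16-byte load is known to be safe (SSE2 has no unsigned byte compare, so the range check uses a min trick; __builtin_ctz is GCC/Clang):

        #include <emmintrin.h> // SSE2

        // Number of leading ASCII digits in the 16-byte chunk at s.
        int count_leading_digits(const char *s) {
          const __m128i chunk =
              _mm_loadu_si128(reinterpret_cast<const __m128i *>(s));
          const __m128i x = _mm_sub_epi8(chunk, _mm_set1_epi8('0'));
          // Range check vs 10: min(x, 9) == x iff x <= 9 as an unsigned byte.
          const __m128i is_digit =
              _mm_cmpeq_epi8(_mm_min_epu8(x, _mm_set1_epi8(9)), x);
          const int mask = _mm_movemask_epi8(is_digit); // bit i set iff s[i] is a digit
          // First 0 bit = first non-digit; ~mask != 0 because mask <= 0xFFFF.
          return __builtin_ctz(~mask);
        }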

  6. Noooo! Byte at a time is a sure way to kill performance.
    – read glibc str/mem routines for your favorite architecture.
    – understand them
    – profit

    I did some of the initial glibc vectorized versions for x86_64 back in 2010 while at AMD, and saw more than a 10x speedup for non-trivial strings. Bigger speedups for longer strings. Glibc developers have significantly improved upon those versions since then.

  7. Rodriguez says:

    Unless you are running these tests on a processor without branch prediction, I honestly don’t understand the 40% difference – I would have expected a very small (if any) difference. Actually, in your benchmarking code it looks like you generate the data once…, not two identical copies (at different memory addresses) of the data for the ‘regular’ and ‘sentinel’ cases, so I think that that is the source of the 40%…

    1. Travis Downs says:

      Well-predicted comparisons are not free: at a minimum there are still instructions and uops to make the comparison and take the branch. It can get worse from there: the compiler might not know which branch is the likely one and can end up organizing the code so that the fast path has an always-taken branch, which slows things down even more, etc.

      A 40% improvement from removing one compare-and-branch from a loop that is extremely small – basically a single load plus another compare-and-branch – is not surprising.

  8. Rodriguez says:

    I compiled your code (and changed the order in which the tests are exercised), ran it many times, and I could not get more than a 5% difference between the two with -O2 and -O3. So I was wrong that the difference would be small – 5% is significant to me – but for the 40% there must be some reason in how the code is generated or executed (which I don’t have the time to analyze further).

    1. Here are my numbers on an AMD Rome processor:

      [dlemire@rome 09]$ c++ --version
      c++ (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
      Copyright (C) 2018 Free Software Foundation, Inc.
      This is free software; see the source for copying conditions.  There is NO
      warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
      

      [dlemire@rome 09]$ make
      c++ -O3 -std=c++11 -o sentinel sentinel.cpp
      [dlemire@rome 09]$ ./sentinel
      N = 10000
      range 1313.74
      sentinel 892.865
      ratio 1.47138
      @

      Here are the numbers on a 2018 iMac:

      $ c++ --version
      Apple clang version 11.0.3 (clang-1103.0.32.62)
      Target: x86_64-apple-darwin19.6.0
      Thread model: posix
      InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
      lemire@iMac-2018 ~/CVS/github/Code-used-on-Daniel-Lemire-s-blog/2020/09 $ make
      c++ -O3 -std=c++11 -o sentinel sentinel.cpp
      lemire@iMac-2018 ~/CVS/github/Code-used-on-Daniel-Lemire-s-blog/2020/09 $ ./sentinel
       N = 10000
      range 1040.48
      sentinel 725.152
      ratio 1.43484
      @
      

      Here are my numbers on a Linux-based Skylake (Intel) box:

      $ c++ --version
      c++ (Ubuntu 8.4.0-1ubuntu1~16.04.1) 8.4.0
      Copyright (C) 2018 Free Software Foundation, Inc.
      This is free software; see the source for copying conditions.  There is NO
      warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
      

      dlemire@skylake:/CVS/github/Code-used-on-Daniel-Lemire-s-blog/2020/09$ make
      c++ -O3 -std=c++11 -o sentinel sentinel.cpp
      dlemire@skylake:/CVS/github/Code-used-on-Daniel-Lemire-s-blog/2020/09$ ./sentinel
      N = 10000
      range 1194.83
      sentinel 871.197
      ratio 1.37148
      @

      1. Rodriguez says:

        I tried your latest version on my computer (which is “old”, an Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz, c++/gcc 9.3.0) and got:

        $ make
        c++ -O3 -std=c++11 -o sentinel sentinel.cpp
        $ ./sentinel
        N = 10000
        range 1411.8
        sentinel 1353.81
        ratio 1.04284
        @

        I see that you get the ~40% and I am sure that that outcome is really happening.

        However, it is very difficult to explain the reason (such a large percentage) in a general way. Both the range condition and is_space() require the new value of start (which just takes an addition), whereas the main latency is incurred in bringing the character from cache/memory and checking the table within is_space(). A modern speculative processor will already have computed the next “start++”, performed the check “start != end” (but not committed it), and could in principle even initiate the next memory/cache request before is_space() finishes; also, being sequential access, the data for the speculative path is probably already in the cache.

        If I run the loop without calling is_space(), it takes 16 ns, i.e., the overhead of the range check alone is very small, and we also know the overhead of is_space() (the ‘sentinel’ version). I changed the order of the comparisons for the range check to:

        while (is_space(*start) && (start != end))

        and got even a smaller difference (~1%):

        $ ./sentinel-v
        N = 10000
        range 1348.07
        sentinel 1337.33
        ratio 1.00803
        @

        Because I am pretty sure that the 40% is actually happening, this has me puzzled – puzzled because the underlying reason has to be more complex than just having two comparisons. A possibility that crossed my mind is that the overhead we are seeing with the range comparison could be related to the fact that we are comparing “chars”, and maybe this causes a barrier effect, as the next character is aliased in the same word, and maybe this messes with speculation, the compiler, or both (I’m just speculating now, as I can’t go much further down that rabbit hole, but it is interesting anyway).

        1. If you email me, I’ll give you access to one of my servers.

          whereas the main latency is incurred in bringing the character from cache/memory

          Memory/cache access latency is almost surely irrelevant for this benchmark. We are reading sequentially.

          while (is_space(*start) && (start != end))

          I expect that this is incorrect code.

          puzzled because the underlying reason has to be more complex than just having two comparisons

          Here is the assembly of the sentinel version…

              movzx   ecx, byte ptr [rax + 1]   # load the next character
              add     rax, 1                    # advance the pointer
              cmp     byte ptr [rcx + is_space(unsigned char)::table], 0   # is it a space?
              jne     .LBB0_1                   # loop while it is
              ret
          

          You can easily retire 4 instructions per cycle. Depending on the processor, there may be a limit on how many jumps you can take (e.g., you could be limited to one every two cycles). I believe, however, that in this case recent Intel and AMD processors should be able to sustain one jump per cycle, given how short the loop is.

          Note that a Sandy Bridge processor like yours is not something I have recent experience with (it is 10 years old).

          Here is the assembly of the other version…

              movzx   ecx, byte ptr [rax]       # load the current character
              cmp     byte ptr [rcx + is_space(unsigned char)::table], 0   # is it a space?
              je      .LBB1_4                   # exit the loop if not
              add     rax, 1                    # advance the pointer
              cmp     rsi, rax                  # bounds check against 'end'
              jne     .LBB1_1                   # loop if not at the end
              mov     rax, rsi
          

          I would expect that the compare/jump instructions get fused. However, you surely can’t execute two fused compare/jump instructions per cycle. So the long version has a limit of two cycles per iteration.

          So I get a speed limit of one iteration per cycle for the sentinel version and one iteration every two cycles for the regular version. Of course, these are speed limits, and we probably do not manage to achieve this ideal performance.

          1. Rodriguez says:

            Wow, that was a very valuable deep dive. Many thanks. This exercise was more valuable than I would have thought initially.

            Now I think I see it. It doesn’t matter how much speculative work the processor could have done: in this case, if I understand it correctly, we are gated by the number and type of instructions that can be retired per cycle, and therein lies, for me, an important takeaway. It is a very illustrative example to teach about the state of modern microarchitecture.

            Also, I agree with all the other points (sequential access means that fetching the data is not the limiting factor); and when inverting the comparisons I was sloppy (I cannot think of an excuse for that). My Sandy Bridge still feels very powerful if I don’t compare it against any modern processor 🙂

            1. I have not “reasoned out” the performance to my satisfaction. It would have taken me quite a few more minutes. I just wanted to illustrate that the 40% difference is credible.

              For such a small benchmark, it should be possible to model the performance entirely and know exactly the expected performance.

              A Sandy Bridge processor is really not very far from current processors as far as the main architectural features are concerned, but on microbenchmarks there can certainly be sizeable differences.

              1. Rodriguez says:

                The ways in which complexity leaks out of anything, even when it looks short and simple, never cease to overwhelm me.

                At the beginning, the 40% looked to me too extraordinary for a simple explanation – I wanted to know where exactly it could come from.

                BTW, I came across https://www.agner.org/optimize/microarchitecture.pdf, which actually contains a section on the bottlenecks of each microarchitecture that I found interesting (assuming it is an accurate report).

  9. Rickey Bowers says:

    mov al, ' '
    repz scasb
    jz @F
    dec rdi

    I’d probably inline it. (c: The CISC goodness of x86 has been neglected by CPU manufacturers for decades – with good reason. Instructions like JRCXZ make handling Pascal-type strings easy, though.

    The main benefit of not using sentinels is that the zero byte can be used for something else – typically, binary data needing the full range of values.

    Each method has its uses: sentinel, count, or range.

    1. Rodriguez says:

      I inlined the code you suggested and tested it on my laptop (I know that could be considered a sin; an Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz, c++/gcc 9.3.0) and found no difference between the code generated from C and the inlined CISC instructions. I am curious whether you get different results on a different machine, and why the results would differ or not (I can imagine ways in which it could be slower or faster; I just don’t know enough about each concrete microarchitecture to figure it out before running the benchmark).

      I did it like this:

      const char *skip_leading_spaces_asm1(const char *start) {
        asm("mov al, ' ';\n\t"      // fill al with the space character
            "repz scasb;\n\t"       // advance rdi while the byte at [rdi] equals al
            "jz 1f;\n\t"            // taken only if the count runs out on a space
            "dec rdi;\n\t"          // scasb steps past the non-space; back up one
            "1:\n"
            : "+D"(start)                             // rdi holds the scanned pointer
            : "m"(*(const char(*)[])start), "c"(-1)   // rcx = -1: effectively unbounded count
            : "cc", "al");
        return start;
      }

      Or alternatively (effectively the same code, but using Extended ASM’s features better to set ‘al’):

      const char *skip_leading_spaces_asm(const char *start) {
        asm volatile("repz scasb;\n\t" // advance rdi while the byte equals al (' ')
                     "jz 1f;\n\t"      // taken only if the count runs out on a space
                     "dec rdi;\n\t"    // scasb steps past the non-space; back up one
                     "1:\n"
                     : "+D"(start)
                     : "m"(*(const char(*)[])start), "a"(' '), "c"(-1) // al = ' ', rcx = -1
                     : "cc");
        return start;
      }

      I believe that the code is correct (I printed ‘start’ at the end of each run and compared it against the C++ function’s output). I would nonetheless welcome it being pointed out if something in the code is dangerous, wrong, not best practice, or could be radically improved.

      I modified the Makefile like this:

      tolower: sentinel.cpp
      	c++ -O3 -masm=intel -std=c++11 -o sentinel sentinel.cpp
      	c++ -O3 -masm=intel -std=c++11 -S -o sentinel.asm sentinel.cpp

      This was one output:

      ./sentinel
      N = 10000
      range 1413.99
      sentinel 1340.31
      sentinel_asm 1354.13
      ratio range/sentinel 1.05497
      ratio range/sentinel_asm 1.04421
      @

      1. Rickey Bowers says:

        Thanks for testing. I didn’t expect it to be faster in this context – just as small as the call overhead of a separate function, and more featureful. Timing such a small piece of code is error-prone. From a practical standpoint, using such a routine is surely going to be followed by a response to an empty string – which the forward branch would also handle.

        There is another thing, though. Intel’s processor design has relegated the complex CISC instructions to reduced silicon instead of improving their speed – for a couple of decades. This is because compilers don’t generate these instructions. Further examples are LOOP*, JCXZ, etc. Exceptions are REP STOS/MOVS*, which have been optimized in recent generations.

        JCXZ and LOOP seem not to have been neglected on AMD processors.

        Unpopular instructions are not optimized. In extreme cases they are repurposed (like BOUND and the BCD instructions). The unfortunate consequence has us programming like RISC machines – which are boring, imho.