Daniel Lemire's blog


How quickly can you remove spaces from a string?

40 thoughts on “How quickly can you remove spaces from a string?”

  1. John says:

    I’m interested in what the benchmark is for this

    size_t despace(char * bytes, size_t howmany) {
      char * src = bytes;
      char * dest = bytes;

      for (size_t i = 0; i < howmany; i++) {
        char c = *src++;

        if (c == '\r' || c == '\n' || c == ' ') {
          continue;
        }
        *dest++ = c;
      }
      return dest - bytes;
    }
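
    For context, cycles-per-byte figures like the ones quoted in this thread are typically obtained with a harness along these lines; this is only a sketch (function names and structure are illustrative, not the actual despacer benchmark), timing each call with the time-stamp counter and dividing by the number of processed bytes:

    #include <stdint.h>
    #include <string.h>
    #include <x86intrin.h> // __rdtsc

    // Sketch of a cycles-per-byte harness (illustrative, not the actual
    // despacer benchmark): refill the buffer, time one call with rdtsc,
    // keep the best run, and divide by the number of bytes processed.
    static double cycles_per_byte(size_t (*fn)(char *, size_t),
                                  char *buffer, const char *input, size_t n,
                                  int repeat) {
      uint64_t best = UINT64_MAX;
      for (int r = 0; r < repeat; r++) {
        memcpy(buffer, input, n); // the despace functions work in place
        uint64_t start = __rdtsc();
        fn(buffer, n);
        uint64_t end = __rdtsc();
        if (end - start < best) best = end - start;
      }
      return (double)best / (double)n;
    }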

  2. Nathan Kurz says:

    Nice demo, but why would it be foolish to expect it to be 16x faster? Have you already lost faith in the power of magic vector dust? Relative to memcpy, what do you think the fundamental limit should be here?

    I may be doing something wrong, and I realize it isn’t exactly the point of your illustration, but I get about a 2x speedup if I remove the “if (mask16 == 0)” conditional from your code to make it branchless.

    nate@skylake:~/git/despacer$ despacebenchmark
    sse4_despace(buffer, N): 0.818359 cycles / ops
    sse4_despace_branchless(buffer, N): 0.400391 cycles / ops

    As you’d expect, the advantage of the branchless approach grows if you increase the probability of spaces (.94 vs .40 at 2%), and decreases as spaces become rarer (.47 vs .40 at .5%).

    I wouldn’t assume that loop unrolling would have a significant effect here, although I also wouldn’t assume without looking that the compiler isn’t already doing it.

    It’s possible that you could get a further small boost by doing a lookup for the length to advance rather than a popcount, but I’m not sure what the tradeoff of latency vs instruction is going to be here.
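
    For reference, the branchless variant being discussed boils down to an inner step along these lines; this is a sketch only (the exact declaration of the despace_mask16 lookup table is assumed here, see the despacer repository for the real code):

    #include <nmmintrin.h> // SSE4.2 intrinsics and _mm_popcnt_u32

    // Precomputed shuffle-control table indexed by the 16-bit whitespace
    // mask; the declaration is an assumption, the real table lives in the
    // despacer repository.
    extern const unsigned char despace_mask16[1 << 16][16];

    // Branchless inner step: no "if (mask16 == 0)" early exit. Every
    // 16-byte block is classified, shuffled and stored, and the output
    // pointer advances by 16 minus the number of removed bytes.
    static inline char *despace_block16(const char *src, char *dst) {
      __m128i x = _mm_loadu_si128((const __m128i *)src);
      __m128i ws = _mm_or_si128(
          _mm_or_si128(_mm_cmpeq_epi8(x, _mm_set1_epi8(' ')),
                       _mm_cmpeq_epi8(x, _mm_set1_epi8('\n'))),
          _mm_cmpeq_epi8(x, _mm_set1_epi8('\r')));
      int mask16 = _mm_movemask_epi8(ws); // one bit per whitespace byte
      __m128i shuf = _mm_loadu_si128((const __m128i *)despace_mask16[mask16]);
      _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(x, shuf));
      return dst + 16 - _mm_popcnt_u32((unsigned)mask16);
    }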

    1. Nice demo, but why would it be foolish to expect it to be 16x faster?

      I prefer to be pessimistic so that I am rarely disappointed.

      It’s possible that you could get a further small boost by doing a lookup for the length to advance rather than a popcount, but I’m not sure what the tradeoff of latency vs instruction is going to be here.

      Yes. Good idea.

  3. Thomas A. Fine says:

    “It is going to be faster as long as most blocks of eight characters do not contain any white space.”

    This doesn’t seem very true for most data sets.

    Also, I wouldn’t spend all that effort on bitwise checks and register copies, and instead focus on your basic algorithm. You are copying every byte twice. (Unless you are relying on optimization to magically fix that, but that sort of dodges the whole point of coding faster algorithms, in my opinion.) A better approach would be to scan for the beginning and end of blocks of text, and then move the whole block with an efficient overlap-safe move operation.
    I did a simple comparison in perl, and scan and move was 20-25% faster than byte-at-a-time. There will certainly be language differences that may mean you get a different result in C, but these will mostly depend on hidden optimizations that shouldn’t matter for the purpose of finding the fastest algorithm.

    1. You are copying every byte twice.

      I am not sure what you mean? In the simple scalar code, I read each byte once, and I write most of them back.

      A better approach would be to scan for the beginning and end of blocks of text, and then move the whole block with an efficient overlap-safe move operation.

      I did consider an implementation based on memmove, but you will need many memmove calls in general.

      I did a simple comparison in perl, and scan and move was 20-25% faster than byte-at-a-time. There will certainly be language differences that may mean you get a different result in C, but these will mostly depend on hidden optimizations that shouldn’t matter for the purpose of finding the fastest algorithm.

      I’d very much like to see your code.
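
      For concreteness, the scan-and-move idea can be sketched in C along the following lines (a hypothetical rendering, not the commenter's code; the Perl version follows below): find each run of non-space characters and move the whole run once with memmove.

      #include <stddef.h>
      #include <string.h>

      // Sketch of "scan and move": locate runs of characters to keep and
      // move each run in one overlap-safe memmove call.
      size_t despace_blocks(char *bytes, size_t howmany) {
        size_t i = 0, pos = 0;
        while (i < howmany) {
          // skip white space
          while (i < howmany &&
                 (bytes[i] == ' ' || bytes[i] == '\r' || bytes[i] == '\n'))
            i++;
          size_t start = i;
          // find the end of the run to keep
          while (i < howmany &&
                 bytes[i] != ' ' && bytes[i] != '\r' && bytes[i] != '\n')
            i++;
          memmove(bytes + pos, bytes + start, i - start);
          pos += i - start;
        }
        return pos;
      }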

      1. Thomas A. Fine says:

        When I say you are copying every byte twice, I mean this:
        char c = bytes[i];

        bytes[pos++] = c;
        Explicitly the code is copying the byte somewhere else, and then copying it back. What is actually happening after optimization may well be a different story.

        Here’s the perl comparison:
        ----
        #!/usr/bin/perl

        $text=join(" ",@ARGV);

        $howmany=length($text);

        print "lemire: {", despace_lemire($text,$howmany), "} ";
        $t=time;
        for ($n=0; $n<1000000; ++$n) {
          despace_lemire($text,$howmany);
        }
        print time-$t, "\n";

        print "mine: {", despace_mine($text,$howmany), "} ";
        $t=time;
        for ($n=0; $n<1000000; ++$n) {
          despace_mine($text,$howmany);
        }
        print time-$t, "\n";

        sub despace_lemire {
          my $text=shift(@_);
          my $howmany=shift(@_);
          my $i;
          my $c;
          my $pos=0;
          for ($i=0; $i<$howmany; ++$i) {
            $c=substr($text,$i,1);
            next if ($c eq " ");
            substr($text,$pos++,1)=$c;
          }
          #truncate to new length (perl doesn't copy nulls)
          $text=substr($text,0,$pos);
          return($text);
        }

        sub despace_mine {
          my $text=shift(@_);
          my $howmany=shift(@_);
          my $i=0;
          my $c;
          my $from;
          my $to;

          #skip initial non-space section
          $i=0;
          while(substr($text,$i,1) ne " ") { ++$i; }

          $to=$i;
          #skip whitespace we found
          while(substr($text,$i,1) eq " ") { ++$i; }

          $from=$i;
          while($i<$howmany) {
            #skip the next non-space section
            while(substr($text,$i,1) ne " ") { last if (++$i >= $howmany); }

            #copy
            substr($text,$to,$i-$from) = substr($text,$from,$i-$from);
            $to += $i-$from;

            #skip whitespace
            while(substr($text,$i,1) eq " ") { last if (++$i >= $howmany); }

            $from=$i;
          }
          #truncate to new length (perl doesn't copy nulls)
          $text=substr($text,0,$to);
          return($text);
        }

        1. When I say you are copying every byte twice, I mean this:
          char c = bytes[i];
          …
          bytes[pos++] = c;
          Explicitly the code is copying the byte somewhere else, and then copying it back. What is actually happening after optimization may well be a different story.

          I would argue that in C, these are semantically equivalent…

          char c = bytes[i];
          bytes[pos] = c;

          and

          bytes[pos] = bytes[i];

          Regarding how this gets compiled… on all processors that I am familiar with, they both get compiled to two “move” instructions. You first have to retrieve “bytes[i]” to a register, then you need to write it to memory. There might be processors that support moving memory from RAM to RAM without going through a register, but I am not familiar with them.

          There are certainly instructions (on x64 processors) that work directly on memory, without loading the data into a register (the bit test instruction is one such instruction). However, it is not necessarily faster and can even be much slower. I believe that in a lot of important cases, the compiler will get it right.

          My understanding is that there is no such instruction on ARM processors; they always need to load the data from memory into a register. So there is a load, an operation and then a store.

          But let us focus on x64 processors… Is there any difference between how an x64 compiler will handle

          const uint64_t mask = 1111;
          a[0] &= mask;

          and

          a[0] &= 1111;

          ???

          Do you have any example where it produces different machine code?

  4. Darrell Wright says:

    I wonder if it would be faster to look at 32 bits at a time. At least in English, the bulk of the words are 3-5 letters long or shorter.

  5. Julien says:

    It seems you are not considering alignment.

    Try the following: allocate and memory align a buffer, memcpy the string into it and then do the SIMD job on the memory aligned string.

    (Consider how the cache works: if you try to move 64 bytes starting at address 63, it has to read not 64 bytes but 128 bytes, so be sure to align so that the starting address of the allocated buffer is evenly divisible by 64; do that by over-allocating and then moving the pointer appropriately.)
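
    A minimal sketch of that over-allocate-and-round-up suggestion (illustrative only; the caller must keep the raw pointer for free()):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    // Over-allocate, round the pointer up to a 64-byte boundary, and copy
    // the input there before running the despacer on the aligned copy.
    char *copy_to_64byte_boundary(const char *src, size_t n, char **raw_out) {
      char *raw = malloc(n + 63);
      if (raw == NULL) return NULL;
      char *aligned = (char *)(((uintptr_t)raw + 63) & ~(uintptr_t)63);
      memcpy(aligned, src, n);
      *raw_out = raw; // pass this pointer to free() when done
      return aligned;
    }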

    1. Please see Data alignment for speed: myth or reality? http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

      1. Julien says:

        Thanks, interesting!

        Not quite the improvement I thought it would be. Yet it definitely did result in an improvement on my machine (Intel(R) Core(TM) i7-3540M CPU @ 3.00GHz)

        align to: 3
        sse42_despace_branchless_lookup(buffer, N): 1.058594 cycles / ops
        align to: 7
        sse42_despace_branchless_lookup(buffer, N): 1.001953 cycles / ops
        align to: 63
        sse42_despace_branchless_lookup(buffer, N): 0.990234 cycles / ops
        align to: 2
        sse42_despace_branchless_lookup(buffer, N): 0.963867 cycles / ops
        align to: 8
        sse42_despace_branchless_lookup(buffer, N): 0.896484 cycles / ops
        align to: 16
        sse42_despace_branchless_lookup(buffer, N): 0.565430 cycles / ops
        align to: 32
        sse42_despace_branchless_lookup(buffer, N): 0.568359 cycles / ops
        align to: 64
        sse42_despace_branchless_lookup(buffer, N): 0.539062 cycles / ops
        align to: 128
        sse42_despace_branchless_lookup(buffer, N): 0.565430 cycles / ops
        align to: 256
        sse42_despace_branchless_lookup(buffer, N): 0.539062 cycles / ops

        1. Julien says:

          Let me clarify that. Aligning to 64 – you are completely right. Alignment being a “myth” – not so much, at least on my machine.

          1. Billy O'Neal says:

            Note that this answer will depend on the chip you use for testing. When I implemented a similar algorithm for UTF-8 to UTF-16 transformation, doing extra work to align things was moderately better on Haswell-E (Xeon E5-1650v3), but was no gain on Skylake-T (Core i7 6700T). In general, the newer chips seem to deal with unaligned access better than older ones.

        2. I modified my benchmark so that you can add an alignment offset…

          Results on my Skylake server…

          $ ./despacebenchmark
          pointer alignment = 16 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.673828 cycles / ops
          despace(buffer, N):  5.578125 cycles / ops
          faster_despace(buffer, N):  1.740234 cycles / ops
          despace64(buffer, N):  2.542969 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.585938 cycles / ops
          avx2_countspaces(buffer, N):  0.365234 cycles / ops
          avx2_despace(buffer, N):  3.593750 cycles / ops
          sse4_despace(buffer, N):  0.816406 cycles / ops
          sse4_despace_branchless(buffer, N):  0.386719 cycles / ops
          sse4_despace_trail(buffer, N):  1.539062 cycles / ops
          sse42_despace_branchless(buffer, N):  0.603516 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.675781 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.707031 cycles / ops
          

          $ ./despacebenchmark 1
          alignment offset = 1
          pointer alignment = 1 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.675781 cycles / ops
          despace(buffer, N):  5.576172 cycles / ops
          faster_despace(buffer, N):  1.693359 cycles / ops
          despace64(buffer, N):  2.589844 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.492188 cycles / ops
          avx2_countspaces(buffer, N):  0.373047 cycles / ops
          avx2_despace(buffer, N):  3.535156 cycles / ops
          sse4_despace(buffer, N):  0.841797 cycles / ops
          sse4_despace_branchless(buffer, N):  0.398438 cycles / ops
          sse4_despace_trail(buffer, N):  1.599609 cycles / ops
          sse42_despace_branchless(buffer, N):  0.605469 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.679688 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.718750 cycles / ops

          $ ./despacebenchmark 4
          alignment offset = 4
          pointer alignment = 4 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.679688 cycles / ops
          despace(buffer, N):  5.630859 cycles / ops
          faster_despace(buffer, N):  1.685547 cycles / ops
          despace64(buffer, N):  2.562500 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.523438 cycles / ops
          avx2_countspaces(buffer, N):  0.373047 cycles / ops
          avx2_despace(buffer, N):  3.603516 cycles / ops
          sse4_despace(buffer, N):  0.843750 cycles / ops
          sse4_despace_branchless(buffer, N):  0.400391 cycles / ops
          sse4_despace_trail(buffer, N):  1.552734 cycles / ops
          sse42_despace_branchless(buffer, N):  0.603516 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.681641 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.681641 cycles / ops

          Results on a much older Ivy Bridge (like yours):

          $ ./despacebenchmark
          pointer alignment = 16 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  14.519531 cycles / ops
          despace(buffer, N):  12.612305 cycles / ops
          faster_despace(buffer, N):  7.248047 cycles / ops
          despace64(buffer, N):  10.275391 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.271484 cycles / ops
          sse4_despace(buffer, N):  2.039062 cycles / ops
          sse4_despace_branchless(buffer, N):  1.081055 cycles / ops
          sse4_despace_trail(buffer, N):  4.904297 cycles / ops
          sse42_despace_branchless(buffer, N):  1.511719 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  1.554688 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.882812 cycles / ops
          

          $ ./despacebenchmark 1
          alignment offset = 1
          pointer alignment = 1 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  13.337891 cycles / ops
          despace(buffer, N):  12.699219 cycles / ops
          faster_despace(buffer, N):  7.292969 cycles / ops
          despace64(buffer, N):  10.285156 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.269531 cycles / ops
          sse4_despace(buffer, N):  2.214844 cycles / ops
          sse4_despace_branchless(buffer, N):  1.749023 cycles / ops
          sse4_despace_trail(buffer, N):  5.027344 cycles / ops
          sse42_despace_branchless(buffer, N):  2.109375 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  2.187500 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.890625 cycles / ops

          $ ./despacebenchmark 4
          alignment offset = 4
          pointer alignment = 4 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  14.507812 cycles / ops
          despace(buffer, N):  12.699219 cycles / ops
          faster_despace(buffer, N):  7.214844 cycles / ops
          despace64(buffer, N):  10.214844 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.289062 cycles / ops
          sse4_despace(buffer, N):  2.162109 cycles / ops
          sse4_despace_branchless(buffer, N):  1.757812 cycles / ops
          sse4_despace_trail(buffer, N):  5.044922 cycles / ops
          sse42_despace_branchless(buffer, N):  2.109375 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  2.302734 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.875000 cycles / ops

          There are ways to trigger performance issues due to alignment on recent x64 processors, but you almost have to go out of your way to do it.

  6. Julien says:

    It seems you are not memory aligning your buffers.

    Memory align a temporary buffer that you memcpy the input string to. After that do the operation on the temporary buffer.

    (Consider doing a 64-byte move starting at address 63. The CPU would have to read/write not 64 bytes but 2×64 bytes, since the address is not aligned.)

  7. Anton says:

    Interesting.
    I ran a despace program on a GTX 1080:
    https://gist.github.com/antonmks/17e0c711d41fb07d1b6f3ada3f5f29ee

    Seems like the GPU can do around 0.25 cycles/byte, which is better than the SIMD code but not quite as good as memcpy.
    The code also looks simpler than the SIMD code.

  8. Paolo says:

    In the naive implementation you copy even the bytes before the first space. They don’t need to be copied because they’re already there. Two loops could make it faster.

    1. @Paolo

      Can you write a faster version?
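
      For illustration, the two-loop variant being suggested might look like the sketch below (hypothetical code; whether it measures faster is exactly the open question):

      #include <stddef.h>

      // First loop: skip the prefix with no white space, those bytes are
      // already in place. Second loop: compact the rest as before.
      size_t despace_two_loops(char *bytes, size_t howmany) {
        size_t i = 0;
        while (i < howmany &&
               bytes[i] != ' ' && bytes[i] != '\r' && bytes[i] != '\n')
          i++; // nothing to copy so far
        size_t pos = i;
        for (; i < howmany; i++) {
          char c = bytes[i];
          if (c != ' ' && c != '\r' && c != '\n')
            bytes[pos++] = c;
        }
        return pos;
      }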

  9. exDM69 says:

    This is a completely memory bound problem and the “fast” version is not much faster (about 5%) when measured in wall clock time. Using rdtsc to measure CPU perf counters does not take waiting into account.

    The biggest positive benefit of writing code like this (good CPU use in a memory bound problem) is giving the CPU an opportunity to do some hyperthreading while the current CPU thread is waiting for memory.

    I wrote some more insights from my experiences in similar optimization tasks over at HN: https://news.ycombinator.com/item?id=13450245

    I added wall clock time measuring with clock_gettime http://pasteall.org/208511

    Here are the results:

    memcpy(tmpbuffer,buffer,N): 0.125130 cycles / ops 1451124749 nsec (1.451125 sec)
    countspaces(buffer, N): 3.710938 cycles / ops 1672809379 nsec (1.672809 sec)
    despace(buffer, N): 6.603971 cycles / ops 1606222818 nsec (1.606223 sec)
    faster_despace(buffer, N): 1.688668 cycles / ops 1547048647 nsec (1.547049 sec)
    despace64(buffer, N): 3.590088 cycles / ops 1602308741 nsec (1.602309 sec)
    despace_to(buffer, N, tmpbuffer): 6.306158 cycles / ops 1630737083 nsec (1.630737 sec)
    avx2_countspaces(buffer, N): 0.191219 cycles / ops 1442104811 nsec (1.442105 sec)
    avx2_despace(buffer, N): 5.754352 cycles / ops 1579367806 nsec (1.579368 sec)
    sse4_despace(buffer, N): 0.984717 cycles / ops 1486145973 nsec (1.486146 sec)
    sse4_despace_branchless(buffer, N): 0.339487 cycles / ops 1498728832 nsec (1.498729 sec)
    sse4_despace_trail(buffer, N): 1.958952 cycles / ops 1479234359 nsec (1.479234 sec)
    sse42_despace_branchless(buffer, N): 0.563217 cycles / ops 1534317675 nsec (1.534318 sec)
    sse42_despace_branchless_lookup(buffer, N): 0.624996 cycles / ops 1563285415 nsec (1.563285 sec)
    sse42_despace_to(buffer, N,tmpbuffer): 1.747602 cycles / ops 1525936772 nsec (1.525937 sec)
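
    For reference, the nanosecond figures above come from a CLOCK_MONOTONIC timer; a minimal sketch of that kind of measurement (the linked paste has the actual code):

    #include <stdint.h>
    #include <time.h>

    // Wall-clock timestamp in nanoseconds, as used for the numbers above.
    static uint64_t wallclock_ns(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }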

    1. This is a completely memory bound problem and the “fast” version is not much faster (about 5%) when measured in wall clock time. Using rdtsc to measure CPU perf counters does not take waiting into account.

      On recent Intel processors, rdtsc measures wall-clock time, though maybe more accurately due to its low overhead.

  10. Assuming this is using the built-in ^ operator, the condition haszero(xor1) ^ haszero(xor2) ^ haszero(xor3)) is true if and only if an odd number of the function results are true.

    Re the hard-to-explain zero checking expression, I think a reasonably short & simple way to explain it is to note that for each byte it checks whether that byte was non-negative, and became negative (sign bit set) by the subtraction of 1.
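
    For reference, the zero-byte test under discussion is the classic bit-twiddling expression; here is a sketch of the corrected check, with the three results combined by OR rather than XOR (constants broadcast ' ', '\r' and '\n'):

    #include <stdint.h>

    // A byte of v is zero exactly when it was non-negative (top bit clear,
    // so the top bit of ~v is set) and the subtraction of 1 borrowed into
    // its top bit.
    static inline uint64_t haszero(uint64_t v) {
      return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
    }

    static inline int has_whitespace(uint64_t word) {
      uint64_t xor1 = word ^ 0x2020202020202020ULL; // ' '
      uint64_t xor2 = word ^ 0x0D0D0D0D0D0D0D0DULL; // '\r'
      uint64_t xor3 = word ^ 0x0A0A0A0A0A0A0A0AULL; // '\n'
      return (haszero(xor1) | haszero(xor2) | haszero(xor3)) != 0;
    }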

    1. Assuming this is using the built-in ^ operator, the condition haszero(xor1) ^ haszero(xor2) ^ haszero(xor3)) is true if and only if an odd number of the function results are true.

      Thanks, I fixed that. It does not impact performance. In this case, my code passed sanity tests because of the statistics of my test.

      Re the hard-to-explain zero checking expression, I think a reasonably short & simple way to explain it is to note that for each byte it checks whether that byte was non-negative, and became negative (sign bit set) by the subtraction of 1.

      This makes sense. I didn’t want to bother with an explanation in the course of my blog post because *why* it works is not so important.

  11. Alex N says:

    I’m not at a proper computer atm. Can you try

    1. Load 8 bytes into xmm.
    2. Compare for equality with spaces, etc.
    3. Move the mask to a general purpose register.
    4. Negate bits in that register.
    5. Copy selected bits with PEXT (from bmi2 set) instruction.
    6. Use POPCNT to advance a pointer.

    Thanks,
    Alex

    1. Using PEXT is interesting. We have 16 bytes to deal with. PEXT only operates on 8 bytes. So we need two, with two pointer offsets. It might be fast, but it is not obviously faster than a SIMD-based approach, I think.

      1. Alex N says:

        One advantage of PEXT approach is that it doesn’t use a table.

        Alex

        1. @Alex PEXT requires a mask: you either generate it on the fly or you look it up. I think you envision generating it on the fly… I’d be interested in the approach you’d take.
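
          One way to build that mask on the fly, sketched here under the assumption of BMI2 support (hypothetical code, 8 bytes per iteration, not the commenter's approach): expand the 8-bit whitespace mask to a 64-bit byte mask with PDEP, then pack the kept bytes with PEXT.

          #include <immintrin.h> // _pdep_u64/_pext_u64 (compile with -mbmi2), SSE intrinsics
          #include <stdint.h>
          #include <string.h>

          size_t despace_pext(char *bytes, size_t howmany) {
            char *dst = bytes;
            size_t i = 0;
            for (; i + 8 <= howmany; i += 8) {
              uint64_t word;
              memcpy(&word, bytes + i, 8);
              __m128i v = _mm_cvtsi64_si128((long long)word);
              __m128i ws = _mm_or_si128(
                  _mm_or_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8(' ')),
                               _mm_cmpeq_epi8(v, _mm_set1_epi8('\n'))),
                  _mm_cmpeq_epi8(v, _mm_set1_epi8('\r')));
              // one bit per whitespace byte
              uint32_t m8 = (uint32_t)_mm_movemask_epi8(ws) & 0xFF;
              // expand the 8-bit keep mask into a 64-bit byte mask (0xFF per kept byte)
              uint64_t keep = _pdep_u64(~m8 & 0xFF, 0x0101010101010101ULL) * 0xFF;
              uint64_t packed = _pext_u64(word, keep); // kept bytes, packed to the low end
              memcpy(dst, &packed, 8); // safe: dst + 8 never passes bytes + howmany
              dst += 8 - _mm_popcnt_u32(m8);
            }
            for (; i < howmany; i++) { // scalar tail
              char c = bytes[i];
              if (c != ' ' && c != '\r' && c != '\n') *dst++ = c;
            }
            return (size_t)(dst - bytes);
          }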

  12. aqrit says:

    What is the performance of a simple sorting network
    using SIMD?

    Would it be better to define the problem as “shift all the whitespace to one side of the vector”?

    1. I am not sure which algorithm you have in mind. Sorting networks can be implemented just fine using SIMD, but I am not sure it helps for this problem.

  13. There’s an error in `_mm_loadu_si128(despace_mask16 + mask16)` which has been fixed in the repo. Please update the blog post; this threw me for a loop.

    To define a proper `__m128i` array, use the `_mm_set_epi8` macro which expands to a suitable constant initializer.

    Anyway, that table is huge, 256 KiB. It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

    1. There’s an error in `_mm_loadu_si128(despace_mask16 + mask16)` which has been fixed in the repo. Please update the blog post; this threw me for a loop.

      I don’t think that there is an error; if there is, please point it out more precisely. In the blog post, I omit the cast which is present in the GitHub repository, but it is a technical detail that despace_mask16 is not of type __m128i * (that is, you can rewrite the code so that it is).

      To define a proper `__m128i` array, use the `_mm_set_epi8` macro which expands to a suitable constant initializer.

      In C, _mm_set_epi8 is not a compile-time constant so you cannot do "__m128i array[] = {_mm_set1_epi8(0)};". You must be thinking of C++.

      It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

      I think that the table is 1MB.

    2. Anyway, that table is huge. It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

      Please see the GitHub repository, function sse4_despace_branchless_mask8. It is a tad slower. There is an extra load, an extra mask, an extra shift, and an extra XOR. All of these things appear to add up.

  14. Billy O'Neal says:

    What is / can you explain `despace_mask16`? I don’t understand the bit twiddling hack going on here to give the right constant to the shuffle unit.

    1. @Billy

      What is / can you explain `despace_mask16`?

      We are using the _mm_shuffle_epi8 intrinsic to move the bytes around so that flagged characters are omitted. This intrinsic takes an input register “a” as well as a control mask “b” and it outputs a new vector
      (a[b[0]], a[b[1]], a[b[2]], a[b[3]], …). Computing the mask “b” on the fly is hard given the locations of the characters, so we use a look-up table.

      It is not pretty.
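
       For the curious, a table of this kind can be generated along the following lines (a sketch; the actual generator in the despacer repository may differ): for each 16-bit whitespace mask, list the indices of the bytes to keep and pad the rest.

       #include <stdint.h>

       // 65536 entries of 16 bytes each, about 1 MB in total.
       static uint8_t shuffle_table[1 << 16][16];

       static void build_table(void) {
         for (unsigned mask = 0; mask < (1u << 16); mask++) {
           int out = 0;
           for (int i = 0; i < 16; i++)
             if (!(mask & (1u << i)))         // bit i set means byte i is white space
               shuffle_table[mask][out++] = (uint8_t)i;
           for (; out < 16; out++)
             shuffle_table[mask][out] = 0x80; // high bit set: _mm_shuffle_epi8 writes zero
         }
       }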

      1. Billy O'Neal says:

        >so we use a look-up table

        Derp on my part — for some reason I read that initially as supplying that constant to _mm_shuffle_epi8 — I missed the load instruction. Lookup table makes sense, thanks!

  15. Eric says:

    Where did `despace_mask16` come from there? Surely that’s not an intel builtin!

    1. I don’t give the full code within the blog post, you can find it on GitHub:

      https://github.com/lemire/despacer

  16. Brad Daniels says:

    It looks to me like the SIMD code won’t copy the last few bytes if the size is not a multiple of 16. Am I missing something?

    1. In my blog post, I provide only the gist of the code, please see the GitHub repo for actual working code.

  17. Thanks, this is a pretty nice post. I used your examples to see the impact of mispredicted branches in Python (*): I was somewhat surprised by their impact on this short piece of code. Thanks again!

    (*) Here the link to my python notebook:
    https://github.com/stegua/MyBlogEntries/blob/master/Remove%2Bblanks%2Bfrom%2Ba%2Bstring.ipynb

  18. Peter says:

    I am stating the obvious, but the “Leffmann” approach is not doing the same thing as the original.

    It depends on an unseen pre-processing stage that correctly fills an array with 1s and 0s. I expect that in practice it will in fact be slower when this stage is also accounted for.