Daniel Lemire's blog


How quickly can you remove spaces from a string?

40 thoughts on “How quickly can you remove spaces from a string?”

  1. John says:

    I’m interested in what the benchmark is for this

    size_t despace(char * bytes, size_t howmany) {
      char * src = bytes;
      char * dest = bytes;

      for (size_t i = 0; i < howmany; i++) {
        char c = *src++;

        if (c == '\r' || c == '\n' || c == ' ') {
          continue;
        }
        *dest++ = c;
      }
      return dest - bytes;
    }
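
    For context, cycles-per-byte figures like the ones quoted in this thread are typically obtained with a harness along these lines; this is only a sketch (function names and structure are illustrative, not the actual despacer benchmark), timing each call with the time-stamp counter and dividing by the number of processed bytes:

    #include <stdint.h>
    #include <string.h>
    #include <x86intrin.h> // __rdtsc

    // Sketch of a cycles-per-byte harness (illustrative, not the actual
    // despacer benchmark): refill the buffer, time one call with rdtsc,
    // keep the best run, and divide by the number of bytes processed.
    static double cycles_per_byte(size_t (*fn)(char *, size_t),
                                  char *buffer, const char *input, size_t n,
                                  int repeat) {
      uint64_t best = UINT64_MAX;
      for (int r = 0; r < repeat; r++) {
        memcpy(buffer, input, n); // the despace functions work in place
        uint64_t start = __rdtsc();
        fn(buffer, n);
        uint64_t end = __rdtsc();
        if (end - start < best) best = end - start;
      }
      return (double)best / (double)n;
    }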

  2. Nathan Kurz says:

    Nice demo, but why would it be foolish to expect it to be 16x faster? Have you already lost faith in the power of magic vector dust? Relative to memcpy, what do you think the fundamental limit should be here?

    I may be doing something wrong, and I realize it isn’t exactly the point of your illustration, but I get about a 2x speedup if I remove the “if (mask16 == 0)” conditional from your code to make it branchless.

    nate@skylake:~/git/despacer$ despacebenchmark
    sse4_despace(buffer, N): 0.818359 cycles / ops
    sse4_despace_branchless(buffer, N): 0.400391 cycles / ops

    As you’d expect, the advantage of the branchless approach grows if you increase the probability of spaces (.94 vs .40 at 2%), and decreases as spaces become rarer (.47 vs .40 at .5%).

    I wouldn’t assume that loop unrolling would have a significant effect here, although I also wouldn’t assume without looking that the compiler isn’t already doing it.

    It’s possible that you could get a further small boost by doing a lookup for the length to advance rather than a popcount, but I’m not sure what the tradeoff of latency vs instruction is going to be here.
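
    For reference, the branchless variant being discussed boils down to an inner step along these lines; this is a sketch only (the exact declaration of the despace_mask16 lookup table is assumed here, see the despacer repository for the real code):

    #include <nmmintrin.h> // SSE4.2 intrinsics and _mm_popcnt_u32

    // Precomputed shuffle-control table indexed by the 16-bit whitespace
    // mask; the declaration is an assumption, the real table lives in the
    // despacer repository.
    extern const unsigned char despace_mask16[1 << 16][16];

    // Branchless inner step: no "if (mask16 == 0)" early exit. Every
    // 16-byte block is classified, shuffled and stored, and the output
    // pointer advances by 16 minus the number of removed bytes.
    static inline char *despace_block16(const char *src, char *dst) {
      __m128i x = _mm_loadu_si128((const __m128i *)src);
      __m128i ws = _mm_or_si128(
          _mm_or_si128(_mm_cmpeq_epi8(x, _mm_set1_epi8(' ')),
                       _mm_cmpeq_epi8(x, _mm_set1_epi8('\n'))),
          _mm_cmpeq_epi8(x, _mm_set1_epi8('\r')));
      int mask16 = _mm_movemask_epi8(ws); // one bit per whitespace byte
      __m128i shuf = _mm_loadu_si128((const __m128i *)despace_mask16[mask16]);
      _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(x, shuf));
      return dst + 16 - _mm_popcnt_u32((unsigned)mask16);
    }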

    1. Nice demo, but why would it be foolish to expect it to be 16x faster?

      I prefer to be pessimistic so that I am rarely disappointed.

      It’s possible that you could get a further small boost by doing a lookup for the length to advance rather than a popcount, but I’m not sure what the tradeoff of latency vs instruction is going to be here.

      Yes. Good idea.

  3. Thomas A. Fine says:

    “It is going to be faster as long as most blocks of eight characters do not contain any white space.”

    This doesn’t seem very true for most data sets.

    Also, I wouldn’t spend all that effort on bitwise checks and register copies, and instead focus on your basic algorithm. You are copying every byte twice. (Unless you are relying on optimization to magically fix that, but that sort of dodges the whole point of coding faster algorithms, in my opinion.) A better approach would be to scan for the beginning and end of blocks of text, and then move the whole block with an efficient overlap-safe move operation.
    I did a simple comparison in perl, and scan and move was 20-25% faster than byte-at-a-time. There will certainly be language differences that may mean you get a different result in C, but these will mostly depend on hidden optimizations that shouldn’t matter for the purpose of finding the fastest algorithm.

    1. You are copying every byte twice.

      I am not sure what you mean? In the simple scalar code, I read each byte once, and I write most of them back.

      A better approach would be to scan for the beginning and end of blocks of text, and then move the whole block with an efficient overlap-safe move operation.

      I did consider an implementation based on memmove, but you will need many memmove calls in general.

      I did a simple comparison in perl, and scan and move was 20-25% faster than byte-at-a-time. There will certainly be language differences that may mean you get a different result in C, but these will mostly depend on hidden optimizations that shouldn’t matter for the purpose of finding the fastest algorithm.

      I’d very much like to see your code.
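
      For concreteness, the scan-and-move idea can be sketched in C along the following lines (a hypothetical rendering, not the commenter's code; the Perl version follows below): find each run of non-space characters and move the whole run once with memmove.

      #include <stddef.h>
      #include <string.h>

      // Sketch of "scan and move": locate runs of characters to keep and
      // move each run in one overlap-safe memmove call.
      size_t despace_blocks(char *bytes, size_t howmany) {
        size_t i = 0, pos = 0;
        while (i < howmany) {
          // skip white space
          while (i < howmany &&
                 (bytes[i] == ' ' || bytes[i] == '\r' || bytes[i] == '\n'))
            i++;
          size_t start = i;
          // find the end of the run to keep
          while (i < howmany &&
                 bytes[i] != ' ' && bytes[i] != '\r' && bytes[i] != '\n')
            i++;
          memmove(bytes + pos, bytes + start, i - start);
          pos += i - start;
        }
        return pos;
      }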

      1. Thomas A. Fine says:

        When I say you are copying every byte twice, I mean this:
        char c = bytes[i];

        bytes[pos++] = c;
        Explicitly the code is copying the byte somewhere else, and then copying it back. What is actually happening after optimization may well be a different story.

        Here’s the perl comparison:
        ----
        #!/usr/bin/perl

        $text=join(" ",@ARGV);

        $howmany=length($text);

        print "lemire: {", despace_lemire($text,$howmany), "} ";
        $t=time;
        for ($n=0; $n<1000000; ++$n) {
          despace_lemire($text,$howmany);
        }
        print time-$t, "\n";

        print "mine: {", despace_mine($text,$howmany), "} ";
        $t=time;
        for ($n=0; $n<1000000; ++$n) {
          despace_mine($text,$howmany);
        }
        print time-$t, "\n";

        sub despace_lemire {
          my $text=shift(@_);
          my $howmany=shift(@_);
          my $i;
          my $c;
          my $pos=0;
          for ($i=0; $i<$howmany; ++$i) {
            $c=substr($text,$i,1);
            next if ($c eq " ");
            substr($text,$pos++,1)=$c;
          }
          #truncate to new length (perl doesn't copy nulls)
          $text=substr($text,0,$pos);
          return($text);
        }

        sub despace_mine {
          my $text=shift(@_);
          my $howmany=shift(@_);
          my $i=0;
          my $c;
          my $from;
          my $to;

          #skip initial non-space section
          $i=0;
          while(substr($text,$i,1) ne " ") { ++$i; }

          $to=$i;
          #skip whitespace we found
          while(substr($text,$i,1) eq " ") { ++$i; }

          $from=$i;
          while($i<$howmany) {
            #skip the next non-space section
            while(substr($text,$i,1) ne " ") { last if (++$i >= $howmany); }

            #copy
            substr($text,$to,$i-$from) = substr($text,$from,$i-$from);
            $to += $i-$from;

            #skip whitespace
            while(substr($text,$i,1) eq " ") { last if (++$i >= $howmany); }

            $from=$i;
          }
          #truncate to new length (perl doesn't copy nulls)
          $text=substr($text,0,$to);
          return($text);
        }

        1. When I say you are copying every byte twice, I mean this:
          char c = bytes[i];
          …
          bytes[pos++] = c;
          Explicitly the code is copying the byte somewhere else, and then copying it back. What is actually happening after optimization may well be a different story.

          I would argue that in C, these are semantically equivalent…

          char c = bytes[i];
          bytes[pos] = c;

          and

          bytes[pos] = bytes[i];

          Regarding how this gets compiled… on all processors that I am familiar with, they both get compiled to two “move” instructions. You first have to retrieve “bytes[i]” to a register, then you need to write it to memory. There might be processors that support moving memory from RAM to RAM without going through a register, but I am not familiar with them.

          There are certainly instructions (on x64 processors) that work directly on memory, without loading the data into a register (the bit test instruction is one such instruction). However, it is not necessarily faster and can even be much slower. I believe that in a lot of important cases, the compiler will get it right.

          My understanding is that there is no such instruction on ARM processors; they always need to load the data from memory into a register. So there is a load, an operation and then a store.

          But let us focus on x64 processors… Is there any difference between how an x64 compiler will handle

          const uint64_t mask = 1111;
          a[0] &= mask;

          and

          a[0] &= 1111;

          ???

          Do you have any example where it produces different machine code?

  4. Darrell Wright says:

    I wonder if it would be faster to look at 32 bits at a time. At least in English, the bulk of the words are 3-5 letters long or shorter.

  5. Julien says:

    It seems you are not considering alignment.

    Try the following: allocate and memory align a buffer, memcpy the string into it and then do the SIMD job on the memory aligned string.

    (Consider how the cache works: if you try to move 64 bytes starting at address 63, it has to read not 64 bytes but 128 bytes, so be sure to align so that the starting address of the allocated buffer is evenly divisible by 64; do that by over-allocating and then moving the pointer appropriately.)
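
    A minimal sketch of that over-allocate-and-round-up suggestion (illustrative only; the caller must keep the raw pointer for free()):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    // Over-allocate, round the pointer up to a 64-byte boundary, and copy
    // the input there before running the despacer on the aligned copy.
    char *copy_to_64byte_boundary(const char *src, size_t n, char **raw_out) {
      char *raw = malloc(n + 63);
      if (raw == NULL) return NULL;
      char *aligned = (char *)(((uintptr_t)raw + 63) & ~(uintptr_t)63);
      memcpy(aligned, src, n);
      *raw_out = raw; // pass this pointer to free() when done
      return aligned;
    }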

    1. Please see Data alignment for speed: myth or reality? http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

      1. Julien says:

        Thanks, interesting!

        Not quite the improvement I thought it would be. Yet it definitely did result in an improvement on my machine (Intel(R) Core(TM) i7-3540M CPU @ 3.00GHz)

        align to: 3
        sse42_despace_branchless_lookup(buffer, N): 1.058594 cycles / ops
        align to: 7
        sse42_despace_branchless_lookup(buffer, N): 1.001953 cycles / ops
        align to: 63
        sse42_despace_branchless_lookup(buffer, N): 0.990234 cycles / ops
        align to: 2
        sse42_despace_branchless_lookup(buffer, N): 0.963867 cycles / ops
        align to: 8
        sse42_despace_branchless_lookup(buffer, N): 0.896484 cycles / ops
        align to: 16
        sse42_despace_branchless_lookup(buffer, N): 0.565430 cycles / ops
        align to: 32
        sse42_despace_branchless_lookup(buffer, N): 0.568359 cycles / ops
        align to: 64
        sse42_despace_branchless_lookup(buffer, N): 0.539062 cycles / ops
        align to: 128
        sse42_despace_branchless_lookup(buffer, N): 0.565430 cycles / ops
        align to: 256
        sse42_despace_branchless_lookup(buffer, N): 0.539062 cycles / ops

        1. Julien says:

          Let me clarify that. Aligning to 64 – you are completely right. Alignment being a “myth” – not so much, at least on my machine.

          1. Billy O'Neal says:

            Note that this answer will depend on the chip you use for testing. When I implemented a similar algorithm for UTF-8 to UTF-16 transformation, doing extra work to align things was moderately better on Haswell-E (Xeon E5-1650v3), but was no gain on Skylake-T (Core i7 6700T). In general, the newer chips seem to deal with unaligned access better than older ones.

        2. I modified my benchmark so that you can add an alignment offset…

          Results on my Skylake server…

          $ ./despacebenchmark
          pointer alignment = 16 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.673828 cycles / ops
          despace(buffer, N):  5.578125 cycles / ops
          faster_despace(buffer, N):  1.740234 cycles / ops
          despace64(buffer, N):  2.542969 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.585938 cycles / ops
          avx2_countspaces(buffer, N):  0.365234 cycles / ops
          avx2_despace(buffer, N):  3.593750 cycles / ops
          sse4_despace(buffer, N):  0.816406 cycles / ops
          sse4_despace_branchless(buffer, N):  0.386719 cycles / ops
          sse4_despace_trail(buffer, N):  1.539062 cycles / ops
          sse42_despace_branchless(buffer, N):  0.603516 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.675781 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.707031 cycles / ops
          

          $ ./despacebenchmark 1
          alignment offset = 1
          pointer alignment = 1 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.675781 cycles / ops
          despace(buffer, N):  5.576172 cycles / ops
          faster_despace(buffer, N):  1.693359 cycles / ops
          despace64(buffer, N):  2.589844 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.492188 cycles / ops
          avx2_countspaces(buffer, N):  0.373047 cycles / ops
          avx2_despace(buffer, N):  3.535156 cycles / ops
          sse4_despace(buffer, N):  0.841797 cycles / ops
          sse4_despace_branchless(buffer, N):  0.398438 cycles / ops
          sse4_despace_trail(buffer, N):  1.599609 cycles / ops
          sse42_despace_branchless(buffer, N):  0.605469 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.679688 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.718750 cycles / ops

          $ ./despacebenchmark 4
          alignment offset = 4
          pointer alignment = 4 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.679688 cycles / ops
          despace(buffer, N):  5.630859 cycles / ops
          faster_despace(buffer, N):  1.685547 cycles / ops
          despace64(buffer, N):  2.562500 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.523438 cycles / ops
          avx2_countspaces(buffer, N):  0.373047 cycles / ops
          avx2_despace(buffer, N):  3.603516 cycles / ops
          sse4_despace(buffer, N):  0.843750 cycles / ops
          sse4_despace_branchless(buffer, N):  0.400391 cycles / ops
          sse4_despace_trail(buffer, N):  1.552734 cycles / ops
          sse42_despace_branchless(buffer, N):  0.603516 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.681641 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.681641 cycles / ops

          Results on a much older Ivy Bridge (like yours):

          $ ./despacebenchmark
          pointer alignment = 16 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  14.519531 cycles / ops
          despace(buffer, N):  12.612305 cycles / ops
          faster_despace(buffer, N):  7.248047 cycles / ops
          despace64(buffer, N):  10.275391 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.271484 cycles / ops
          sse4_despace(buffer, N):  2.039062 cycles / ops
          sse4_despace_branchless(buffer, N):  1.081055 cycles / ops
          sse4_despace_trail(buffer, N):  4.904297 cycles / ops
          sse42_despace_branchless(buffer, N):  1.511719 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  1.554688 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.882812 cycles / ops
          

          $ ./despacebenchmark 1
          alignment offset = 1
          pointer alignment = 1 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  13.337891 cycles / ops
          despace(buffer, N):  12.699219 cycles / ops
          faster_despace(buffer, N):  7.292969 cycles / ops
          despace64(buffer, N):  10.285156 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.269531 cycles / ops
          sse4_despace(buffer, N):  2.214844 cycles / ops
          sse4_despace_branchless(buffer, N):  1.749023 cycles / ops
          sse4_despace_trail(buffer, N):  5.027344 cycles / ops
          sse42_despace_branchless(buffer, N):  2.109375 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  2.187500 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.890625 cycles / ops

          $ ./despacebenchmark 4
          alignment offset = 4
          pointer alignment = 4 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  14.507812 cycles / ops
          despace(buffer, N):  12.699219 cycles / ops
          faster_despace(buffer, N):  7.214844 cycles / ops
          despace64(buffer, N):  10.214844 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.289062 cycles / ops
          sse4_despace(buffer, N):  2.162109 cycles / ops
          sse4_despace_branchless(buffer, N):  1.757812 cycles / ops
          sse4_despace_trail(buffer, N):  5.044922 cycles / ops
          sse42_despace_branchless(buffer, N):  2.109375 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  2.302734 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.875000 cycles / ops

          There are ways to trigger performance issues due to alignment on recent x64 processors, but you almost have to go out of your way to do it.

  6. Julien says:

    It seems you are not memory aligning your buffers.

    Memory align a temporary buffer that you memcpy the input string to. After that do the operation on the temporary buffer.

    (Consider doing a 64-byte move starting at address 63. The CPU would have to read/write not 64 bytes but 2×64 bytes, since the address is not aligned.)

  7. Anton says:

    Interesting.
    I ran a despace program on a GTX 1080:
    https://gist.github.com/antonmks/17e0c711d41fb07d1b6f3ada3f5f29ee

    Seems like the GPU can do around 0.25 cycles/byte, which is better than the SIMD code but not quite as good as memcpy.
    The code also looks simpler than the SIMD code.

  8. Paolo says:

    In the naive implementation you copy even the bytes before the first space. They don’t need to be copied because they’re already there. Two loops could make it faster.

    1. @Paolo

      Can you write a faster version?
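
      For illustration, the two-loop variant being suggested might look like the sketch below (hypothetical code; whether it measures faster is exactly the open question):

      #include <stddef.h>

      // First loop: skip the prefix with no white space, those bytes are
      // already in place. Second loop: compact the rest as before.
      size_t despace_two_loops(char *bytes, size_t howmany) {
        size_t i = 0;
        while (i < howmany &&
               bytes[i] != ' ' && bytes[i] != '\r' && bytes[i] != '\n')
          i++; // nothing to copy so far
        size_t pos = i;
        for (; i < howmany; i++) {
          char c = bytes[i];
          if (c != ' ' && c != '\r' && c != '\n')
            bytes[pos++] = c;
        }
        return pos;
      }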

  9. exDM69 says:

    This is a completely memory bound problem and the “fast” version is not much faster (about 5%) when measured in wall clock time. Using rdtsc to measure CPU perf counters does not take waiting into account.

    The biggest positive benefit of writing code like this (good CPU use in a memory bound problem) is giving the CPU an opportunity to do some hyperthreading while the current CPU thread is waiting for memory.

    I wrote some more insights from my experiences in similar optimization tasks over at HN: https://news.ycombinator.com/item?id=13450245

    I added wall clock time measuring with clock_gettime http://pasteall.org/208511

    Here are the results:

    memcpy(tmpbuffer,buffer,N): 0.125130 cycles / ops 1451124749 nsec (1.451125 sec)
    countspaces(buffer, N): 3.710938 cycles / ops 1672809379 nsec (1.672809 sec)
    despace(buffer, N): 6.603971 cycles / ops 1606222818 nsec (1.606223 sec)
    faster_despace(buffer, N): 1.688668 cycles / ops 1547048647 nsec (1.547049 sec)
    despace64(buffer, N): 3.590088 cycles / ops 1602308741 nsec (1.602309 sec)
    despace_to(buffer, N, tmpbuffer): 6.306158 cycles / ops 1630737083 nsec (1.630737 sec)
    avx2_countspaces(buffer, N): 0.191219 cycles / ops 1442104811 nsec (1.442105 sec)
    avx2_despace(buffer, N): 5.754352 cycles / ops 1579367806 nsec (1.579368 sec)
    sse4_despace(buffer, N): 0.984717 cycles / ops 1486145973 nsec (1.486146 sec)
    sse4_despace_branchless(buffer, N): 0.339487 cycles / ops 1498728832 nsec (1.498729 sec)
    sse4_despace_trail(buffer, N): 1.958952 cycles / ops 1479234359 nsec (1.479234 sec)
    sse42_despace_branchless(buffer, N): 0.563217 cycles / ops 1534317675 nsec (1.534318 sec)
    sse42_despace_branchless_lookup(buffer, N): 0.624996 cycles / ops 1563285415 nsec (1.563285 sec)
    sse42_despace_to(buffer, N,tmpbuffer): 1.747602 cycles / ops 1525936772 nsec (1.525937 sec)
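
    For reference, the nanosecond figures above come from a CLOCK_MONOTONIC timer; a minimal sketch of that kind of measurement (the linked paste has the actual code):

    #include <stdint.h>
    #include <time.h>

    // Wall-clock timestamp in nanoseconds, as used for the numbers above.
    static uint64_t wallclock_ns(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }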

    1. This is a completely memory bound problem and the “fast” version is not much faster (about 5%) when measured in wall clock time. Using rdtsc to measure CPU perf counters does not take waiting into account.

      On recent Intel processors, rdtsc measures wall-clock time, though maybe more accurately due to its low overhead.

  10. Assuming this is using the built-in ^ operator, the condition haszero(xor1) ^ haszero(xor2) ^ haszero(xor3)) is true if and only if an odd number of the function results are true.

    Re the hard-to-explain zero checking expression, I think a reasonably short & simple way to explain it is to note that for each byte it checks whether that byte was non-negative, and became negative (sign bit set) by the subtraction of 1.
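
    For reference, the zero-byte test under discussion is the classic bit-twiddling expression; here is a sketch of the corrected check, with the three results combined by OR rather than XOR (constants broadcast ' ', '\r' and '\n'):

    #include <stdint.h>

    // A byte of v is zero exactly when it was non-negative (top bit clear,
    // so the top bit of ~v is set) and the subtraction of 1 borrowed into
    // its top bit.
    static inline uint64_t haszero(uint64_t v) {
      return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
    }

    static inline int has_whitespace(uint64_t word) {
      uint64_t xor1 = word ^ 0x2020202020202020ULL; // ' '
      uint64_t xor2 = word ^ 0x0D0D0D0D0D0D0D0DULL; // '\r'
      uint64_t xor3 = word ^ 0x0A0A0A0A0A0A0A0AULL; // '\n'
      return (haszero(xor1) | haszero(xor2) | haszero(xor3)) != 0;
    }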

    1. Assuming this is using the built-in ^ operator, the condition haszero(xor1) ^ haszero(xor2) ^ haszero(xor3)) is true if and only if an odd number of the function results are true.

      Thanks, I fixed that. It does not impact performance. In this case, my code passed sanity tests because of the statistics of my test.

      Re the hard-to-explain zero checking expression, I think a reasonably short & simple way to explain it is to note that for each byte it checks whether that byte was non-negative, and became negative (sign bit set) by the subtraction of 1.

      This makes sense. I didn’t want to bother with an explanation in the course of my blog post because *why* it works is not so important.

  11. Alex N says:

    I’m not at a proper computer atm. Can you try

    1. Load 8 bytes into xmm.
    2. Compare for equality with spaces, etc.
    3. Move the mask to a general purpose register.
    4. Negate bits in that register.
    5. Copy selected bits with PEXT (from bmi2 set) instruction.
    6. Use POPCNT to advance a pointer.

    Thanks,
    Alex

    1. Using PEXT is interesting. We have 16 bytes to deal with. PEXT only operates on 8 bytes. So we need two, with two pointer offsets. It might be fast, but it is not obviously faster than a SIMD-based approach, I think.

      1. Alex N says:

        One advantage of PEXT approach is that it doesn’t use a table.

        Alex

        1. @Alex PEXT requires a mask: you either generate it on the fly or you look it up. I think you envision generating it on the fly… I’d be interested in the approach you’d take.
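
          One way to build that mask on the fly, sketched here under the assumption of BMI2 support (hypothetical code, 8 bytes per iteration, not the commenter's approach): expand the 8-bit whitespace mask to a 64-bit byte mask with PDEP, then pack the kept bytes with PEXT.

          #include <immintrin.h> // _pdep_u64/_pext_u64 (compile with -mbmi2), SSE intrinsics
          #include <stdint.h>
          #include <string.h>

          size_t despace_pext(char *bytes, size_t howmany) {
            char *dst = bytes;
            size_t i = 0;
            for (; i + 8 <= howmany; i += 8) {
              uint64_t word;
              memcpy(&word, bytes + i, 8);
              __m128i v = _mm_cvtsi64_si128((long long)word);
              __m128i ws = _mm_or_si128(
                  _mm_or_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8(' ')),
                               _mm_cmpeq_epi8(v, _mm_set1_epi8('\n'))),
                  _mm_cmpeq_epi8(v, _mm_set1_epi8('\r')));
              // one bit per whitespace byte
              uint32_t m8 = (uint32_t)_mm_movemask_epi8(ws) & 0xFF;
              // expand the 8-bit keep mask into a 64-bit byte mask (0xFF per kept byte)
              uint64_t keep = _pdep_u64(~m8 & 0xFF, 0x0101010101010101ULL) * 0xFF;
              uint64_t packed = _pext_u64(word, keep); // kept bytes, packed to the low end
              memcpy(dst, &packed, 8); // safe: dst + 8 never passes bytes + howmany
              dst += 8 - _mm_popcnt_u32(m8);
            }
            for (; i < howmany; i++) { // scalar tail
              char c = bytes[i];
              if (c != ' ' && c != '\r' && c != '\n') *dst++ = c;
            }
            return (size_t)(dst - bytes);
          }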

  12. aqrit says:

    What is the performance of a simple sorting network
    using SIMD?

    Would it be better to define the problem as “shift all the whitespace to one side of the vector”?

    1. I am not sure which algorithm you have in mind. Sorting networks can be implemented just fine using SIMD, but I am not sure it helps for this problem.

  13. There’s an error in `_mm_loadu_si128(despace_mask16 + mask16)` which has been fixed in the repo. Please update the blog post; this threw me for a loop.

    To define a proper `__m128i` array, use the `_mm_set_epi8` macro which expands to a suitable constant initializer.

    Anyway, that table is huge, 256 KiB. It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

    1. There’s an error in `_mm_loadu_si128(despace_mask16 + mask16)` which has been fixed in the repo. Please update the blog post; this threw me for a loop.

      I don’t think that there is an error; if there is, please point it out more precisely. In the blog post, I omit the cast which is present in the GitHub repository, but it is a technical detail that despace_mask16 is not of type __m128i * (that is, you can rewrite the code so that it is).

      To define a proper `__m128i` array, use the `_mm_set_epi8` macro which expands to a suitable constant initializer.

      In C, _mm_set_epi8 is not a compile-time constant so you cannot do "__m128i array[] = {_mm_set1_epi8(0)};". You must be thinking of C++.

      It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

      I think that the table is 1MB.

    2. Anyway, that table is huge. It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

      Please see the GitHub repository, function sse4_despace_branchless_mask8. It is a tad slower. There is an extra load, an extra mask, an extra shift, and an extra XOR. All of these things appear to add up.

  14. Billy O'Neal says:

    What is / can you explain `despace_mask16`? I don’t understand the bit twiddling hack going on here to give the right constant to the shuffle unit.

    1. @Billy

      What is / can you explain `despace_mask16`?

      We are using the _mm_shuffle_epi8 intrinsic to move the bytes around so that flagged characters are omitted. This intrinsic takes an input register “a” as well as a control mask “b” and it outputs a new vector
      (a[b[0]], a[b[1]], a[b[2]], a[b[3]], …). Computing the mask “b” on the fly is hard given the locations of the characters, so we use a look-up table.

      It is not pretty.
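
       For the curious, a table of this kind can be generated along the following lines (a sketch; the actual generator in the despacer repository may differ): for each 16-bit whitespace mask, list the indices of the bytes to keep and pad the rest.

       #include <stdint.h>

       // 65536 entries of 16 bytes each, about 1 MB in total.
       static uint8_t shuffle_table[1 << 16][16];

       static void build_table(void) {
         for (unsigned mask = 0; mask < (1u << 16); mask++) {
           int out = 0;
           for (int i = 0; i < 16; i++)
             if (!(mask & (1u << i)))         // bit i set means byte i is white space
               shuffle_table[mask][out++] = (uint8_t)i;
           for (; out < 16; out++)
             shuffle_table[mask][out] = 0x80; // high bit set: _mm_shuffle_epi8 writes zero
         }
       }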

      1. Billy O'Neal says:

        >so we use a look-up table

        Derp on my part — for some reason I read that initially as supplying that constant to _mm_shuffle_epi8 — I missed the load instruction. Lookup table makes sense, thanks!

  15. Eric says:

    Where did `despace_mask16` come from there? Surely that’s not an intel builtin!

    1. I don’t give the full code within the blog post, you can find it on GitHub:

      https://github.com/lemire/despacer

  16. Brad Daniels says:

    It looks to me like the SIMD code won’t copy the last few bytes if the size is not a multiple of 16. Am I missing something?

    1. In my blog post, I provide only the gist of the code, please see the GitHub repo for actual working code.

  17. Thanks, this is a pretty nice post. I used your examples to see the impact of mispredicted branches in Python (*): I was somewhat surprised by their impact on this short piece of code. Thanks again!

    (*) Here the link to my python notebook:
    https://github.com/stegua/MyBlogEntries/blob/master/Remove%2Bblanks%2Bfrom%2Ba%2Bstring.ipynb

  18. Peter says:

    I am stating the obvious, but the “Leffmann” approach is not doing the same thing as the original.

    It depends on an unseen pre-processing stage that correctly fills an array with 1s and 0s. I expect that in practice it will in fact be slower when this stage is also accounted for.