I’m interested in what the benchmark is for this
size_t despace(char * bytes, size_t howmany) {
char* src = bytes;
char* dest = bytes;
for(size_t i = 0; i < howmany; i++) {
char c = *src++;
if (c == '\r' || c == '\n' || c == ' ') {
continue;
}
*dest++ = c;
}
return dest - bytes;
}
Nice demo, but why would it be foolish to expect it to be 16x faster? Have you already lost faith in the power of magic vector dust? Relative to memcpy, what do you think the fundamental limit should be here?
I may be doing something wrong, and I realize it isn’t exactly the point of your illustration, but I get about a 2x speedup if I remove the “if (mask16 == 0)” conditional from your code to make it branchless.
nate@skylake:~/git/despacer$ despacebenchmark
sse4_despace(buffer, N): 0.818359 cycles / ops
sse4_despace_branchless(buffer, N): 0.400391 cycles / ops
As you’d expect, the advantage of the branchless approach grows if you increase the probability of spaces (.94 vs .40 at 2%), and decreases as spaces become rarer (.47 vs .40 at .5%).
I wouldn’t assume that loop unrolling would have a significant effect here, although I also wouldn’t assume without looking that the compiler isn’t already doing it.
It’s possible that you could get a further small boost by doing a lookup for the length to advance rather than a popcount, but I’m not sure what the tradeoff of latency vs instruction is going to be here.
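For readers following along, here is a minimal sketch of the branchless 16-byte step being discussed. The whitespace test mirrors the scalar code at the top; despace_mask16 stands for the blog’s 65536-entry shuffle lookup table (see the GitHub repository), and the real code there may differ in details.
#include <nmmintrin.h>  /* SSSE3 shuffle + SSE4.2 popcount intrinsics */
#include <stddef.h>
extern const __m128i despace_mask16[1 << 16]; /* the blog's shuffle lookup table */
size_t sse4_despace_branchless_sketch(char *bytes, size_t howmany) {
  const __m128i spaces   = _mm_set1_epi8(' ');
  const __m128i newline  = _mm_set1_epi8('\n');
  const __m128i carriage = _mm_set1_epi8('\r');
  size_t pos = 0, i = 0;
  for (; i + 15 < howmany; i += 16) {
    __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
    __m128i ws = _mm_or_si128(_mm_or_si128(_mm_cmpeq_epi8(x, spaces),
                                           _mm_cmpeq_epi8(x, newline)),
                              _mm_cmpeq_epi8(x, carriage));
    int mask16 = _mm_movemask_epi8(ws); /* bit k set iff byte k is whitespace */
    /* the branchy variant adds: if (mask16 == 0) { store x; pos += 16; continue; } */
    x = _mm_shuffle_epi8(x, _mm_loadu_si128(despace_mask16 + mask16));
    _mm_storeu_si128((__m128i *)(bytes + pos), x);
    pos += 16 - _mm_popcnt_u32(mask16); /* advance by the number of kept bytes */
  }
  /* the remaining howmany - i bytes would be handled by the scalar loop (omitted) */
  return pos;
}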
Nice demo, but why would it be foolish to expect it to be 16x faster?
I prefer to be pessimistic so that I am rarely disappointed.
It’s possible that you could get a further small boost by doing a lookup for the length to advance rather than a popcount, but I’m not sure what the tradeoff of latency vs instruction is going to be here.
Yes. Good idea.
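A quick sketch of that idea, assuming a small table (the name despace_len16 is made up here) filled once at startup:
#include <stdint.h>
static uint8_t despace_len16[1 << 16]; /* bytes kept for each 16-bit whitespace mask */
static void init_despace_len16(void) {
  for (int m = 0; m < (1 << 16); m++) {
    int bits = 0;
    for (int k = 0; k < 16; k++) bits += (m >> k) & 1; /* portable popcount */
    despace_len16[m] = (uint8_t)(16 - bits);
  }
}
/* in the vector loop, instead of   pos += 16 - _mm_popcnt_u32(mask16);
   one would then write             pos += despace_len16[mask16];        */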
“It is going to be faster as long as most blocks of eight characters do not contain any white space.”
This doesn’t seem to hold for most data sets.
Also, I wouldn’t spend all that effort on bitwise checks and register copies, and instead focus on your basic algorithm. You are copying every byte twice. (Unless you are relying on optimization to magically fix that, but that sort of dodges the whole point of coding faster algorithms, in my opinion.) A better approach would be to scan for the beginning and end of blocks of text, and then move the whole block with an efficient overlap-safe move operation.
I did a simple comparison in perl, and scan and move was 20-25% faster than byte-at-a-time. There will certainly be language differences that may mean you get a different result in C, but these will mostly depend on hidden optimizations that shouldn’t matter for the purpose of finding the fastest algorithm.
You are copying every byte twice.
I am not sure what you mean? In the simple scalar code, I read each byte once, and I write most of them back.
A better approach would be to scan for the beginning and end of blocks of text, and then move the whole block with an efficient overlap-safe move operation.
I did consider an implementation based on memmove, but you will need many memmove calls in general.
I did a simple comparison in perl, and scan and move was 20-25% faster than byte-at-a-time. There will certainly be language differences that may mean you get a different result in C, but these will mostly depend on hidden optimizations that shouldn’t matter for the purpose of finding the fastest algorithm.
I’d very much like to see your code.
When I say you are copying every byte twice, I mean this:
char c = bytes[i];
…
bytes[pos++] = c;
Explicitly the code is copying the byte somewhere else, and then copying it back. What is actually happening after optimization may well be a different story.
Here’s the perl comparison:
----
#!/usr/bin/perl
$text=join(" ",@ARGV);
$howmany=length($text);
print "lemire: {", despace_lemire($text,$howmany), "} ";
$t=time;
for ($n=0; $n<1000000; ++$n) {
despace_lemire($text,$howmany);
}
print time-$t, "\n";
print "mine: {", despace_mine($text,$howmany), "} ";
$t=time;
for ($n=0; $n<1000000; ++$n) {
despace_mine($text,$howmany);
}
print time-$t, "\n";
sub despace_lemire {
my $text=shift(@_);
my $howmany=shift(@_);
my $i;
my $c;
my $pos=0;
for ($i=0; $i<$howmany; ++$i) {
$c=substr($text,$i,1);
next if ($c eq " ");
substr($text,$pos++,1)=$c;
}
#truncate to new length (perl doesn't copy nulls)
$text=substr($text,0,$pos);
return($text);
}
sub despace_mine {
my $text=shift(@_);
my $howmany=shift(@_);
my $i=0;
my $c;
my $from;
my $to;
#skip initial non-space section
$i=0;
while(substr($text,$i,1) ne " ") { ++$i; }
$to=$i;
#skip whitespace we found
while(substr($text,$i,1) eq " ") { ++$i; }
$from=$i;
while($i<$howmany) {
#scan to the end of the non-space block
while(substr($text,$i,1) ne " ") { last if (++$i >= $howmany); }
#copy
substr($text,$to,$i-$from) = substr($text,$from,$i-$from);
$to += $i-$from;
#skip whitespace
while(substr($text,$i,1) eq " ") { last if (++$i >= $howmany); }
$from=$i;
}
#truncate to new length (perl doesn’t copy nulls)
$text=substr($text,0,$to);
return($text);
}
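For comparison with the C code in the post, here is a rough C sketch of the same scan-and-move idea (my own translation, not Thomas’s code), using memmove for the block moves:
#include <string.h>
#include <stddef.h>
static int is_ws(char c) { return c == ' ' || c == '\n' || c == '\r'; }
size_t despace_scan_move(char *bytes, size_t howmany) {
  size_t i = 0;
  /* the leading non-space block is already in place, so just skip it */
  while (i < howmany && !is_ws(bytes[i])) i++;
  size_t to = i;
  while (i < howmany) {
    while (i < howmany && is_ws(bytes[i])) i++;   /* skip whitespace */
    size_t from = i;
    while (i < howmany && !is_ws(bytes[i])) i++;  /* find the end of the next block */
    memmove(bytes + to, bytes + from, i - from);  /* move the whole block at once */
    to += i - from;
  }
  return to;
}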
When I say you are copying every byte twice, I mean this:
char c = bytes[i];
…
bytes[pos++] = c;
Explicitly the code is copying the byte somewhere else, and then copying it back. What is actually happening after optimization may well be a different story.
I would argue that in C, these are semantically equivalent…
char c = bytes[i];
bytes[pos] = c;
and
bytes[pos] = bytes[i];
Regarding how this gets compiled… on all processors that I am familiar with, they both get compiled to two “move” instructions. You first have to retrieve “bytes[i]” to a register, then you need to write it to memory. There might be processors that support moving memory from RAM to RAM without going through a register, but I am not familiar with them.
There are certainly instructions (on x64 processors) that work directly on memory, without loading the data into a register (the bit test instruction is one such instruction). However, it is not necessarily faster and can even be much slower. I believe that in a lot of important cases, the compiler will get it right.
My understanding is that there is no such instruction on ARM processors; they always need to load the data into a register. So there is a load, an operation and then a store.
But let us focus on x64 processors… Is there any difference between how an x64 compiler will handle
const uint64_t mask = 1111;
a[0] &= mask;
and
a[0] &= 1111;
???
Do you have any example where it produces different machine code?
I wonder if it would be faster to look at 32 bits at a time. At least in English, the bulk of words are 3-5 letters long or shorter.
It seems you are not considering alignment.
Try the following: allocate and memory align a buffer, memcpy the string into it and then do the SIMD job on the memory aligned string.
(Consider how the cache works: if you try to move 64 bytes starting at address 63, it has to read not 64 bytes but 128 bytes, so be sure to align so that the starting address of the allocated buffer is evenly divisible by 64 – do that by over-allocating and then moving the pointer appropriately.)
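A sketch of that setup in C11; sse4_despace_branchless here stands in for whichever kernel from the repository is being benchmarked, and the 64-byte alignment matches the cache-line argument above.
#include <stdlib.h>
#include <string.h>
extern size_t sse4_despace_branchless(char *bytes, size_t howmany); /* from the despacer repository */
size_t despace_via_aligned_copy(const char *src, size_t howmany, char *out) {
  size_t padded = (howmany + 63) / 64 * 64;   /* aligned_alloc wants a multiple of the alignment */
  char *tmp = aligned_alloc(64, padded);
  if (tmp == NULL) return 0;
  memcpy(tmp, src, howmany);                  /* copy into the 64-byte-aligned buffer */
  size_t newlen = sse4_despace_branchless(tmp, howmany); /* SIMD pass on aligned data */
  memcpy(out, tmp, newlen);
  free(tmp);
  return newlen;
}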
Please see Data alignment for speed: myth or reality? http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
Thanks, interesting!
Not quite the improvement I thought it would be. Yet it definitely did result in an improvement on my machine (Intel(R) Core(TM) i7-3540M CPU @ 3.00GHz)
align to: 3
sse42_despace_branchless_lookup(buffer, N): 1.058594 cycles / ops
align to: 7
sse42_despace_branchless_lookup(buffer, N): 1.001953 cycles / ops
align to: 63
sse42_despace_branchless_lookup(buffer, N): 0.990234 cycles / ops
align to: 2
sse42_despace_branchless_lookup(buffer, N): 0.963867 cycles / ops
align to: 8
sse42_despace_branchless_lookup(buffer, N): 0.896484 cycles / ops
align to: 16
sse42_despace_branchless_lookup(buffer, N): 0.565430 cycles / ops
align to: 32
sse42_despace_branchless_lookup(buffer, N): 0.568359 cycles / ops
align to: 64
sse42_despace_branchless_lookup(buffer, N): 0.539062 cycles / ops
align to: 128
sse42_despace_branchless_lookup(buffer, N): 0.565430 cycles / ops
align to: 256
sse42_despace_branchless_lookup(buffer, N): 0.539062 cycles / ops
Let me clarify that. Aligning to 64 – you are completely right. Alignment being a “myth” – not so much, at least on my machine.
Note that this answer will depend on the chip you use for testing. When I implemented a similar algorithm for UTF-8 to UTF-16 transformation, doing extra work to align things was moderately better on Haswell-E (Xeon E5-1650v3), but was no gain on Skylake-T (Core i7 6700T). In general, the newer chips seem to deal with unaligned access better than older ones.
I modified my benchmark so that you can add an alignment offset…
Results on my Skylake server…
Results on a much older Ivy Bridge (like yours):
There are ways to trigger performance issues due to alignment on recent x64 processors, but you almost have to go out of your way to do it.
It seems you are not memory aligning your buffers.
Memory align a temporary buffer that you memcpy the input string to. After that do the operation on the temporary buffer.
(Consider doing a 64-byte move starting at address 63. The CPU would have to read/write not 64 bytes but 2×64 bytes, since the address was not aligned.)
Interesting.
I run a despace program on a GTX 1080 :
https://gist.github.com/antonmks/17e0c711d41fb07d1b6f3ada3f5f29ee
Seems like the gpu can do around 0.25 cycles/byte which is better than SIMD code but not quite as good as memcpy.
The code also looks simpler than SIMD code.
In the naive implementation you copy even the bytes before the first space. They don’t need to be copied because they’re already there. Two loops could make it faster.
@Paolo
Can you write a faster version?
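For reference, a minimal sketch of the two-loop idea: leave the prefix that contains no whitespace untouched, then compact the rest as in the original scalar code.
#include <stddef.h>
size_t despace_skip_prefix(char *bytes, size_t howmany) {
  size_t i = 0;
  /* first loop: bytes before the first whitespace are already in place */
  while (i < howmany && bytes[i] != ' ' && bytes[i] != '\n' && bytes[i] != '\r') i++;
  size_t pos = i;
  /* second loop: compact the remainder */
  for (; i < howmany; i++) {
    char c = bytes[i];
    if (c == ' ' || c == '\n' || c == '\r') continue;
    bytes[pos++] = c;
  }
  return pos;
}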
This is a completely memory bound problem and the “fast” version is not much faster (about 5%) when measured in wall clock time. Using rdtsc to measure CPU perf counters does not take waiting into account.
The biggest positive benefit of writing code like this (good CPU use in a memory bound problem) is giving the CPU an opportunity to do some hyperthreading while the current CPU thread is waiting for memory.
I wrote some more insights from my experiences in similar optimization tasks over at HN: https://news.ycombinator.com/item?id=13450245
I added wall clock time measuring with clock_gettime http://pasteall.org/208511
Here are the results:
memcpy(tmpbuffer,buffer,N): 0.125130 cycles / ops 1451124749 nsec (1.451125 sec)
countspaces(buffer, N): 3.710938 cycles / ops 1672809379 nsec (1.672809 sec)
despace(buffer, N): 6.603971 cycles / ops 1606222818 nsec (1.606223 sec)
faster_despace(buffer, N): 1.688668 cycles / ops 1547048647 nsec (1.547049 sec)
despace64(buffer, N): 3.590088 cycles / ops 1602308741 nsec (1.602309 sec)
despace_to(buffer, N, tmpbuffer): 6.306158 cycles / ops 1630737083 nsec (1.630737 sec)
avx2_countspaces(buffer, N): 0.191219 cycles / ops 1442104811 nsec (1.442105 sec)
avx2_despace(buffer, N): 5.754352 cycles / ops 1579367806 nsec (1.579368 sec)
sse4_despace(buffer, N): 0.984717 cycles / ops 1486145973 nsec (1.486146 sec)
sse4_despace_branchless(buffer, N): 0.339487 cycles / ops 1498728832 nsec (1.498729 sec)
sse4_despace_trail(buffer, N): 1.958952 cycles / ops 1479234359 nsec (1.479234 sec)
sse42_despace_branchless(buffer, N): 0.563217 cycles / ops 1534317675 nsec (1.534318 sec)
sse42_despace_branchless_lookup(buffer, N): 0.624996 cycles / ops 1563285415 nsec (1.563285 sec)
sse42_despace_to(buffer, N,tmpbuffer): 1.747602 cycles / ops 1525936772 nsec (1.525937 sec)
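For anyone reproducing this, a minimal wall-clock helper along these lines (the helper name is mine) is enough:
#include <stdint.h>
#include <time.h>
static inline uint64_t nsec_now(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);        /* wall-clock time, monotonic */
  return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
/* usage:
     uint64_t start = nsec_now();
     sse4_despace_branchless(buffer, N);
     uint64_t elapsed_ns = nsec_now() - start;
*/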
This is a completely memory bound problem and the “fast” version is not much faster (about 5%) when measured in wall clock time. Using rdtsc to measure CPU perf counters does not take waiting into account.
On recent Intel processors, rdtsc measures wall-clock time, though perhaps more accurately due to its low overhead.
Assuming this is using the built-in ^ operator, the condition haszero(xor1) ^ haszero(xor2) ^ haszero(xor3)) is true if and only if an odd number of the function results are true.
Re the hard-to-explain zero checking expression, I think a reasonably short & simple way to explain it is to note that for each byte it checks whether that byte was non-negative, and became negative (sign bit set) by the subtraction of 1.
Assuming this is using the built-in ^ operator, the condition haszero(xor1) ^ haszero(xor2) ^ haszero(xor3)) is true if and only if an odd number of the function results are true.
Thanks, I fixed that. It does not impact performance. In this case, my code passed sanity tests because of the statistics of my test.
Re the hard-to-explain zero checking expression, I think a reasonably short & simple way to explain it is to note that for each byte it checks whether that byte was non-negative, and became negative (sign bit set) by the subtraction of 1.
This makes sense. I didn’t want to bother with an explanation in the course of my blog post because *why* it works is not so important.
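For the curious, the check being discussed is, I believe, the classic SWAR “does this word contain a zero byte” test; for a 64-bit word it can be written as:
#include <stdint.h>
/* a byte of v is zero exactly when its sign bit was clear (non-negative)
   and subtracting 1 from every byte sets that sign bit */
static inline uint64_t haszero(uint64_t v) {
  return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}
/* in the despacing code, v would be the 8-byte word XORed with a broadcast
   of the character being searched for (e.g. eight copies of ' ') */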
I’m not at a proper computer atm. Can you try
1. Load 8bytes into xmm.
2. Compare for equality with spaces, etc.
3. Move the mask to a general purpose register.
4. Negate bits in that register.
5. Copy selected bits with PEXT (from bmi2 set) instruction.
6. Use POPCNT to advance a pointer.
Thanks,
Alex
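Here is a rough sketch of the 8-byte sequence Alex describes, assuming BMI2 (_pext_u64); this is my reading of the steps, not tested code from the repository.
#include <immintrin.h>  /* SSE2, BMI2, popcount */
#include <stdint.h>
#include <string.h>
/* process 8 input bytes, write up to 8 output bytes, return how many were kept */
static inline size_t despace8_pext(const char *src, char *dst) {
  __m128i x  = _mm_loadl_epi64((const __m128i *)src);          /* 1. load 8 bytes */
  __m128i ws = _mm_or_si128(_mm_or_si128(
                   _mm_cmpeq_epi8(x, _mm_set1_epi8(' ')),
                   _mm_cmpeq_epi8(x, _mm_set1_epi8('\n'))),
                   _mm_cmpeq_epi8(x, _mm_set1_epi8('\r')));     /* 2. compare */
  uint64_t wsbytes = (uint64_t)_mm_cvtsi128_si64(ws);           /* 3. 0xFF per whitespace byte */
  uint64_t keep = ~wsbytes;                                     /* 4. negate */
  uint64_t data;
  memcpy(&data, src, sizeof(data));
  uint64_t packed = _pext_u64(data, keep);                      /* 5. compact the kept bytes */
  memcpy(dst, &packed, sizeof(packed));                         /* writes a full 8 bytes */
  return (size_t)(_mm_popcnt_u64(keep) / 8);                    /* 6. advance by the kept count */
}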
Using PEXT is interesting. We have 16 bytes to deal with. PEXT only operates on 8 bytes. So we need two, with two pointer offsets. It might be fast, but it is not obviously faster than a SIMD-based approach, I think.
One advantage of PEXT approach is that it doesn’t use a table.
Alex
@Alex PEXT requires a mask; you either generate it on the fly or you look it up. I think you envision generating it on the fly… I’d be interested in the approach you’d take.
What is the performance of a simple sorting network
using SIMD?
Would it be better to define the problem as “shift all the whitespace to side of the vector”?
I am not sure which algorithm you have in mind. Sorting networks can be implemented just fine using SIMD, but I am not sure it helps for this problem.
There’s an error in `_mm_loadu_si128(despace_mask16 + mask16)` which has been fixed in the repo. Please update the blog post; this threw me for a loop.
To define a proper `__m128i` array, use the `_mm_set_epi8` macro which expands to a suitable constant initializer.
Anyway, that table is huge, 256 KiB. It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.
There’s an error in `_mm_loadu_si128(despace_mask16 + mask16)` which has been fixed in the repo. Please update the blog post; this threw me for a loop.
I don’t think that there is an error; if there is, please point it out more precisely. In the blog post, I omit the cast which is present in the GitHub repository, but the fact that despace_mask16 is not of type __m128i * is a technical detail (that is, you can rewrite the code so that it is).
To define a proper `__m128i` array, use the `_mm_set_epi8` macro which expands to a suitable constant initializer.
In C, _mm_set_epi8 is not a compile-time constant so you cannot do "__m128i array[] = {_mm_set1_epi8(0)};". You must be thinking of C++.
It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.
I think that the table is 1MB.
Anyway, that table is huge. It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.
Please see the GitHub repository, function sse4_despace_branchless_mask8. It is a tad slower. There is an extra load, an extra mask, an extra shift, and an extra XOR. All of these things appear to add up.
What is / can you explain `despace_mask16`? I don’t understand the bit twiddling hack going on here to give the right constant to the shuffle unit.
@Billy
What is / can you explain `despace_mask16`?
We are using the _mm_shuffle_epi8 intrinsic to move the bytes around so that flagged characters are omitted. This intrinsic takes an input register “a” as well as a control mask “b” and it outputs a new vector
(a[b[0]], a[b[1]], a[b[2]], a[b[3]], …). Computing the mask “b” on the fly is hard given the locations of the characters, so we use a look-up table.
It is not pretty.
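As a sketch of how such a table can be generated (this mirrors the idea; the repository’s actual table may be produced differently): for every 16-bit mask, list the positions of the unflagged bytes first, then pad. At 65536 entries of 16 bytes, that is the 1 MB table mentioned above.
#include <stdint.h>
static uint8_t despace_table[1 << 16][16]; /* 65536 x 16 bytes = 1 MiB */
static void build_despace_table(void) {
  for (int m = 0; m < (1 << 16); m++) {
    int out = 0;
    for (int b = 0; b < 16; b++) {
      if (!(m & (1 << b))) despace_table[m][out++] = (uint8_t)b; /* keep unflagged bytes */
    }
    while (out < 16) despace_table[m][out++] = 0x80; /* pshufb zeroes these lanes;
                                                        they fall past the kept length anyway */
  }
}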
>so we use a look-up table
Derp on my part — for some reason I read that initially as supplying that constant to _mm_shuffle_epi8 — I missed the load instruction. Lookup table makes sense, thanks!
Where did `despace_mask16` come from there? Surely that’s not an intel builtin!
I don’t give the full code within the blog post; you can find it on GitHub:
https://github.com/lemire/despacer
It looks to me like the SIMD code won’t copy the last few bytes if the size is not a multiple of 16. Am I missing something?
In my blog post, I provide only the gist of the code; please see the GitHub repo for actual working code.
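For completeness, the usual structure is to run the vector loop only while a full 16-byte block remains and to finish with the scalar code; a sketch follows (despace_block16 is a hypothetical helper standing in for the 16-byte SIMD step):
#include <stddef.h>
/* hypothetical helper: despaces one 16-byte block from src into dst,
   returns the number of bytes written */
size_t despace_block16(char *dst, const char *src);
size_t despace_with_tail(char *bytes, size_t howmany) {
  size_t i = 0, pos = 0;
  for (; i + 15 < howmany; i += 16) {
    pos += despace_block16(bytes + pos, bytes + i);
  }
  for (; i < howmany; i++) {  /* scalar tail for the last howmany - i bytes */
    char c = bytes[i];
    if (c == ' ' || c == '\n' || c == '\r') continue;
    bytes[pos++] = c;
  }
  return pos;
}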
Thanks, this is a pretty nice post. I used your examples to see the impact of mispredicted branches in Python (*): I somehow was surprised of their impact on this short piece of code. Thanks again!
(*) Here the link to my python notebook:
https://github.com/stegua/MyBlogEntries/blob/master/Remove%2Bblanks%2Bfrom%2Ba%2Bstring.ipynb
I am stating the obvious, but the “Leffmann” approach is not doing the same thing as the original.
It depends on an unseen pre-processing stage that correctly fills an array with 1s and 0s. I expect that in practice it will in fact be slower when this stage is also accounted for.