Neat! I have noticed a small mistake:
memcpy(&digits.as_int, str, sizeof(digits));
should be
memcpy(&digits.as_str, str, sizeof(digits));
I think you’re right. Currently, the code doesn’t use the as_str field at all. Why do we even need the digits union? Am I missing something?

dnnfndmsjs says:
Nope. The whole trick with the union here is UB. The compiler is allowed to optimize it the way it wants. The proper solution is std::bit_cast. And memcpy is a C function that is not constexpr, while std::bit_cast is, which allows the whole function to be constexpr.

This line, at the very least, means the code only works for little-endian machines, right:
digits.as_int <<= (4 - (len & 0x3)) * 8;
For Big Endian, you just need to reverse the bytes, which is typically just one instruction.
Big endian systems are vanishingly rare.
Does the corresponding C expression consistently map to just one instruction?
The rarity of big endian is no excuse to disregard it in a supposedly high-level language.
> The rarity of big endian is no excuse to disregard it in a supposedly high-level language
Do you also consider systems that have less/more than 8 bits in a byte or machine word? What about systems that don’t use two’s complement?
Yes, I do, and it can be a massive pain.
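For what it’s worth, here is a minimal sketch of the usual portable pattern (our illustration, not code from the post): with GCC and Clang, the memcpy below compiles to a plain load, and the conditional byte swap to a single bswap/rev instruction on big-endian targets.

#include <stdint.h>
#include <string.h>

/* Load 4 bytes as a little-endian 32-bit value, regardless of host endianness. */
static uint32_t load_le32(const void *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v)); /* compiles to a plain 32-bit load */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
  v = __builtin_bswap32(v); /* one instruction on typical big-endian targets */
#endif
  return v;
}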
Please see the first rule on my terms of use: https://lemire.me/blog/terms-of-use/
I copied and pasted code from your blog post and it does not compile, it contained a massive security hole, it contained a terrible bug, or it crashed my computer. Can I blame you?
No. Lemire’s rule: Blogging is literature, not engineering. Code taken from a blog post should not be expected to work, it is meant to illustrate an idea. Don’t build production systems by copying and pasting random code from the Internet. It will not end well.
I do build a lot of code that is meant to be deployed in production systems. The code used on my blog post is not meant for that. I do not run tests on mainframe platforms, for example.
That’s fair enough. I wasn’t trying to be a jerk.
Why would you reverse the string instead of changing 0x640a0100 by 0x010a6400 in big endian, or do I misunderstood the way that part works?
Instead of doing a left shift by a variable amount and a right shift by a constant amount, wouldn’t it be profitable to do only a right shift by a variable amount?
Why would you reverse the string instead of changing 0x640a0100 by 0x010a6400 in big endian, or do I misunderstood the way that part works?
It is almost surely not sufficient. Can you prove that it is?
Instead of doing a left shift by a variable amount and a right shift by a constant amount, wouldn’t it be profitable to do only a right shift by a variable amount?
Can you share your full code? I’d be happy to benchmark it.
Jean-Marc Bourguet says:
Now that I’ve had time to look at it, the issue with simple multiplication for big endian is that there will be unwanted carries. So it doesn’t work.
Combining the two shifts is essentially the same idea as KWillets and suffers the same issue of validation.
On the other hand, if I’m not confused again, you can use
uint32_t all_digits = ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;
which is simpler but doesn’t reduce the depth of the chain of dependent instructions, so I fear any gain will depend on the micro-architecture.
And you could use
*num = (uint8_t)((UINT32_C(0x640a01) * digits.as_int) >> 24);
which probably doesn’t change anything on 64-bit computers but needs only a 32-bit x 32-bit -> 32-bit multiply and thus could be useful on non-64-bit machines.
Yep. I have applied your proposals, see the credit section of the blog post. I will also mention you in the code.
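For readers who want to see the fragments quoted in this thread assembled in one place, here is a minimal sketch (our reconstruction under the post’s assumptions: a little-endian target, 4 readable bytes, and the GCC/Clang __builtin_bswap32 intrinsic; the length guard is hoisted, per the shift discussion further down):

#include <stdint.h>
#include <string.h>

int parse_uint8_fastswar_sketch(const char *str, size_t len, uint8_t *num) {
  if (len == 0 || len > 3) return 0; /* hoisted so the shift below stays < 32 */
  union {
    uint8_t as_str[4];
    uint32_t as_int;
  } digits;
  memcpy(&digits.as_int, str, sizeof(digits)); /* assumes a 4-byte readable buffer */
  digits.as_int ^= 0x30303030;     /* '0'..'9' -> 0..9, bytewise */
  digits.as_int <<= (4 - len) * 8; /* discard the bytes beyond len */
  uint32_t all_digits =
      ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;
  *num = (uint8_t)((UINT32_C(0x640a01) * digits.as_int) >> 24);
  /* the byte swap mirrors the swap_bytes() <= 0x020505 check in the Rust ports below */
  return all_digits && __builtin_bswap32(digits.as_int) <= 0x020505;
}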
Sorry, never mind what I wrote about a mistake. I need sleep.
Isn’t the “y < 256” comparison in the return useless? The value assigned to y is always ANDed with 0xff.
Thanks. It was a small mistake.
I wondered whether and how much removing the comparison would speed up parsing. It turns out, a fair bit. On my machine (x86_64, g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) the benchmark goes from
parse_uint8_fastswar : 1.51 GB/s 589.6 Ma/s 1.70 ns/d
to
parse_uint8_fastswar : 1.64 GB/s 638.6 Ma/s 1.57 ns/d
So GCC didn’t optimize it away.
If I’ve got my arithmetic right (or left in this case) the lower 3 bytes of 640a01 * (the unshifted digits) will contain three quantities, in little-endian byte order:
the first digit times one
the first digit times ten plus the second digit
the first digit times 100 plus the second digit times 10 plus the third digit.
Pick which one to output from the length argument.
This technique also tolerates cruft on the end, so you can unconditionally load 3-4 bytes and still pull the correct value even if bytes to the left overflow (eg “9” maps to 9/90/900, but the 900 doesn’t overflow into the part we want).
Sounds brilliant. If I can make it work, I will update with credit to you.
The idea works but I am not sure it could be faster.
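To make the lane layout concrete, here is a small self-contained check (our illustration, assuming a little-endian target, using the input “91”):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* "91" after the xor with 0x30: byte0 = 9, byte1 = 1 (little-endian, unshifted) */
  uint32_t digits = 0x0109;
  uint32_t p = UINT32_C(0x640a01) * digits; /* 0x678E5B09 */
  printf("%u\n", (unsigned)(p & 0xff));         /* 9  : d1 */
  printf("%u\n", (unsigned)((p >> 8) & 0xff));  /* 91 : d1*10 + d2 (the lane for len == 2) */
  printf("%u\n", (unsigned)((p >> 16) & 0xff)); /* 142: (d1*100 + d2*10) mod 256, carry spills up */
  return 0;
}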
KWillets says:
I see; there are a lot of validation cycles in there that I didn’t consider.
One thought on validation is that a well-formed ASCII integer will have the intermediate lanes bounded by 9/99/256 up to the output lane (eg “91” will have the first two lanes bounded but 910 > 256 in the third). A non-digit input should fail in one of the lanes (using 0/0/0 lower bound).
It looks like parsing the way KWillets suggests then indexing the result is faster than parse_uint8_swar but slower than parse_uint8_fastswar.

int parse_uint8_fastswar_tweak(const char *str, size_t len, uint8_t *num) {
union {
uint8_t as_str[4];
uint32_t as_int;
} digits;
memcpy(&digits.as_int, str, sizeof(digits));
digits.as_int ^= 0x30303030;
digits.as_int *= 0x00640a01;
*num = digits.as_str[(len & 0x3) - 1];
return (digits.as_int & 0xf0f0f0f0) == 0 && len != 0 && len < 4;
}
---
volume 25680 bytes
parse_uint8_swar : 1.04 GB/s 404.3 Ma/s 2.47 ns/d
parse_uint8_fastswar : 1.51 GB/s 586.4 Ma/s 1.71 ns/d
parse_uint8_fastswar_tweak : 1.23 GB/s 478.7 Ma/s 2.09 ns/d
parse_uint8_fromchars : 0.48 GB/s 187.8 Ma/s 5.32 ns/d
parse_uint8_naive : 0.65 GB/s 252.8 Ma/s 3.96 ns/d
If you replace indexing with a bit shift, you get performance comparable to parse_uint8_fastswar in a simpler function.

int parse_uint8_fastswar_tweak(const char *str, size_t len, uint8_t *num) {
uint32_t digits;
memcpy(&digits, str, sizeof(digits));
digits ^= 0x30303030;
digits *= 0x00640a01;
*num = (uint8_t)(digits >> (8 * ((len & 0x3) - 1)));
return (digits & 0xf0f0f0f0) == 0 && len != 0 && len < 4;
}
---
volume 25680 bytes
parse_uint8_swar : 1.04 GB/s 404.8 Ma/s 2.47 ns/d
parse_uint8_fastswar : 1.51 GB/s 589.6 Ma/s 1.70 ns/d
parse_uint8_fastswar_tweak : 1.50 GB/s 584.0 Ma/s 1.71 ns/d
parse_uint8_fromchars : 0.48 GB/s 186.8 Ma/s 5.35 ns/d
parse_uint8_naive : 0.64 GB/s 249.5 Ma/s 4.01 ns/d
Typo: “but it was only about 40% than the naive approach”
Thanks.
Are you not allowed to use a lookup table?
if len==1, do the ASCII conversion
else if len == 2, append two digits into a 16-bit int, subtract 0x0300 to convert the first byte and lookup in a 2560 item sparsely filled table
else live with a very big table, or do the ASCII conversion on the first byte, multiply by 100 and add the len == 2 conversion.
Yes. We can use a lookup table, if only for fun. Do you have an implementation?
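Not a full implementation, but here is a minimal sketch of the two-digit case (our reading of the idea: we subtract 0x3000 rather than 0x0300, so that the index of a valid pair stays below 2560):

#include <stdint.h>

static uint16_t two_digit_lut[2560]; /* sparsely filled; 0xFFFF marks invalid slots */

static void build_two_digit_lut(void) {
  for (int i = 0; i < 2560; i++) two_digit_lut[i] = 0xFFFF;
  for (int d1 = 0; d1 <= 9; d1++)
    for (int d2 = 0; d2 <= 9; d2++)
      two_digit_lut[(d1 << 8) | ('0' + d2)] = (uint16_t)(d1 * 10 + d2);
}

/* Parse exactly two bytes; returns 0..99, or 0xFFFF for invalid input. */
static uint16_t parse_two_digits(const char *str) {
  uint16_t v = (uint16_t)(((uint8_t)str[0] << 8) | (uint8_t)str[1]);
  v -= 0x3000; /* turn the first byte from ASCII into 0..9; non-digits fall out of range */
  return v < 2560 ? two_digit_lut[v] : 0xFFFF;
}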
Samuel Lee says:
Hmm, doesn’t the SWAR method give the wrong return value given an input string like “12>”? The validation of only having ASCII digits seems to be missing a validation of 0x3A-0x3F.
Also there is possibly room to be faster by shifting the result of the multiplication to the right by a variable amount (based on len), rather than shifting the masked input to the left by a variable amount; the computation of the shift amount is more likely to be able to be done in parallel with other work this way.
Also the shift amount can be computed with len directly, rather than (len & 0x3), given that the validity of the result is indicated by the return value (i.e. we don’t care about the shift amount when len>3).
The current version should provide complete validation although you are correct that the initial code used in this blog post provided only partial validation.
Jylam says:
First, great work.
However the test on len!=0 should be at the very start, otherwise the left shift is (((4-0)*8)==32), which is undefined behavior
However the test on len!=0 should be at the very start, otherwise the left shift is (((4-0)*8)==32), which is undefined behavior.
The C++14 and C17 standards make it defined behaviour. Here is the specification:

The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 x 2^E2, reduced modulo one more than the maximum value representable in the result type.
Jylam says:
Yes, as you maybe saw on twitter, I was referring to C99 (this behaviour is the same in C89 IIRC). Anyway, better be safe, and in all cases it’ll return early if len==0, so that’s that.
Greg says:
The from_chars results are disappointing all around. I am puzzled as to why the naive approach is so much faster than the standard library.

Indeed. Part of the reason is the “naive” function has a fixed upper bound on the number of loop iterations (the r = len & 0x3 clause). As a result (and I checked in Compiler Explorer), the compiler always unrolls the loop. std::from_chars allows an unbounded number of leading zeros, so it can’t unconditionally unroll the loop (without having a suffix afterwards, I guess).
This doesn’t explain all of the performance difference, but, as usual, often getting performance out of code is being very precise about what the actual requirements for the inputs and outputs are.
Marcin Zukowski says:
Cool post, pure fun and a few neat tricks 🙂
Two comments:
If you want to make this routine “safe” (and hence the comparison really fair), it’s enough to check whether the input crosses a page boundary, and take a slow path if it does (or have a SWAR version reading from “the other side”); see the sketch after this comment. That should be very branch-predictor safe, so hopefully wouldn’t slow things down much. Some hash functions use this approach.
The benchmarks give some advantage to SWAR considering you’re using a uniform distribution in range 0..255, making the average string length 2.23. There might be use cases where the expected number of digits is smaller, and then SWAR might be less beneficial or even slower.
But I’m nitpicking here 🙂
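Here is a minimal sketch of the page-boundary test mentioned above (our illustration; it assumes 4 KiB pages and that a 4-byte over-read is harmless as long as it stays within one page, the same assumption some hash functions make):

#include <stdint.h>

/* Nonzero if reading 4 bytes starting at p would cross a 4 KiB page boundary. */
static int crosses_page_4(const char *p) {
  return ((uintptr_t)p & 4095) > 4096 - 4;
}

/* Usage sketch (parse_uint8_slow is a hypothetical byte-by-byte fallback):
   if (crosses_page_4(str)) return parse_uint8_slow(str, len, num);
   return parse_uint8_fastswar(str, len, num);                        */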
The first version of our fast JSON parser (simdjson) was page-boundary aware… but this gave us much trouble because people using sanitizers or valgrind would report bugs. I tried arguing back but it became tiresome. So the current version of simdjson uses padded strings. It works well. I have a friend who writes string functions for a runtime library, but they get their code excluded from valgrind checks and sanitizers… so people don’t see anything… The use case for the function in this blog post is simdzone (parsing zone files for DNS systems), in which case the author knows that the input is padded.
The SWAR version is faster even in the case where we have a sequential input. So it has good chances of being faster in practice. Still people should benchmark on the data they care about!
KWillets says:
Is regular old SIMD on the menu? The byte range-checking would certainly be easier, and shuffling of quartets etc.
I actually think SIMD might be more practical.
But one has to have a good SWAR implementation to compare with!!!
Hi,
Since reading up to 4 bytes from str is OK, here is another non-SWAR implementation that runs equally fast for the random and sequential cases:
int parse_uint8_naive_md(const char *str, size_t len, uint8_t *num) {
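// note: for len == 0 the decrement wraps around (size_t is unsigned), so this also rejects empty input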
if (--len > 2) return 0;
static const uint8_t sf[] = {0,0,1,10,100}; // scale factor
uint64_t d1, d2, d3, s1, s2, s3, n;
d1 = (uint64_t)str[0] - (uint64_t)'0';
d2 = (uint64_t)str[1] - (uint64_t)'0';
d3 = (uint64_t)str[2] - (uint64_t)'0';
s1 = (uint64_t)sf[len+2];
s2 = (uint64_t)sf[len+1];
s3 = (uint64_t)sf[len];
n = s1*d1 + s2*d2 + s3*d3;
*num = (uint8_t)n;
return n < 256 && d1<10 && d2<10 && d3<10;
}
Interesting approach, but I think it does not work as is.
What if the input is a digit followed by random bytes, and the len is set to 1? Your return value depends on str[1] and str[2], which can be garbage.
Good catch! At the expense of a couple more checks, it should be fixed with:
return n < 256 && d1<10 && (s2==0 || d2<10) && (s3 == 0 || d3<10);
Meanwhile, for what it’s worth, the performance (of all of the implementations here) appears to be sensitive to the mixture of inputs from the [0,255] range and the [256,999] range. It must be primarily due to the short circuiting of the various terms in the return value expressions. (The quick test is to change val to uint16_t and change %256 to %1000 in the benchmarker; otherwise the inputs are all <256 and we don’t need that validation…).
Presumably the performance will also depend on the mixture of invalid inputs as well (cases where len=0, len>3, or non digit chars).
The fast function from the blog post should compile to branchless code, except for the length parameter; see the assembly in the blog post. If you use lengths that are outside the allowed range ([1,3]), then the performance can vary… But otherwise, it should be flat. The function’s performance is not data dependent. You can see this in my table of results.
Make sure you build your code in release mode with full optimizations.
Michael Dunphy says:
I agree that the fastswar performance is flat between sequential and random. I am finding that using values in [0,999] is a bit faster (about +13%; the mixture is approx 25%/75% valid/invalid), and even faster for values in [256,999] (about +18%, with 100% invalid values). The fastswar return statement uses & and not &&, so indeed it does not short-circuit… so I’m not sure what is the source of that extra performance for inputs >255.
Michael Dunphy says:
Scratch that; I was looking at the wrong metric (GB/s).
I don’t think that the following is sufficient because you don’t know the state of s2 and s3, at least in how I interpret the benchmark.
If we get past the first line, len will be 0, 1 or 2 (decremented by 1 from the initial 1, 2 or 3). When len=0, s1,s2,s3 = 1,0,0; for len=1, s1,s2,s3 = 10,1,0; and for len=2, s1,s2,s3 = 100,10,1. So s1,s2,s3 are always assigned. If s2 is zero we don’t care about the value of d2 (so the || short-circuits), but if s2 is nonzero then we do need to check d2. Similar for s3 & d3. It’s a bit contorted in pursuit of avoiding branches.
Jason M says:
I got nerdsniped by this a bit. I think your approach is getting close to optimal; it's hard to beat a constant-time operation that has a small constant. Nevertheless, my stab at trying to go faster involves creating a perfect hash table, with a very fast hash function (the key is the four-byte string cast to a u32). Obviously it trades quite a bit of space for some speed. It could probably be compacted in space with a bit of tweaking. This isn't a complete solution as it does not do all the error checking and whatnot, but I believe this could be made to go a bit faster than parse_uint8_fastswar, as it should have a smaller constant. To parse the ASCII int, it must be padded with NULs up to the four bytes, and should be aligned properly so the casting works. Then the parse should just be something like so:

lut[simple_hash(*(unsigned*)(str))];
Lut building code follows:
#define SIZE 16384
uint8_t lut[SIZE] = {};
uint32_t simple_hash(uint32_t u32_value) {
const uint32_t shift = 32;
uint64_t hash_val = (uint64_t)u32_value * 1000000000000000000ull;
hash_val = (hash_val >> shift) % SIZE;
return (uint32_t)hash_val;
}
void build_lut() {
char strings[256*4];
memset(strings, 0, sizeof(strings));
char *iter = strings;
for (int i = 0; i < 256; ++i) {
sprintf(iter, "%d", i);
iter += 4;
}
iter = strings;
for (int i = 0; i < 256; ++i) {
unsigned c = *(unsigned*) iter;
iter += 4;
unsigned idx = simple_hash(c) % SIZE;
lut[idx] = i;
}
}
The formatting got butchered a bit.
I have posted a cleaned up version here:
https://blog.loadzero.com/blog/parse-int-nerdsnipe/
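As a usage sketch (our addition, reusing the definitions above and assuming, as reported, that the hash is collision-free over the 256 valid strings): the input must be NUL-padded to 4 bytes, and a memcpy sidesteps the alignment requirement of a raw pointer cast.

#include <stdio.h>
#include <string.h>

int main(void) {
  build_lut();
  char buf[4] = "87"; /* initializes to {'8','7','\0','\0'} */
  unsigned key;
  memcpy(&key, buf, sizeof(key));
  printf("%d\n", lut[simple_hash(key)]); /* prints 87 */
  return 0;
}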
Wonderful article – I really enjoy this kind of work. The only bad part is that I spent a bit of time implementing this in Rust and took some time to fully understand some of your “magic”.

One minor observation – I believe that (digits.as_int | (0x06060606 + digits.as_int)) doesn’t need the digits.as_int | component. My test suite passes fully without it.

My Rust implementation (can’t seem to get the formatting quite right):

use std::num::Wrapping;

fn make_u8(s: &str) -> Option<u8> {
if s.is_empty() || s.len() > 3 {
return None;
}
let bytes = s.as_bytes();
// using a union avoids branching on the length to initialize each byte
// of the u32 interpretation.
let mut working = unsafe {
#[repr(C)]
union U {
bytes: [u8; 4],
num: u32,
}
// could use uninit here to avoid initialization...
let mut u = U { num: 0 };
u.bytes[..s.len()].copy_from_slice(&bytes[..s.len()]);
u.num
};
working ^= 0x30303030;
working <<= (4 - s.len()) * 8;
// Wrapping prevents panics on overflow.
let mult = Wrapping(0x640a01) * Wrapping(working);
// unwrap it now (could just use .0 but this is more explicit)
let Wrapping(mult) = mult;
let num = (mult >> 24) as u8;
let all_digits = (0x06060606 + working) & 0xF0F0F0F0 == 0;
let swapped = u32::from_be_bytes(working.to_le_bytes());
if !all_digits || swapped > 0x00020505 {
return None;
}
Some(num)
}
Christopher Sahnwaldt says:
I think the digits.as_int | part is necessary to catch bad input that contains characters with byte values 0 to 9, e.g. "12\003".

Christopher Sahnwaldt says:
Oh, I was wrong. These cases are caught by 0x06060606 + digits.as_int as well.

The digits.as_int | part is necessary to catch bad input that contains characters with byte values 0xCA to 0xCF. See https://github.com/jcsahnwaldt/parse_uint8_fastswar

@Bruce: Your tests pass without the digits.as_int | part because they don’t actually test the byte values 0xCA to 0xCF, although they seem to test values up to 0xFF.
The reason is that a Rust str contains UTF-8 bytes, not arbitrary bytes, and e.g. the value 0xFF gets converted to the two bytes 0xC3 and 0xBF.
I posted an issue: https://github.com/bmacnaughton/u8-swar/issues/1
It fails at 0xCA
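To spell out why those six byte values are special (a worked example, not from the thread): after the xor with 0x30, an input byte such as 0xCA becomes 0xFA, and adding 0x06 wraps it to 0x00 with a carry into the neighbouring byte, so the 0x06060606 test alone sees a clear high nibble. OR-ing in the original value keeps the high nibble set and catches it:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint8_t x = 0xCA ^ 0x30;           /* 0xFA after the decoding xor */
  uint8_t sum = (uint8_t)(x + 0x06); /* wraps to 0x00, so +0x06060606 alone misses it */
  uint8_t both = x | sum;            /* 0xFA: the high nibble survives the OR */
  printf("%02X %02X %02X\n", (unsigned)x, (unsigned)sum, (unsigned)both); /* FA 00 FA */
  return 0;
}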
Yes, you’re right – thank you. I ignored Unicode characters (and their encoding). I will add some additional tests to verify that (and will add the OR’d digits.as_int back in).
I don’t know what happened to the formatting of my previous post, so here’s a link to the code: https://github.com/bmacnaughton/u8-swar.
I had a go at a simple LUT-based approach, which seems to be faster in most situations I’ve tested, with a few caveats: https://quick-bench.com/q/r0wfNuy0JI0ZWT893FoR0oJHpSc
There’s no hash table here; I just reinterpreted the 3-byte string as the index into an 8MB array containing every possible result. Despite needing so much memory, only small parts of the array are actually accessed in practice (the ones containing valid values), so it doesn’t waste much cache space.
There are two versions – one where I’ve transformed the input strings in advance to pad them with nulls, so they can be directly interpreted as an integer, and one that works with the original string data by masking out the unwanted bytes. The version that uses padded strings is almost always faster by a large margin, but it’s kind of cheating, so I put in the version with the mask as a fairer comparison.
Some things I’ve noticed:
* The results vary quite a lot depending on whether you use or discard the return code. If you comment out ‘benchmark::DoNotOptimize(valid);’ from the benchmarks, the bit-twiddling functions get much faster
* Clang seems to be much better at optimising this than GCC
* The LUT-based approach seems to randomly vary in performance more than the others, which isn’t surprising since it relies much more on data remaining in the cache, and could be affected by other processes on the machine
The results vary quite a lot depending on whether you use or discard the return code.
That’s likely because your functions are inline-able. If you check my benchmark, you will notice that the functions are compiled separately. That’s by design: I do not want inlining as it changes the problem.
Clang seems to be much better at optimising this than GCC
On my test machine, with GCC 12 I get:
And with clang 16…
So it is a mixed bag, but I don’t see a large difference.
Ah, that explains it. Here’s the same benchmark using __attribute__((noinline)): https://quick-bench.com/q/80Z_47-AYurJITIaCzYCntYjg9A
Also, when I run the benchmarks on my machine (linux running in WSL on a Ryzen 5800X), I get this:
GCC 12:
parse_uint8_fastswar_bob : 1.55 GB/s 604.2 Ma/s 1.66 ns/d
parse_uint8_fastswar : 1.60 GB/s 625.0 Ma/s 1.60 ns/d
parse_uint8_swar : 1.43 GB/s 559.3 Ma/s 1.79 ns/d
parse_uint8_lut_padded : 1.71 GB/s 668.0 Ma/s 1.50 ns/d
parse_uint8_lut_masked : 1.86 GB/s 726.0 Ma/s 1.38 ns/d
parse_uint8_fromchars : 0.61 GB/s 237.8 Ma/s 4.21 ns/d
parse_uint8_naive : 0.91 GB/s 355.9 Ma/s 2.81 ns/d
Clang 15:
parse_uint8_fastswar_bob : 1.73 GB/s 673.2 Ma/s 1.49 ns/d
parse_uint8_fastswar : 1.70 GB/s 659.8 Ma/s 1.52 ns/d
parse_uint8_swar : 1.23 GB/s 478.4 Ma/s 2.09 ns/d
parse_uint8_lut_padded : 2.07 GB/s 802.9 Ma/s 1.25 ns/d
parse_uint8_lut_masked : 1.78 GB/s 691.3 Ma/s 1.45 ns/d
parse_uint8_fromchars : 0.72 GB/s 281.3 Ma/s 3.55 ns/d
parse_uint8_naive : 0.96 GB/s 372.6 Ma/s 2.68 ns/d
I also tried running the benchmark on my laptop (linux with an alder lake processor), and saw some enormous improvements over the SWAR approaches (with clang at least, I’m still not sure why GCC does so badly on my machines). -march=native also makes quite a difference, especially for fastswar_bob:
GCC 12
parse_uint8_fastswar_bob : 1.94 GB/s 755.2 Ma/s 1.32 ns/d
parse_uint8_fastswar : 1.84 GB/s 713.9 Ma/s 1.40 ns/d
parse_uint8_lut_padded : 1.96 GB/s 761.1 Ma/s 1.31 ns/d
parse_uint8_lut_masked : 1.99 GB/s 773.1 Ma/s 1.29 ns/d
GCC 12 with -march=native:
parse_uint8_fastswar_bob : 1.90 GB/s 738.0 Ma/s 1.35 ns/d
parse_uint8_fastswar : 1.85 GB/s 720.3 Ma/s 1.39 ns/d
parse_uint8_lut_padded : 1.95 GB/s 757.7 Ma/s 1.32 ns/d
parse_uint8_lut_masked : 1.99 GB/s 773.7 Ma/s 1.29 ns/d
clang 15:
parse_uint8_fastswar_bob : 1.85 GB/s 719.6 Ma/s 1.39 ns/d
parse_uint8_fastswar : 2.19 GB/s 851.0 Ma/s 1.18 ns/d
parse_uint8_lut_padded : 4.02 GB/s 1560.5 Ma/s 0.64 ns/d
parse_uint8_lut_masked : 2.98 GB/s 1158.7 Ma/s 0.86 ns/d
clang 15 with -march=native:
parse_uint8_fastswar_bob : 2.36 GB/s 918.9 Ma/s 1.09 ns/d
parse_uint8_fastswar : 2.41 GB/s 937.0 Ma/s 1.07 ns/d
parse_uint8_lut_padded : 4.02 GB/s 1561.2 Ma/s 0.64 ns/d
parse_uint8_lut_masked : 3.30 GB/s 1281.0 Ma/s 0.78 ns/d
Here’s the fork I used: https://github.com/PowellNGL/Code-used-on-Daniel-Lemire-s-blog
Added with credit and some small modifications. I can confirm that GCC is slower. It seems to use more instructions to get the job done when using a LUT (about 5 extra instructions).
Christopher Sahnwaldt says:
Here’s another Rust version:
fn parse_uint8_fastswar(b: &[u8]) -> Option<u8> {
if b.len() == 0 || b.len() > 3 { return None; }
let p = b.as_ptr() as *const u32;
let mut digits = unsafe { p.read_unaligned() };
digits ^= 0x30303030;
digits <<= (4 - b.len()) * 8;
let num = ((digits.wrapping_mul(0x640a01)) >> 24) as u8;
let all_digits = ((digits | (digits.wrapping_add(0x06060606))) & 0xF0F0F0F0) == 0;
(all_digits && digits.swap_bytes() <= 0x020505).then_some(num)
}
According to https://godbolt.org/z/Ts8xrqnc7, the resulting assembly code is very similar to the C version:
lea rax, [rsi - 4]
cmp rax, -3
jae .LBB3_2
xor eax, eax
ret
.LBB3_2:
mov eax, 808464432
xor eax, dword ptr [rdi]
neg sil
shl sil, 3
mov ecx, esi
shl eax, cl
imul edx, eax, 6556161
shr edx, 24
lea ecx, [rax + 101058054]
or ecx, eax
test ecx, -252645136
sete cl
bswap eax
cmp eax, 132358
setb al
and al, cl
ret
I haven’t run benchmarks yet. I hope this might be slightly faster than C because Option<u8> is returned in two registers, so there’s no write through a pointer.

See https://github.com/jcsahnwaldt/parse_uint8_fastswar. I’ll add tests and benchmarks later.
The benchmark results are somewhat inconclusive. Depending on CPU (Intel Xeon, Apple M2, Amazon Graviton) and compiler (GCC, Clang):

Sometimes Rust is faster than C, sometimes slower
Sometimes a C function returning an option in two registers is faster than writing the result through a pointer, sometimes slower
Sometimes fastswar is faster than fastswar_bob, sometimes slower
Sometimes one of the fastswar_* versions is as fast as the lut version, but usually lut is the fastest

See the fork at https://github.com/jcsahnwaldt/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28 for details.
Here is the thing. Your code sucks!
It just causes UB, and after that your code is not C/C++. It is written for a SUPER specific case (little-endian architecture, a specific compiler which ignores the UB caused by reading more bytes than is allowed, etc.).
And, more than that, it is absolutely unreadable. I just wrote the most stupid code I could imagine:
static int parse_uint8_switch_case(const char *str, size_t len, uint8_t *num) {
uint8_t hi, mid, lo;
#define as_u8(x) ((uint8_t)((x) - '0'))
switch(len) {
case 1:
*num = as_u8(str[0]);
return *num < 10;
case 2:
hi = as_u8(str[0]);
lo = as_u8(str[1]);
*num = hi * 10 + lo;
return (hi < 10) && (lo < 10);
case 3:
hi = as_u8(str[0]);
mid = as_u8(str[1]);
lo = as_u8(str[2]);
*num = hi * 100 + mid * 10 + lo;
return (hi < 10) && (mid < 10) && (lo < 10);
default:
return 0;
}
#undef as_u8
}
And then I just launched it here.
The results are on the screenshot.
So: I cannot read your code, it causes UB, and after all it is just slow.
Here is the thing. Your code sucks!
The blog post addresses your criticism, explicitly so:

Suppose that you can always read 4 bytes, even if the string is shorter (i.e., there is a buffer).

In production code where you do not know the target platform, you would want to reverse the bytes when the target is a big-endian system. Big-endian systems are vanishingly rare: mostly just mainframes. Modern systems compile a byte reversal to a single fast instruction. For the code in my blog post, I assume that you do not have a big-endian system, which is 99.99% certain.
You claim that there is UB, but you don’t explain why.
As for code readability, that’s an important issue, of course. But then the blog post is specifically about optimizing for performance.
For your benchmark, you use sequential numbers. The blog post explains that the approach being presented is only very beneficial when the inputs are unpredictable; it says so: So the SWAR approach is twice as fast as the naive approach when the inputs are unpredictable.
If your input is predictable, then sure, a naive implementation works. That’s already covered by the blog post.
On my MacBook (Apple M2, LLVM14):
On GCC12 with an Intel Ice Lake processor.
As you can see, your code is running at about half the speed of parse_uint8_fastswar_bob and parse_uint8_lut. In fact, it has the same performance as parse_uint8_naive in my tests.
As for the criticism that the code posted on my blog is not ready for production, please see my terms of use. It is by design. The code presented in my blog posts is not meant to be copied and pasted into your projects.
Jeroen Koekkoek says:
We can go faster if we limit dependencies(?) The idea is to ignore trash bytes and not shift them off. Given correct input, this is considerably faster on my machine. Obviously, it lacks proper error checking, but I think it’s an interesting enough approach to get input from others(?)
__attribute__((noinline))
static int parse_int8(const char *str, size_t len, uint8_t *num)
{
const uint32_t shr = ((len << 3) - 8) & 0x18;
uint32_t dgts;
memcpy(&dgts, str, sizeof(dgts));
dgts &= 0x000f0f0flu;
*num = (uint8_t)((0x640a01 * dgts) >> shr);
return 1;
}
It is about 25% faster than the fastest approach, when using random inputs.
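To trace the idea on a concrete input (our illustration, assuming a little-endian target and 4 readable bytes): for “25”, the variable right shift selects the second lane of the same 0x640a01 multiplication discussed earlier in the thread.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  const char str[4] = {'2', '5', 'x', 'y'};     /* trailing bytes are trash */
  size_t len = 2;
  const uint32_t shr = ((len << 3) - 8) & 0x18; /* len 1,2,3 -> shift 0,8,16 */
  uint32_t dgts;
  memcpy(&dgts, str, sizeof(dgts));
  dgts &= 0x000f0f0f; /* keep only the low nibble of the first three bytes */
  printf("%u\n", (unsigned)(uint8_t)((0x640a01 * dgts) >> shr)); /* prints 25 */
  return 0;
}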