Neat! I have noticed a small mistake:
memcpy(&digits.as_int, str, sizeof(digits));
should be
memcpy(&digits.as_str, str, sizeof(digits));
I think you’re right. Currently, the code doesn’t use the as_str field at all. Why do we even need the digits union? Am I missing something?

dnnfndmsjs says:
Nope. The whole trick with the union here is UB. The compiler is allowed to optimize it the way it wants. The proper solution is std::bit_cast. And memcpy is a C function that is not constexpr, while std::bit_cast is, which allows the whole function to be constexpr.

This line, at the very least, means the code only works for little-endian machines, right:
digits.as_int <<= (4 - (len & 0x3)) * 8;
For Big Endian, you just need to reverse the bytes, which is typically just one instruction.
Big endian systems are vanishingly rare.
Does the corresponding C expression consistently map to just one instruction?
The rarity of big endian is no excuse to disregard it in a supposedly high-level language.
> The rarity of big endian is no excuse to disregard it in a supposedly high-level language
Do you also consider systems that have less/more than 8 bits in a byte or machine word? What about systems that don’t use two’s complement?
Yes, I do, and it can be a massive pain.
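For what it’s worth, here is a minimal sketch of the usual portable pattern (our illustration, not code from the post): with GCC and Clang, the memcpy below compiles to a plain load, and the conditional byte swap to a single bswap/rev instruction on big-endian targets.

#include <stdint.h>
#include <string.h>

/* Load 4 bytes as a little-endian 32-bit value, regardless of host endianness. */
static uint32_t load_le32(const void *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v)); /* compiles to a plain 32-bit load */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
  v = __builtin_bswap32(v); /* one instruction on typical big-endian targets */
#endif
  return v;
}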
Please see the first rule on my terms of use: https://lemire.me/blog/terms-of-use/
I copied and pasted code from your blog post and it does not compile, it contained a massive security hole, it contained a terrible bug, or it crashed my computer. Can I blame you?
No. Lemire’s rule: Blogging is literature, not engineering. Code taken from a blog post should not be expected to work, it is meant to illustrate an idea. Don’t build production systems by copying and pasting random code from the Internet. It will not end well.
I do build a lot of code that is meant to be deployed in production systems. The code used on my blog post is not meant for that. I do not run tests on mainframe platforms, for example.
That’s fair enough. I wasn’t trying to be a jerk.
Why would you reverse the string instead of changing 0x640a0100 by 0x010a6400 in big endian, or do I misunderstood the way that part works?
Instead of doing a left shift by a variable amount and a right shift by a constant amount, wouldn’t it be profitable to do only a right shift by a variable amount?
Why would you reverse the string instead of changing 0x640a0100 by 0x010a6400 in big endian, or do I misunderstood the way that part works?
It is almost surely not sufficient. Can you prove that it is?
Instead of doing a left shift by a variable amount and a right shift by a constant amount, wouldn’t it be profitable to do only a right shift by a variable amount?
Can you share your full code? I’d be happy to benchmark it.
Jean-Marc Bourguet says:
Now that I’ve had time to look at it, the issue with simple multiplication for big endian is that there will be unwanted carries. So it doesn’t work.
Combining the two shifts is essentially the same idea as KWillets and suffers the same issue of validation.
On the other hand, if I’m not confused again, you can use
uint32_t all_digits = ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;
which is simpler but doesn’t reduce the depth of the chain of dependent instructions, so I fear any gain will depend on the micro-architecture.
And you could use
*num = (uint8_t)((UINT32_C(0x640a01) * digits.as_int) >> 24);
which probably doesn’t change anything on 64-bit computers but needs only a 32-bit x 32-bit -> 32-bit multiply and thus could be useful on non-64-bit machines.
Yep. I have applied your proposals, see the credit section of the blog post. I will also mention you in the code.
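For readers who want to see the fragments quoted in this thread assembled in one place, here is a minimal sketch (our reconstruction under the post’s assumptions: a little-endian target, 4 readable bytes, and the GCC/Clang __builtin_bswap32 intrinsic; the length guard is hoisted, per the shift discussion further down):

#include <stdint.h>
#include <string.h>

int parse_uint8_fastswar_sketch(const char *str, size_t len, uint8_t *num) {
  if (len == 0 || len > 3) return 0; /* hoisted so the shift below stays < 32 */
  union {
    uint8_t as_str[4];
    uint32_t as_int;
  } digits;
  memcpy(&digits.as_int, str, sizeof(digits)); /* assumes a 4-byte readable buffer */
  digits.as_int ^= 0x30303030;     /* '0'..'9' -> 0..9, bytewise */
  digits.as_int <<= (4 - len) * 8; /* discard the bytes beyond len */
  uint32_t all_digits =
      ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;
  *num = (uint8_t)((UINT32_C(0x640a01) * digits.as_int) >> 24);
  /* the byte swap mirrors the swap_bytes() <= 0x020505 check in the Rust ports below */
  return all_digits && __builtin_bswap32(digits.as_int) <= 0x020505;
}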
Sorry, never mind what I wrote about a mistake. I need sleep.
Isn’t the “y < 256” comparison in the return useless? The value assigned to y is always ANDed with 0xff.
Thanks. It was a small mistake.
I wondered whether and how much removing the comparison would speed up parsing. It turns out, a fair bit. On my machine (x86_64, g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) the benchmark goes from
parse_uint8_fastswar : 1.51 GB/s 589.6 Ma/s 1.70 ns/d
to
parse_uint8_fastswar : 1.64 GB/s 638.6 Ma/s 1.57 ns/d
So GCC didn’t optimize it away.
If I’ve got my arithmetic right (or left in this case) the lower 3 bytes of 640a01 * (the unshifted digits) will contain three quantities, in little-endian byte order:
the first digit times one
the first digit times ten plus the second digit
the first digit times 100 plus the second digit times 10 plus the third digit.
Pick which one to output from the length argument.
This technique also tolerates cruft on the end, so you can unconditionally load 3-4 bytes and still pull the correct value even if bytes to the left overflow (eg “9” maps to 9/90/900, but the 900 doesn’t overflow into the part we want).
Sounds brilliant. If I can make it work, I will update with credit to you.
The idea works but I am not sure it could be faster.
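To make the lane layout concrete, here is a small self-contained check (our illustration, assuming a little-endian target, using the input “91”):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* "91" after the xor with 0x30: byte0 = 9, byte1 = 1 (little-endian, unshifted) */
  uint32_t digits = 0x0109;
  uint32_t p = UINT32_C(0x640a01) * digits; /* 0x678E5B09 */
  printf("%u\n", (unsigned)(p & 0xff));         /* 9  : d1 */
  printf("%u\n", (unsigned)((p >> 8) & 0xff));  /* 91 : d1*10 + d2 (the lane for len == 2) */
  printf("%u\n", (unsigned)((p >> 16) & 0xff)); /* 142: (d1*100 + d2*10) mod 256, carry spills up */
  return 0;
}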
KWillets says:
I see; there are a lot of validation cycles in there that I didn’t consider.
One thought on validation is that a well-formed ASCII integer will have the intermediate lanes bounded by 9/99/256 up to the output lane (eg “91” will have the first two lanes bounded but 910 > 256 in the third). A non-digit input should fail in one of the lanes (using 0/0/0 lower bound).
It looks like parsing the way KWillets suggests then indexing the result is faster than parse_uint8_swar but slower than parse_uint8_fastswar.

int parse_uint8_fastswar_tweak(const char *str, size_t len, uint8_t *num) {
union {
uint8_t as_str[4];
uint32_t as_int;
} digits;
memcpy(&digits.as_int, str, sizeof(digits));
digits.as_int ^= 0x30303030;
digits.as_int *= 0x00640a01;
*num = digits.as_str[(len & 0x3) - 1];
return (digits.as_int & 0xf0f0f0f0) == 0 && len != 0 && len < 4;
}
---
volume 25680 bytes
parse_uint8_swar : 1.04 GB/s 404.3 Ma/s 2.47 ns/d
parse_uint8_fastswar : 1.51 GB/s 586.4 Ma/s 1.71 ns/d
parse_uint8_fastswar_tweak : 1.23 GB/s 478.7 Ma/s 2.09 ns/d
parse_uint8_fromchars : 0.48 GB/s 187.8 Ma/s 5.32 ns/d
parse_uint8_naive : 0.65 GB/s 252.8 Ma/s 3.96 ns/d
If you replace indexing with a bit shift, you get performance comparable to parse_uint8_fastswar in a simpler function.

int parse_uint8_fastswar_tweak(const char *str, size_t len, uint8_t *num) {
uint32_t digits;
memcpy(&digits, str, sizeof(digits));
digits ^= 0x30303030;
digits *= 0x00640a01;
*num = (uint8_t)(digits >> (8 * ((len & 0x3) - 1)));
return (digits & 0xf0f0f0f0) == 0 && len != 0 && len < 4;
}
---
volume 25680 bytes
parse_uint8_swar : 1.04 GB/s 404.8 Ma/s 2.47 ns/d
parse_uint8_fastswar : 1.51 GB/s 589.6 Ma/s 1.70 ns/d
parse_uint8_fastswar_tweak : 1.50 GB/s 584.0 Ma/s 1.71 ns/d
parse_uint8_fromchars : 0.48 GB/s 186.8 Ma/s 5.35 ns/d
parse_uint8_naive : 0.64 GB/s 249.5 Ma/s 4.01 ns/d
Typo: “but it was only about 40% than the naive approach”
Thanks.
Are you not allowed to use a lookup table?
if len==1, do the ASCII conversion
else if len == 2, append two digits into a 16-bit int, subtract 0x0300 to convert the first byte and lookup in a 2560 item sparsely filled table
else live with a very big table, or do the ASCII conversion on the first byte, multiply by 100 and add the len == 2 conversion.
Yes. We can use a lookup table, if only for fun. Do you have an implementation?
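Not a full implementation, but here is a minimal sketch of the two-digit case (our reading of the idea: we subtract 0x3000 rather than 0x0300, so that the index of a valid pair stays below 2560):

#include <stdint.h>

static uint16_t two_digit_lut[2560]; /* sparsely filled; 0xFFFF marks invalid slots */

static void build_two_digit_lut(void) {
  for (int i = 0; i < 2560; i++) two_digit_lut[i] = 0xFFFF;
  for (int d1 = 0; d1 <= 9; d1++)
    for (int d2 = 0; d2 <= 9; d2++)
      two_digit_lut[(d1 << 8) | ('0' + d2)] = (uint16_t)(d1 * 10 + d2);
}

/* Parse exactly two bytes; returns 0..99, or 0xFFFF for invalid input. */
static uint16_t parse_two_digits(const char *str) {
  uint16_t v = (uint16_t)(((uint8_t)str[0] << 8) | (uint8_t)str[1]);
  v -= 0x3000; /* turn the first byte from ASCII into 0..9; non-digits fall out of range */
  return v < 2560 ? two_digit_lut[v] : 0xFFFF;
}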
Samuel Lee says:
Hmm, doesn’t the SWAR method give the wrong return value given an input string like “12>”? The validation of only having ASCII digits seems to be missing a validation of 0x3A-0x3F.
Also there is possibly room to be faster by shifting the result of the multiplication to the right by a variable amount (based on len), rather than shifting the masked input to the left by a variable amount; the computation of the shift amount is more likely to be able to be done in parallel with other work this way.
Also the shift amount can be computed with len directly, rather than (len & 0x3), given that the validity of the result is indicated by the return value (i.e. we don’t care about the shift amount when len>3).
The current version should provide complete validation although you are correct that the initial code used in this blog post provided only partial validation.
Jylam says:
First, great work.
However the test on len!=0 should be at the very start, otherwise the left shift is (((4-0)*8)==32), which is undefined behavior
However the test on len!=0 should be at the very start, otherwise the left shift is (((4-0)*8)==32), which is undefined behavior.
The C++14 and C17 standards make it defined behaviour. Here is the specification:

The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 x 2^E2, reduced modulo one more than the maximum value representable in the result type.
Jylam says:
Yes, as you maybe saw on twitter, I was referring to C99 (this behaviour is the same in C89 IIRC). Anyway, better be safe, and in all cases it’ll return early if len==0, so that’s that.
Greg says:
The from_chars results are disappointing all around. I am puzzled as to why the naive approach is so much faster than the standard library.

Indeed. Part of the reason is the “naive” function has a fixed upper bound on the number of loop iterations (the r = len & 0x3 clause). As a result (and I checked in Compiler Explorer), the compiler always unrolls the loop. std::from_chars allows an unbounded number of leading zeros, so it can’t unconditionally unroll the loop (without having a suffix afterwards, I guess).
This doesn’t explain all of the performance difference, but, as usual, often getting performance out of code is being very precise about what the actual requirements for the inputs and outputs are.
Marcin Zukowski says:
Cool post, pure fun and a few neat tricks 🙂
Two comments:
If you want to make this routine “safe” (and hence the comparison really fair), it’s enough to check whether the input crosses a page boundary, and take a slow path if it does (or have a SWAR version reading from “the other side”); see the sketch after this comment. That should be very branch-predictor safe, so hopefully wouldn’t slow things down much. Some hash functions use this approach.
The benchmarks give some advantage to SWAR considering you’re using a uniform distribution in range 0..255, making the average string length 2.23. There might be use cases where the expected number of digits is smaller, and then SWAR might be less beneficial or even slower.
But I’m nitpicking here 🙂
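Here is a minimal sketch of the page-boundary test mentioned above (our illustration; it assumes 4 KiB pages and that a 4-byte over-read is harmless as long as it stays within one page, the same assumption some hash functions make):

#include <stdint.h>

/* Nonzero if reading 4 bytes starting at p would cross a 4 KiB page boundary. */
static int crosses_page_4(const char *p) {
  return ((uintptr_t)p & 4095) > 4096 - 4;
}

/* Usage sketch (parse_uint8_slow is a hypothetical byte-by-byte fallback):
   if (crosses_page_4(str)) return parse_uint8_slow(str, len, num);
   return parse_uint8_fastswar(str, len, num);                        */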
The first version of our fast JSON parser (simdjson) was page-boundary aware… but this gave us much trouble because people using sanitizers or valgrind would report bugs. I tried arguing back but it became tiresome. So the current version of simdjson uses padded strings. It works well. I have a friend who writes string functions for a runtime library, but they get their code excluded from valgrind checks and sanitizers… so people don’t see anything… The use case for the function in this blog post is simdzone (parsing zone files for DNS systems), in which case the author knows that the input is padded.
The SWAR version is faster even in the case where we have a sequential input. So it has good chances of being faster in practice. Still people should benchmark on the data they care about!
KWillets says:
Is regular old SIMD on the menu? The byte range-checking would certainly be easier, and shuffling of quartets etc.
I actually think SIMD might be more practical.
But one has to have a good SWAR implementation to compare with!!!
Hi,
Since reading up to 4 bytes from str is OK, here is another non-SWAR implementation that runs equally fast for the random and sequential cases:
int parse_uint8_naive_md(const char *str, size_t len, uint8_t *num) {
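// note: for len == 0 the decrement wraps around (size_t is unsigned), so this also rejects empty input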
if (--len > 2) return 0;
static const uint8_t sf[] = {0,0,1,10,100}; // scale factor
uint64_t d1, d2, d3, s1, s2, s3, n;
d1 = (uint64_t)str[0] - (uint64_t)'0';
d2 = (uint64_t)str[1] - (uint64_t)'0';
d3 = (uint64_t)str[2] - (uint64_t)'0';
s1 = (uint64_t)sf[len+2];
s2 = (uint64_t)sf[len+1];
s3 = (uint64_t)sf[len];
n = s1*d1 + s2*d2 + s3*d3;
*num = (uint8_t)n;
return n < 256 && d1<10 && d2<10 && d3<10;
}
Interesting approach, but I think it does not work as is.
What if the input is a digit followed by random bytes, and the len is set to 1? Your return value depends on str[1] and str[2], which can be garbage.
Good catch! At the expense of a couple more checks, it should be fixed with:
return n < 256 && d1<10 && (s2==0 || d2<10) && (s3 == 0 || d3<10);
Meanwhile, for what it’s worth, the performance (of all of the implementations here) appears to be sensitive to the mixture of inputs from the [0,255] range and the [256,999] range. It must be primarily due to the short circuiting of the various terms in the return value expressions. (The quick test is to change val to uint16_t and change %256 to %1000 in the benchmarker; otherwise the inputs are all <256 and we don’t need that validation…).
Presumably the performance will also depend on the mixture of invalid inputs as well (cases where len=0, len>3, or non digit chars).
The fast function from the blog post should compile to branchless code, except for the length parameter; see the assembly in the blog post. If you use lengths that are outside the allowed range ([1,3]), then the performance can vary… But otherwise, it should be flat. The function’s performance is not data dependent. You can see this in my table of results.
Make sure you build your code in release mode with full optimizations.
Michael Dunphy says:
I agree that the fastswar performance is flat between sequential and random. I am finding that using values in [0,999] is a bit faster (about +13%; the mixture is approx 25%/75% valid/invalid), and even faster for values in [256,999] (about +18%, with 100% invalid values). The fastswar return statement uses & and not &&, so indeed it does not short-circuit… so I’m not sure what is the source of that extra performance for inputs >255.
Michael Dunphy says:
Scratch that; I was looking at the wrong metric (GB/s).
I don’t think that the following is sufficient because you don’t know the state of s2 and s3, at least in how I interpret the benchmark.
If we get past the first line, len will be 0, 1 or 2 (decremented by 1 from the initial 1, 2 or 3). When len=0, s1,s2,s3 = 1,0,0; for len=1, s1,s2,s3 = 10,1,0; and for len=2, s1,s2,s3 = 100,10,1. So s1,s2,s3 are always assigned. If s2 is zero we don’t care about the value of d2 (so the || short-circuits), but if s2 is nonzero then we do need to check d2. Similar for s3 & d3. It’s a bit contorted in pursuit of avoiding branches.
Jason M says:
I got nerdsniped by this a bit. I think your approach is getting close to optimal; it's hard to beat a constant-time operation that has a small constant. Nevertheless, my stab at trying to go faster involves creating a perfect hash table, with a very fast hash function (the key is the four-byte string cast to a u32). Obviously it trades quite a bit of space for some speed. It could probably be compacted in space with a bit of tweaking. This isn't a complete solution as it does not do all the error checking and whatnot, but I believe this could be made to go a bit faster than parse_uint8_fastswar, as it should have a smaller constant. To parse the ASCII int, it must be padded with NULs up to the four bytes, and should be aligned properly so the casting works. Then the parse should just be something like so:

lut[simple_hash(*(unsigned*)(str))];
Lut building code follows:
#define SIZE 16384
uint8_t lut[SIZE] = {};
uint32_t simple_hash(uint32_t u32_value) {
const uint32_t shift = 32;
uint64_t hash_val = (uint64_t)u32_value * 1000000000000000000ull;
hash_val = (hash_val >> shift) % SIZE;
return (uint32_t)hash_val;
}
void build_lut() {
char strings[256*4];
memset(strings, 0, sizeof(strings));
char *iter = strings;
for (int i = 0; i < 256; ++i) {
sprintf(iter, "%d", i);
iter += 4;
}
iter = strings;
for (int i = 0; i < 256; ++i) {
unsigned c = *(unsigned*) iter;
iter += 4;
unsigned idx = simple_hash(c) % SIZE;
lut[idx] = i;
}
}
The formatting got butchered a bit.
I have posted a cleaned up version here:
https://blog.loadzero.com/blog/parse-int-nerdsnipe/
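As a usage sketch (our addition, reusing the definitions above and assuming, as reported, that the hash is collision-free over the 256 valid strings): the input must be NUL-padded to 4 bytes, and a memcpy sidesteps the alignment requirement of a raw pointer cast.

#include <stdio.h>
#include <string.h>

int main(void) {
  build_lut();
  char buf[4] = "87"; /* initializes to {'8','7','\0','\0'} */
  unsigned key;
  memcpy(&key, buf, sizeof(key));
  printf("%d\n", lut[simple_hash(key)]); /* prints 87 */
  return 0;
}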
Wonderful article – I really enjoy this kind of work. The only bad part is that I spent a bit of time implementing this in Rust and took some time to fully understand some of your “magic”.

One minor observation – I believe that (digits.as_int | (0x06060606 + digits.as_int)) doesn’t need the digits.as_int | component. My test suite passes fully without it.

My Rust implementation (can’t seem to get the formatting quite right):

use std::num::Wrapping;

fn make_u8(s: &str) -> Option<u8> {
if s.is_empty() || s.len() > 3 {
return None;
}
let bytes = s.as_bytes();
// using a union avoids branching on the length to initialize each byte
// of the u32 interpretation.
let mut working = unsafe {
#[repr(C)]
union U {
bytes: [u8; 4],
num: u32,
}
// could use uninit here to avoid initialization...
let mut u = U { num: 0 };
u.bytes[..s.len()].copy_from_slice(&bytes[..s.len()]);
u.num
};
working ^= 0x30303030;
working <<= (4 - s.len()) * 8;
// Wrapping prevents panics on overflow.
let mult = Wrapping(0x640a01) * Wrapping(working);
// unwrap it now (could just use .0 but this is more explicit)
let Wrapping(mult) = mult;
let num = (mult >> 24) as u8;
let all_digits = (0x06060606 + working) & 0xF0F0F0F0 == 0;
let swapped = u32::from_be_bytes(working.to_le_bytes());
if !all_digits || swapped > 0x00020505 {
return None;
}
Some(num)
}
Christopher Sahnwaldt says:
I think the digits.as_int | part is necessary to catch bad input that contains characters with byte values 0 to 9, e.g. "12\003".

Christopher Sahnwaldt says:
Oh, I was wrong. These cases are caught by 0x06060606 + digits.as_int as well.

The digits.as_int | part is necessary to catch bad input that contains characters with byte values 0xCA to 0xCF. See https://github.com/jcsahnwaldt/parse_uint8_fastswar

@Bruce: Your tests pass without the digits.as_int | part because they don’t actually test the byte values 0xCA to 0xCF, although they seem to test values up to 0xFF.
The reason is that a Rust str contains UTF-8 bytes, not arbitrary bytes, and e.g. the value 0xFF gets converted to the two bytes 0xC3 and 0xBF.
I posted an issue: https://github.com/bmacnaughton/u8-swar/issues/1
It fails at 0xCA
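To spell out why those six byte values are special (a worked example, not from the thread): after the xor with 0x30, an input byte such as 0xCA becomes 0xFA, and adding 0x06 wraps it to 0x00 with a carry into the neighbouring byte, so the 0x06060606 test alone sees a clear high nibble. OR-ing in the original value keeps the high nibble set and catches it:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint8_t x = 0xCA ^ 0x30;           /* 0xFA after the decoding xor */
  uint8_t sum = (uint8_t)(x + 0x06); /* wraps to 0x00, so +0x06060606 alone misses it */
  uint8_t both = x | sum;            /* 0xFA: the high nibble survives the OR */
  printf("%02X %02X %02X\n", (unsigned)x, (unsigned)sum, (unsigned)both); /* FA 00 FA */
  return 0;
}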
Yes, you’re right – thank you. I ignored Unicode characters (and their encoding). I will add some additional tests to verify that (and will add the OR’d digits.as_int back in).
I don’t know what happened to the formatting of my previous post, so here’s a link to the code: https://github.com/bmacnaughton/u8-swar.
I had a go at a simple LUT-based approach, which seems to be faster in most situations I’ve tested, with a few caveats: https://quick-bench.com/q/r0wfNuy0JI0ZWT893FoR0oJHpSc
There’s no hash table here; I just reinterpreted the 3-byte string as the index into an 8MB array containing every possible result. Despite needing so much memory, only small parts of the array are actually accessed in practice (the ones containing valid values), so it doesn’t waste much cache space.
There are two versions – one where I’ve transformed the input strings in advance to pad them with nulls, so they can be directly interpreted as an integer, and one that works with the original string data by masking out the unwanted bytes. The version that uses padded strings is almost always faster by a large margin, but it’s kind of cheating, so I put in the version with the mask as a fairer comparison.
Some things I’ve noticed:
* The results vary quite a lot depending on whether you use or discard the return code. If you comment out ‘benchmark::DoNotOptimize(valid);’ from the benchmarks, the bit-twiddling functions get much faster
* Clang seems to be much better at optimising this than GCC
* The LUT-based approach seems to randomly vary in performance more than the others, which isn’t surprising since it relies much more on data remaining in the cache, and could be affected by other processes on the machine
The results vary quite a lot depending on whether you use or discard the return code.
That’s likely because your functions are inline-able. If you check my benchmark, you will notice that the functions are compiled separately. That’s by design: I do not want inlining as it changes the problem.
Clang seems to be much better at optimising this than GCC
On my test machine, with GCC 12 I get:
And with clang 16…
So it is a mixed bag, but I don’t see a large difference.
Ah, that explains it. Here’s the same benchmark using __attribute__((noinline)): https://quick-bench.com/q/80Z_47-AYurJITIaCzYCntYjg9A
Also, when I run the benchmarks on my machine (linux running in WSL on a Ryzen 5800X), I get this:
GCC 12:
parse_uint8_fastswar_bob : 1.55 GB/s 604.2 Ma/s 1.66 ns/d
parse_uint8_fastswar : 1.60 GB/s 625.0 Ma/s 1.60 ns/d
parse_uint8_swar : 1.43 GB/s 559.3 Ma/s 1.79 ns/d
parse_uint8_lut_padded : 1.71 GB/s 668.0 Ma/s 1.50 ns/d
parse_uint8_lut_masked : 1.86 GB/s 726.0 Ma/s 1.38 ns/d
parse_uint8_fromchars : 0.61 GB/s 237.8 Ma/s 4.21 ns/d
parse_uint8_naive : 0.91 GB/s 355.9 Ma/s 2.81 ns/d
Clang 15:
parse_uint8_fastswar_bob : 1.73 GB/s 673.2 Ma/s 1.49 ns/d
parse_uint8_fastswar : 1.70 GB/s 659.8 Ma/s 1.52 ns/d
parse_uint8_swar : 1.23 GB/s 478.4 Ma/s 2.09 ns/d
parse_uint8_lut_padded : 2.07 GB/s 802.9 Ma/s 1.25 ns/d
parse_uint8_lut_masked : 1.78 GB/s 691.3 Ma/s 1.45 ns/d
parse_uint8_fromchars : 0.72 GB/s 281.3 Ma/s 3.55 ns/d
parse_uint8_naive : 0.96 GB/s 372.6 Ma/s 2.68 ns/d
I also tried running the benchmark on my laptop (linux with an alder lake processor), and saw some enormous improvements over the SWAR approaches (with clang at least, I’m still not sure why GCC does so badly on my machines). -march=native also makes quite a difference, especially for fastswar_bob:
GCC 12
parse_uint8_fastswar_bob : 1.94 GB/s 755.2 Ma/s 1.32 ns/d
parse_uint8_fastswar : 1.84 GB/s 713.9 Ma/s 1.40 ns/d
parse_uint8_lut_padded : 1.96 GB/s 761.1 Ma/s 1.31 ns/d
parse_uint8_lut_masked : 1.99 GB/s 773.1 Ma/s 1.29 ns/d
GCC 12 with -march=native:
parse_uint8_fastswar_bob : 1.90 GB/s 738.0 Ma/s 1.35 ns/d
parse_uint8_fastswar : 1.85 GB/s 720.3 Ma/s 1.39 ns/d
parse_uint8_lut_padded : 1.95 GB/s 757.7 Ma/s 1.32 ns/d
parse_uint8_lut_masked : 1.99 GB/s 773.7 Ma/s 1.29 ns/d
clang 15:
parse_uint8_fastswar_bob : 1.85 GB/s 719.6 Ma/s 1.39 ns/d
parse_uint8_fastswar : 2.19 GB/s 851.0 Ma/s 1.18 ns/d
parse_uint8_lut_padded : 4.02 GB/s 1560.5 Ma/s 0.64 ns/d
parse_uint8_lut_masked : 2.98 GB/s 1158.7 Ma/s 0.86 ns/d
clang 15 with -march=native:
parse_uint8_fastswar_bob : 2.36 GB/s 918.9 Ma/s 1.09 ns/d
parse_uint8_fastswar : 2.41 GB/s 937.0 Ma/s 1.07 ns/d
parse_uint8_lut_padded : 4.02 GB/s 1561.2 Ma/s 0.64 ns/d
parse_uint8_lut_masked : 3.30 GB/s 1281.0 Ma/s 0.78 ns/d
Here’s the fork I used: https://github.com/PowellNGL/Code-used-on-Daniel-Lemire-s-blog
Added with credit and some small modifications. I can confirm that GCC is slower. It seems to use more instructions to get the job done when using a LUT (about 5 extra instructions).
Christopher Sahnwaldt says:
Here’s another Rust version:
fn parse_uint8_fastswar(b: &[u8]) -> Option<u8> {
if b.len() == 0 || b.len() > 3 { return None; }
let p = b.as_ptr() as *const u32;
let mut digits = unsafe { p.read_unaligned() };
digits ^= 0x30303030;
digits <<= (4 - b.len()) * 8;
let num = ((digits.wrapping_mul(0x640a01)) >> 24) as u8;
let all_digits = ((digits | (digits.wrapping_add(0x06060606))) & 0xF0F0F0F0) == 0;
(all_digits && digits.swap_bytes() <= 0x020505).then_some(num)
}
According to https://godbolt.org/z/Ts8xrqnc7, the resulting assembly code is very similar to the C version:
lea rax, [rsi - 4]
cmp rax, -3
jae .LBB3_2
xor eax, eax
ret
.LBB3_2:
mov eax, 808464432
xor eax, dword ptr [rdi]
neg sil
shl sil, 3
mov ecx, esi
shl eax, cl
imul edx, eax, 6556161
shr edx, 24
lea ecx, [rax + 101058054]
or ecx, eax
test ecx, -252645136
sete cl
bswap eax
cmp eax, 132358
setb al
and al, cl
ret
I haven’t run benchmarks yet. I hope this might be slightly faster than C because Option<u8> is returned in two registers, so there’s no write through a pointer.

See https://github.com/jcsahnwaldt/parse_uint8_fastswar. I’ll add tests and benchmarks later.
The benchmark results are somewhat inconclusive. Depending on CPU (Intel Xeon, Apple M2, Amazon Graviton) and compiler (GCC, Clang):

Sometimes Rust is faster than C, sometimes slower
Sometimes a C function returning an option in two registers is faster than writing the result through a pointer, sometimes slower
Sometimes fastswar is faster than fastswar_bob, sometimes slower
Sometimes one of the fastswar_* versions is as fast as the lut version, but usually lut is the fastest

See the fork at https://github.com/jcsahnwaldt/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28 for details.
Here is the thing. Your code sucks!
It just causes UB, and after that your code is not C/C++. It is written for a SUPER specific case (little-endian architecture, a specific compiler which ignores the UB caused by reading more bytes than is allowed, etc.).
And, more than that, it is absolutely unreadable. I just wrote the most stupid code I could imagine:
static int parse_uint8_switch_case(const char *str, size_t len, uint8_t *num) {
uint8_t hi, mid, lo;
#define as_u8(x) ((uint8_t)((x) - '0'))
switch(len) {
case 1:
*num = as_u8(str[0]);
return *num < 10;
case 2:
hi = as_u8(str[0]);
lo = as_u8(str[1]);
*num = hi * 10 + lo;
return (hi < 10) && (lo < 10);
case 3:
hi = as_u8(str[0]);
mid = as_u8(str[1]);
lo = as_u8(str[2]);
*num = hi * 100 + mid * 10 + lo;
return (hi < 10) && (mid < 10) && (lo < 10);
default:
return 0;
}
#undef as_u8
}
And then I just launched it here.
The results are on the screenshot.
So: I cannot read your code, it causes UB, and after all it is just slow.
Here is the thing. Your code sucks!
The blog post addresses your criticism, explicitly so:

Suppose that you can always read 4 bytes, even if the string is shorter (i.e., there is a buffer).

In production code where you do not know the target platform, you would want to reverse the bytes when the target is a big-endian system. Big-endian systems are vanishingly rare: mostly just mainframes. Modern systems compile a byte reversal to a single fast instruction. For the code in my blog post, I assume that you do not have a big-endian system, which is 99.99% certain.
You claim that there is UB, but you don’t explain why.
As for code readability, that’s an important issue, of course. But then the blog post is specifically about optimizing for performance.
For your benchmark, you use sequential numbers. The blog post explains that the approach being presented is only very beneficial when the inputs are unpredictable; it says so: So the SWAR approach is twice as fast as the naive approach when the inputs are unpredictable.
If your input is predictable, then sure, a naive implementation works. That’s already covered by the blog post.
On my MacBook (Apple M2, LLVM14):
On GCC12 with an Intel Ice Lake processor.
As you can see, your code is running at about half the speed of parse_uint8_fastswar_bob and parse_uint8_lut. In fact, it has the same performance as parse_uint8_naive in my tests.
As for the criticism that the code posted on my blog is not ready for production, please see my terms of use. It is by design. The code presented in my blog posts is not meant to be copied and pasted into your projects.
Jeroen Koekkoek says:
We can go faster if we limit dependencies(?) The idea is to ignore trash bytes and not shift them off. Given correct input, this is considerably faster on my machine. Obviously, it lacks proper error checking, but I think it’s an interesting enough approach to get input from others(?)
__attribute__((noinline))
static int parse_int8(const char *str, size_t len, uint8_t *num)
{
const uint32_t shr = ((len << 3) - 8) & 0x18;
uint32_t dgts;
memcpy(&dgts, str, sizeof(dgts));
dgts &= 0x000f0f0flu;
*num = (uint8_t)((0x640a01 * dgts) >> shr);
return 1;
}
It is about 25% faster than the fastest approach, when using random inputs.
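To trace the idea on a concrete input (our illustration, assuming a little-endian target and 4 readable bytes): for “25”, the variable right shift selects the second lane of the same 0x640a01 multiplication discussed earlier in the thread.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  const char str[4] = {'2', '5', 'x', 'y'};     /* trailing bytes are trash */
  size_t len = 2;
  const uint32_t shr = ((len << 3) - 8) & 0x18; /* len 1,2,3 -> shift 0,8,16 */
  uint32_t dgts;
  memcpy(&dgts, str, sizeof(dgts));
  dgts &= 0x000f0f0f; /* keep only the low nibble of the first three bytes */
  printf("%u\n", (unsigned)(uint8_t)((0x640a01 * dgts) >> shr)); /* prints 25 */
  return 0;
}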