Daniel Lemire's blog


Parsing 8-bit integers quickly

67 thoughts on “Parsing 8-bit integers quickly”

  1. D. Bohdan says:

    Neat! I have noticed a small mistake:

    memcpy(&digits.as_int, str, sizeof(digits));

    should be

    memcpy(&digits.as_str, str, sizeof(digits));

    1. Christopher Sahnwaldt says:

      I think you’re right. Currently, the code doesn’t use the as_str field at all. Why do we even need the digits union? Am I missing something?

    2. dnnfndmsjs says:

      Nope. The whole trick with the union here is UB. The compiler is allowed to optimize it the way it wants. The proper solution is std::bit_cast. Also, memcpy is a C function that is not constexpr, while std::bit_cast is, allowing the whole function to be constexpr.
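
      For illustration, a std::bit_cast-based load could look something like this (an untested sketch, assuming C++20; load_u32_le is a hypothetical helper name, not code from the post):

      #include <array>
      #include <bit>
      #include <cstddef>
      #include <cstdint>

      // Sketch: gather up to four bytes into a zero-filled array, then
      // std::bit_cast it to uint32_t. Unlike memcpy through a union, this can
      // run in a constexpr context. On a little-endian target it yields the
      // same 32-bit value as the union trick in the post.
      constexpr uint32_t load_u32_le(const char *str, std::size_t len) {
        std::array<unsigned char, 4> bytes{}; // zero-filled
        for (std::size_t i = 0; i < len && i < 4; i++) {
          bytes[i] = static_cast<unsigned char>(str[i]);
        }
        return std::bit_cast<uint32_t>(bytes);
      }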

  2. This line, at the very least, means the code only works on little-endian machines, right:

    digits.as_int <<= (4 - (len & 0x3)) * 8;

    1. For Big Endian, you just need to reverse the bytes, which is typically just one instruction.

      Big endian systems are vanishingly rare.
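
      For reference, the portable byte reversal is usually written along these lines (an untested sketch; GCC and Clang typically recognize the pattern and emit a single bswap/rev instruction, although the language itself does not guarantee that):

      #include <stdint.h>

      // Sketch: portable 32-bit byte reversal; mainstream compilers usually
      // turn this pattern into one instruction.
      static inline uint32_t byteswap32(uint32_t x) {
        return ((x & 0x000000FF) << 24) | ((x & 0x0000FF00) << 8) |
               ((x & 0x00FF0000) >> 8) | ((x & 0xFF000000) >> 24);
      }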

      1. Does the C language expression for such a byte reversal consistently map to just one instruction?

        The rarity of big endian is no excuse to disregard it in a supposedly high-level language.

        1. ShinyHappyREM says:

          > The rarity of big endian is no excuse to disregard it in a supposedly high-level language

          Do you also consider systems that have less/more than 8 bits in a byte or machine word? What about systems that don’t use two’s complement?

          1. Yes, I do, and it can be a massive pain.

        2. Please see the first rule on my terms of use: https://lemire.me/blog/terms-of-use/

          I copied and pasted code from your blog post and it does not compile, it contains a massive security hole, it contains a terrible bug, or it crashed my computer. Can I blame you?
          No. Lemire’s rule: Blogging is literature, not engineering. Code taken from a blog post should not be expected to work; it is meant to illustrate an idea. Don’t build production systems by copying and pasting random code from the Internet. It will not end well.

          I do build a lot of code that is meant to be deployed in production systems. The code used on my blog post is not meant for that. I do not run tests on mainframe platforms, for example.

          1. That’s fair enough. I wasn’t trying to be a jerk.

      2. Jean-Marc Bourguet says:

        Why would you reverse the string instead of changing 0x640a0100 to 0x010a6400 in big endian, or have I misunderstood the way that part works?

        Instead of doing a left shift by a variable amount and a right shift by a constant amount, wouldn’t it be profitable to do only a right shift by a variable amount?

        1. Why would you reverse the string instead of changing 0x640a0100 to 0x010a6400 in big endian, or have I misunderstood the way that part works?

          It is almost surely not sufficient. Can you prove that it is?

          Instead of doing a left shift by a variable amount and a right shift by a constant amount, wouldn’t it be profitable to do only a right shift by a variable amount?

          Can you share your full code? I’d be happy to benchmark it.

          1. Jean-Marc Bourguet says:

            Now that I’ve had time to look at it, the issue with simple multiplication for big endian is that there will be unwanted carries. So it doesn’t work.

            Combining the two shifts is essentially the same idea as KWillets’ and suffers from the same validation issue.

            On the other hand, if I’m not confused again, you can use

            uint32_t all_digits = ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;

            which is simpler but doesn’t reduce the depth of the chain of dependent instructions, so I fear any gain will depend on the micro-architecture.

            And you could use

            *num = (uint8_t)((UINT32_C(0x640a01) * digits.as_int) >> 24);

            which probably doesn’t change anything on 64-bit computers but needs only a 32-bit x 32-bit -> 32-bit multiplication and thus could be useful on non-64-bit machines.

            1. Yep. I have applied your proposals, see the credit section of the blog post. I will also mention you in the code.

  3. D. Bohdan says:

    Sorry, never mind what I wrote about a mistake. I need sleep.

  4. M says:

    Isn’t the “y < 256” comparison in the return statement useless? The value assigned to y is always ANDed with 0xff.

    1. Thanks. It was a small mistake.

    2. D. Bohdan says:

      I wondered whether and how much removing the comparison would speed up parsing. It turns out, a fair bit. On my machine (x86_64, g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) the benchmark goes from

      parse_uint8_fastswar : 1.51 GB/s 589.6 Ma/s 1.70 ns/d

      to

      parse_uint8_fastswar : 1.64 GB/s 638.6 Ma/s 1.57 ns/d

      So GCC didn’t optimize it away.

  5. KWillets says:

    If I’ve got my arithmetic right (or left in this case) the lower 3 bytes of 640a01 * (the unshifted digits) will contain three quantities, in little-endian byte order:

    the first digit times one
    the first digit times ten plus the second digit
    the first digit times 100 plus the second digit times 10 plus the third digit.

    Pick which one to output from the length argument.

    This technique also tolerates cruft on the end, so you can unconditionally load 3-4 bytes and still pull the correct value even if bytes to the left overflow (eg “9” maps to 9/90/900, but the 900 doesn’t overflow into the part we want).
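
    To spell out the arithmetic: 0x640a01 is 100*2^16 + 10*2^8 + 1, so if the masked little-endian word holds the digits d0, d1, d2 in its low three bytes, byte 0 of the product is d0, byte 1 is 10*d0 + d1, and byte 2 is 100*d0 + 10*d1 + d2 (reduced modulo 256 once it exceeds 255, with the excess carrying into byte 3). For “91”, ignoring trailing bytes, the lanes hold 9, 91 and 910, and the lane at index len - 1 (here byte 1, i.e. 91) is the parsed value.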

    1. Sounds brilliant. If I can make it work, I will update with credit to you.

      1. The idea works but I am not sure it could be faster.

        1. KWillets says:

          I see; there are a lot of validation cycles in there that I didn’t consider.

          One thought on validation is that a well-formed ASCII integer will have the intermediate lanes bounded by 9/99/256 up to the output lane (eg “91” will have the first two lanes bounded but 910 > 256 in the third). A non-digit input should fail in one of the lanes (using 0/0/0 lower bound).

        2. D. Bohdan says:

          It looks like parsing the way KWillets suggests then indexing the result is faster than parse_uint8_swar but slower than parse_uint8_fastswar.

          int parse_uint8_fastswar_tweak(const char *str, size_t len, uint8_t *num) {
            union {
              uint8_t as_str[4];
              uint32_t as_int;
            } digits;

            memcpy(&digits.as_int, str, sizeof(digits));
            digits.as_int ^= 0x30303030;
            digits.as_int *= 0x00640a01;
            *num = digits.as_str[(len & 0x3) - 1];
            return (digits.as_int & 0xf0f0f0f0) == 0 && len != 0 && len < 4;
          }

          ---

          volume 25680 bytes
          parse_uint8_swar : 1.04 GB/s 404.3 Ma/s 2.47 ns/d
          parse_uint8_fastswar : 1.51 GB/s 586.4 Ma/s 1.71 ns/d
          parse_uint8_fastswar_tweak : 1.23 GB/s 478.7 Ma/s 2.09 ns/d
          parse_uint8_fromchars : 0.48 GB/s 187.8 Ma/s 5.32 ns/d
          parse_uint8_naive : 0.65 GB/s 252.8 Ma/s 3.96 ns/d

          If you replace indexing with a bit shift, you get performance comparable to parse_uint8_fastswar in a simpler function.

          int parse_uint8_fastswar_tweak(const char *str, size_t len, uint8_t *num) {
            uint32_t digits;

            memcpy(&digits, str, sizeof(digits));
            digits ^= 0x30303030;
            digits *= 0x00640a01;
            *num = (uint8_t)(digits >> (8 * ((len & 0x3) - 1)));
            return (digits & 0xf0f0f0f0) == 0 && len != 0 && len < 4;
          }

          ---

          volume 25680 bytes
          parse_uint8_swar : 1.04 GB/s 404.8 Ma/s 2.47 ns/d
          parse_uint8_fastswar : 1.51 GB/s 589.6 Ma/s 1.70 ns/d
          parse_uint8_fastswar_tweak : 1.50 GB/s 584.0 Ma/s 1.71 ns/d
          parse_uint8_fromchars : 0.48 GB/s 186.8 Ma/s 5.35 ns/d
          parse_uint8_naive : 0.64 GB/s 249.5 Ma/s 4.01 ns/d

  6. walter says:

    Typo: “but it was only about 40% than the naive approach”

    1. Thanks.

  7. Luni says:

    Are you not allowed to use a lookup table?

    if len == 1, do the ASCII conversion
    else if len == 2, append the two digits into a 16-bit int, subtract 0x0300 to convert the first byte, and look the result up in a 2560-item sparsely filled table (a possible sketch follows below)
    else live with a very big table, or do the ASCII conversion on the first byte, multiply by 100, and add the len == 2 conversion.
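
    One way the two-digit case could look (a rough, untested sketch; here the index is taken as (first digit << 8) | second digit after subtracting the ASCII offset, so the largest valid index is 0x0909 = 2313, which fits the 2560-entry table mentioned above; the names are illustrative only):

    #include <stdbool.h>
    #include <stdint.h>

    // Sketch: sparsely filled 2560-entry table for the two-digit case.
    static uint8_t two_digit_value[2560];
    static bool two_digit_valid[2560];

    static void build_two_digit_table(void) {
      for (int hi = 0; hi < 10; hi++) {
        for (int lo = 0; lo < 10; lo++) {
          int idx = (hi << 8) | lo;
          two_digit_value[idx] = (uint8_t)(hi * 10 + lo);
          two_digit_valid[idx] = true;
        }
      }
    }

    static int parse_two_digits(const char *str, uint8_t *num) {
      unsigned idx = ((unsigned)(uint8_t)(str[0] - '0') << 8) |
                     (unsigned)(uint8_t)(str[1] - '0');
      if (idx >= 2560 || !two_digit_valid[idx]) return 0;
      *num = two_digit_value[idx];
      return 1;
    }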

    1. Yes. We can use a lookup table, if only for fun. Do you have an implementation?

  8. Samuel Lee says:

    Hmm, doesn’t the SWAR method give the wrong return value given an input string like “12>”? The validation of only having ASCII digits seems to be missing a validation of 0x3A-0x3F.

    Also there is possibly room to be faster by shifting the result of the multiplication to the right by a variable amount (based on len), rather than shifting the masked input to the left by a variable amount; the computation of the shift amount can then be done in parallel with other work.
    Also, the shift amount can be computed from len directly, rather than (len & 0x3), given that the validity of the result is indicated by the return value (i.e. we don’t care about the shift amount when len > 3).

    1. The current version should provide complete validation although you are correct that the initial code used in this blog post provided only partial validation.

  9. Jylam says:

    First, great work.

    However the test on len!=0 should be at the very start, otherwise the left shift is (((4-0)*8)==32), which is undefined behavior

    1. However the test on len!=0 should be at the very start, otherwise the left shift is (((4-0)*8)==32), which is undefined behavior.

      The C++14 and C17 standards make it defined behaviour. Here is the specification:

      The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 x 2^E2, reduced modulo one more than the maximum value representable in the result type.

      1. Jylam says:

        Yes, as you maybe saw on twitter, I was referring to C99 (this behaviour is the same in C89 IIRC). Anyway, better be safe, and in all cases it’ll return early if len==0, so that’s that.

  10. Greg says:

    The from_chars results are disappointing all around. I am puzzled as to why the naive approach is so much faster than the standard library.

    Indeed. Part of the reason is that the “naive” function has a fixed upper bound on the number of loop iterations (the `r = len & 0x3` clause). As a result (and I checked in Compiler Explorer), the compiler always unrolls the loop. std::from_chars allows an unbounded number of leading zeros, so it can’t unconditionally unroll the loop (without having a suffix afterwards, I guess).

    This doesn’t explain all of the performance difference but, as usual, getting performance out of code often comes down to being very precise about what the actual requirements for the inputs and outputs are.
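
    For illustration, a naive parser with a hard bound on the number of iterations looks roughly like this (a sketch, not the exact function from the post); the constant bound is what lets the compiler unroll the loop fully:

    #include <stddef.h>
    #include <stdint.h>

    // Sketch: at most 3 loop iterations, so the compiler can unroll completely.
    static int parse_uint8_bounded(const char *str, size_t len, uint8_t *num) {
      uint32_t value = 0;
      int valid = (len - 1) < 3; // accept only len in [1, 3]
      len &= 0x3;                // never more than 3 iterations
      for (size_t i = 0; i < len; i++) {
        uint8_t d = (uint8_t)(str[i] - '0');
        valid &= (d < 10);
        value = value * 10 + d;
      }
      *num = (uint8_t)value;
      return valid && value < 256;
    }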

  11. Marcin Zukowski says:

    Cool post, pure fun and a few neat tricks 🙂

    Two comments:

    • If you want to make this routine “safe” (and hence the comparison really fair), it’s enough to check whether the input crosses a page boundary, and take a slow path if it does (or have a SWAR version reading from “the other side”). That should be very easy on the branch predictor, so hopefully it wouldn’t slow things down much. Some hash functions use this approach (a sketch of the check follows below).
    • The benchmarks give some advantage to SWAR considering you’re using a uniform distribution in the range 0..255, making the average string length 2.23. There might be use cases where the expected number of digits is smaller, and then SWAR might be less beneficial or even slower.
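
    For reference, the page-crossing check mentioned in the first point is typically something along these lines (a sketch, assuming 4 KiB pages):

    #include <stdbool.h>
    #include <stdint.h>

    // Sketch: true when a 4-byte load starting at p stays inside the current
    // 4 KiB page, so reading past the end of a short string cannot fault.
    static inline bool safe_to_read_4_bytes(const char *p) {
      return ((uintptr_t)p & 4095) <= 4096 - 4;
    }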

    But I’m nitpicking here 🙂

      1. The first version of our fast JSON parser (simdjson) was page-boundary aware… but this gave us much trouble because people using sanitizers or valgrind would report bugs. I tried arguing back, but it became tiresome. So the current version of simdjson uses padded strings. It works well. I have a friend who writes string functions for a runtime library, but they get their code excluded from valgrind checks and sanitizers… so people don’t see anything… The use case for the function in this blog post is simdzone (parsing zone files for DNS systems), in which case the author knows that the input is padded.

      2. The SWAR version is faster even in the case where we have sequential input. So it has a good chance of being faster in practice. Still, people should benchmark on the data they care about!

      1. KWillets says:

        Is regular old SIMD on the menu? The byte range-checking would certainly be easier, and shuffling of quartets etc.

        1. I actually think SIMD might be more practical.

          But one has to have a good SWAR implementation to compare with!!!

  12. Michael Dunphy says:

    Hi,

    Since reading up to 4 bytes from str is OK, here is another non-SWAR implementation that runs equally fast for the random and sequential cases:

    int parse_uint8_naive_md(const char *str, size_t len, uint8_t *num) {
      if (--len > 2) return 0;
      static const uint8_t sf[] = {0, 0, 1, 10, 100}; // scale factor
      uint64_t d1, d2, d3, s1, s2, s3, n;
      d1 = (uint64_t)str[0] - (uint64_t)'0';
      d2 = (uint64_t)str[1] - (uint64_t)'0';
      d3 = (uint64_t)str[2] - (uint64_t)'0';
      s1 = (uint64_t)sf[len + 2];
      s2 = (uint64_t)sf[len + 1];
      s3 = (uint64_t)sf[len];
      n = s1*d1 + s2*d2 + s3*d3;
      *num = (uint8_t)n;
      return n < 256 && d1 < 10 && d2 < 10 && d3 < 10;
    }

    1. Interesting approach, but I think it does not work as is.

      What if the input is a digit followed by random bytes, and len is set to 1? Your return value depends on str[1] and str[2], which can be garbage.

      1. Michael Dunphy says:

        Good catch! At the expense of a couple more checks, it should be fixed with:

        return n < 256 && d1<10 && (s2==0 || d2<10) && (s3 == 0 || d3<10);

        Meanwhile, for what it’s worth, the performance (of all of the implementations here) appears to be sensitive to the mixture of inputs from the [0,255] range and the [256,999] range. It must be primarily due to the short circuiting of the various terms in the return value expressions. (The quick test is to change val to uint16_t and change %256 to %1000 in the benchmarker; otherwise the inputs are all <256 and we don’t need that validation…).

        Presumably the performance will also depend on the mixture of invalid inputs as well (cases where len=0, len>3, or non digit chars).

        1. The fast function from the blog post should compile to branchless code, except for the length parameter; see the assembly in the blog post. If you use lengths that are outside the allowed range ([1,3]), then the performance can vary… But otherwise, it should be flat. The function’s performance is not data dependent. You can see this in my table of results.

          Make sure you build your code in release mode with full optimizations.

          1. Michael Dunphy says:

            I agree that the fastswar performance is flat between sequential and random. I am finding that using values in [0,999] is a bit faster (about +13%; the mixture is approx 25%/75% valid/invalid), and even faster for values in [256,999] (about +18%, with 100% invalid values). The fastswar return statement uses & and not &&, so indeed it does not short-circuit… so I’m not sure what the source of that extra performance for inputs >255 is.

            1. Michael Dunphy says:

              Scratch that; I was looking at the wrong metric (GB/s)

        2. I don’t think that the following is sufficient because you don’t know the state of s2 and s3, at least in how I interpret the benchmark.

          return n < 256 && d1<10 && (s2==0 || d2<10) && (s3 == 0 || d3<10);

          1. Michael Dunphy says:

            If we get past the first line, len will be 0, 1 or 2 (decremented by 1 from the initial 1, 2 or 3). When len=0, s1,s2,s3 = 1,0,0; for len=1, s1,s2,s3 = 10,1,0; and for len=2, s1,s2,s3 = 100,10,1. So s1,s2,s3 are always assigned. If s2 is zero we don’t care about the value of d2 (so the || short-circuits), but if s2 is nonzero then we do need to check d2. Similarly for s3 & d3. It’s a bit contorted in pursuit of avoiding branches.

  13. Jason M says:

    I got nerdsniped by this a bit. I think your approach is getting close to optimal; it's hard to beat a constant-time operation with a small constant. Nevertheless, my stab at trying to go faster involves creating a perfect hash table with a very fast hash function (the key is the four-byte string cast to a u32). Obviously it trades quite a bit of space for some speed. It could probably be compacted in space with a bit of tweaking. This isn't a complete solution, as it does not do all the error checking and whatnot, but I believe this could be made to go a bit faster than parse_uint8_fastswar, as it should have a smaller constant. To parse the ASCII int, it must be padded with NULs up to the four bytes, and should be aligned properly so the casting works. Then the parse should just be something like:

    lut[simple_hash(*(const unsigned *)str)];

    Lut building code follows:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define SIZE 16384

    uint8_t lut[SIZE] = {0};

    uint32_t simple_hash(uint32_t u32_value) {
      uint64_t hash_val = (uint64_t)u32_value * 1000000000000000000ULL;
      hash_val = (hash_val >> 32) % SIZE;
      return (uint32_t)hash_val;
    }

    void build_lut() {
      char strings[256 * 4];
      memset(strings, 0, sizeof(strings));
      char *iter = strings;
      for (int i = 0; i < 256; ++i) {
        sprintf(iter, "%d", i);
        iter += 4;
      }
      iter = strings;
      for (int i = 0; i < 256; ++i) {
        unsigned c = *(unsigned *)iter;
        iter += 4;
        unsigned idx = simple_hash(c) % SIZE;
        lut[idx] = i;
      }
    }

    1. Jason M says:

      The formatting got butchered a bit.

      I have posted a cleaned up version here:

      https://blog.loadzero.com/blog/parse-int-nerdsnipe/

  14. Bruce A MacNaughton says:

    Wonderful article – I really enjoy this kind of work. The only bad part is that I spent a bit of time implementing this in Rust and took some time to fully understand some of your “magic”.

    One minor observation – I believe that (digits.as_int | (0x06060606 + digits.as_int)) doesn’t need the digits.as_int | component. My test suite passes fully without it.

    My Rust implementation (I can’t seem to get the formatting quite right):

    use std::num::Wrapping;

    fn make_u8(s: &str) -> Option<u8> {
        if s.is_empty() || s.len() > 3 {
            return None;
        }
        let bytes = s.as_bytes();

        // using a union avoids branching on the length to initialize each byte
        // of the u32 interpretation.
        let mut working = unsafe {
            #[repr(C)]
            union U {
                bytes: [u8; 4],
                num: u32,
            }
            // could use uninit here to avoid initialization...
            let mut u = U { num: 0 };
            u.bytes[..s.len()].copy_from_slice(&bytes[..s.len()]);
            u.num
        };

        working ^= 0x30303030;

        working <<= (4 - s.len()) * 8;

        // Wrapping prevents panics on overflow.
        let mult = Wrapping(0x640a01) * Wrapping(working);
        // unwrap it now (could just use .0 but this is more explicit)
        let Wrapping(mult) = mult;

        let num = (mult >> 24) as u8;

        let all_digits = (0x06060606 + working) & 0xF0F0F0F0 == 0;
        let swapped = u32::from_be_bytes(working.to_le_bytes());

        if !all_digits || swapped > 0x00020505 {
            return None;
        }

        Some(num)
    }

    1. Christopher Sahnwaldt says:

      I think the digits.as_int | part is necessary to catch bad input that contains characters with byte values 0 to 9, e.g. "12\003".

      1. Christopher Sahnwaldt says:

        Oh, I was wrong. These cases are caught by 0x06060606 + digits.as_int as well.

        1. Christopher Sahnwaldt says:

          The digits.as_int | part is necessary to catch bad input that contains characters with byte values 0xCA to 0xCF. See https://github.com/jcsahnwaldt/parse_uint8_fastswar
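
          To spell out why: a byte in the range 0xCA-0xCF becomes 0xFA-0xFF after the XOR with 0x30, and adding 0x06 wraps it around to 0x00-0x05, so its high nibble is clear and the (0x06060606 + digits.as_int) & 0xF0F0F0F0 test alone does not flag it (the carry only spills into the neighbouring lane). OR-ing in digits.as_int keeps the original 0xFA-0xFF byte, whose high nibble is set, so the check correctly fails.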

          1. Christopher Sahnwaldt says:

            @Bruce: Your tests pass without the digits.as_int | part because they don’t actually test the byte values 0xCA to 0xCF, although they seem to test values up to 0xFF.

            The reason is that a Rust str contains UTF-8 bytes, not arbitrary bytes, and e.g. the value 0xFF gets converted to the two bytes 0xC3 and 0xBF.

            I posted an issue: https://github.com/bmacnaughton/u8-swar/issues/1

            1. Bruce A MacNaughton says:

              It fails at 0xCA

          2. Bruce A MacNaughton says:

            Yes, you’re right – thank you. I ignored Unicode characters (and their encoding). I will add some additional tests to verify that (and will add the OR’d digits.as_int back in).

  15. Bruce A MacNaughton says:

    I don’t know what happened to the formatting of my previous post, so here’s a link to the code: https://github.com/bmacnaughton/u8-swar.

  16. Nick Powell says:

    I had a go at a simple LUT-based approach, which seems to be faster in most situations I’ve tested, with a few caveats: https://quick-bench.com/q/r0wfNuy0JI0ZWT893FoR0oJHpSc

    There’s no hash table here; I just reinterpreted the 3-byte string as the index into an 8MB array containing every possible result. Despite needing so much memory, only small parts of the array are actually accessed in practice (the ones containing valid values), so it doesn’t waste much cache space.

    There are two versions – one where I’ve transformed the input strings in advance to pad them with nulls, so they can be directly interpreted as an integer, and one that works with the original string data by masking out the unwanted bytes. The version that uses padded strings is almost always faster by a large margin, but it’s kind of cheating, so I put in the version with the mask as a fairer comparison.

    Some things I’ve noticed:
    * The results vary quite a lot depending on whether you use or discard the return code. If you comment out ‘benchmark::DoNotOptimize(valid);’ from the benchmarks, the bit-twiddling functions get much faster
    * Clang seems to be much better at optimising this than GCC
    * The LUT-based approach seems to randomly vary in performance more than the others, which isn’t surprising since it relies much more on data remaining in the cache, and could be affected by other processes on the machine

    1. The results vary quite a lot depending on whether you use or discard the return code.

      That’s likely because your functions are inline-able. If you check my benchmark, you will notice that the functions are compiled separately. That’s by design: I do not want inlining as it changes the problem.

      Clang seems to be much better at optimising this than GCC

      On my test machine, I get with GCC 12 :

      parse_uint8_fastswar_bob : 1.14 GB/s 443.3 Ma/s 2.26 ns/d 3.20 GHz 7.21 c/d 33.01 i/d 2.80 c/b 12.83 i/b 4.58 i/c
      parse_uint8_fastswar : 1.04 GB/s 403.3 Ma/s 2.48 ns/d 3.20 GHz 7.92 c/d 36.01 i/d 3.08 c/b 13.99 i/b 4.54 i/c

      And with clang 16…

      parse_uint8_fastswar_bob                 :   1.11 GB/s  430.2 Ma/s   2.32 ns/d   3.20 GHz   7.43 c/d  30.01 i/d   2.89 c/b  11.66 i/b   4.04 i/c
      parse_uint8_fastswar                     :   1.14 GB/s  444.3 Ma/s   2.25 ns/d   3.20 GHz   7.19 c/d  32.01 i/d   2.79 c/b  12.44 i/b   4.45 i/c
      

      So it is a mixed bag, but I don’t see a large difference.

      1. Nick Powell says:

        That’s likely because your functions are inline-able.

        Ah, that explains it. Here’s the same benchmark using __attribute__((noinline)): https://quick-bench.com/q/80Z_47-AYurJITIaCzYCntYjg9A

        Also, when I run the benchmarks on my machine (linux running in WSL on a Ryzen 5800X), I get this:

        GCC 12:
        parse_uint8_fastswar_bob : 1.55 GB/s 604.2 Ma/s 1.66 ns/d
        parse_uint8_fastswar : 1.60 GB/s 625.0 Ma/s 1.60 ns/d
        parse_uint8_swar : 1.43 GB/s 559.3 Ma/s 1.79 ns/d
        parse_uint8_lut_padded : 1.71 GB/s 668.0 Ma/s 1.50 ns/d
        parse_uint8_lut_masked : 1.86 GB/s 726.0 Ma/s 1.38 ns/d
        parse_uint8_fromchars : 0.61 GB/s 237.8 Ma/s 4.21 ns/d
        parse_uint8_naive : 0.91 GB/s 355.9 Ma/s 2.81 ns/d

        Clang 15:
        parse_uint8_fastswar_bob : 1.73 GB/s 673.2 Ma/s 1.49 ns/d
        parse_uint8_fastswar : 1.70 GB/s 659.8 Ma/s 1.52 ns/d
        parse_uint8_swar : 1.23 GB/s 478.4 Ma/s 2.09 ns/d
        parse_uint8_lut_padded : 2.07 GB/s 802.9 Ma/s 1.25 ns/d
        parse_uint8_lut_masked : 1.78 GB/s 691.3 Ma/s 1.45 ns/d
        parse_uint8_fromchars : 0.72 GB/s 281.3 Ma/s 3.55 ns/d
        parse_uint8_naive : 0.96 GB/s 372.6 Ma/s 2.68 ns/d

        1. Nick Powell says:

          I also tried running the benchmark on my laptop (Linux with an Alder Lake processor), and saw some enormous improvements over the SWAR approaches (with clang at least; I’m still not sure why GCC does so badly on my machines). -march=native also makes quite a difference, especially for fastswar_bob:

          GCC 12
          parse_uint8_fastswar_bob : 1.94 GB/s 755.2 Ma/s 1.32 ns/d
          parse_uint8_fastswar : 1.84 GB/s 713.9 Ma/s 1.40 ns/d
          parse_uint8_lut_padded : 1.96 GB/s 761.1 Ma/s 1.31 ns/d
          parse_uint8_lut_masked : 1.99 GB/s 773.1 Ma/s 1.29 ns/d

          GCC 12 with -march=native:
          parse_uint8_fastswar_bob : 1.90 GB/s 738.0 Ma/s 1.35 ns/d
          parse_uint8_fastswar : 1.85 GB/s 720.3 Ma/s 1.39 ns/d
          parse_uint8_lut_padded : 1.95 GB/s 757.7 Ma/s 1.32 ns/d
          parse_uint8_lut_masked : 1.99 GB/s 773.7 Ma/s 1.29 ns/d

          clang 15:
          parse_uint8_fastswar_bob : 1.85 GB/s 719.6 Ma/s 1.39 ns/d
          parse_uint8_fastswar : 2.19 GB/s 851.0 Ma/s 1.18 ns/d
          parse_uint8_lut_padded : 4.02 GB/s 1560.5 Ma/s 0.64 ns/d
          parse_uint8_lut_masked : 2.98 GB/s 1158.7 Ma/s 0.86 ns/d

          clang 15 with -march=native:
          parse_uint8_fastswar_bob : 2.36 GB/s 918.9 Ma/s 1.09 ns/d
          parse_uint8_fastswar : 2.41 GB/s 937.0 Ma/s 1.07 ns/d
          parse_uint8_lut_padded : 4.02 GB/s 1561.2 Ma/s 0.64 ns/d
          parse_uint8_lut_masked : 3.30 GB/s 1281.0 Ma/s 0.78 ns/d

          Here’s the fork I used: https://github.com/PowellNGL/Code-used-on-Daniel-Lemire-s-blog

          1. Added with credit and some small modifications. I can confirm that GCC is slower. It seems to use more instructions to get the job done when using a LUT (about 5 extra instructions).

  17. Christopher Sahnwaldt says:

    Here’s another Rust version:

    fn parse_uint8_fastswar(b: &[u8]) -> Option<u8> {
        if b.len() == 0 || b.len() > 3 { return None; }
        let p = b.as_ptr() as *const u32;
        let mut digits = unsafe { p.read_unaligned() };
        digits ^= 0x30303030;
        digits <<= (4 - b.len()) * 8;
        let num = ((digits.wrapping_mul(0x640a01)) >> 24) as u8;
        let all_digits = ((digits | (digits.wrapping_add(0x06060606))) & 0xF0F0F0F0) == 0;
        (all_digits && digits.swap_bytes() <= 0x020505).then_some(num)
    }

    According to https://godbolt.org/z/Ts8xrqnc7, the resulting assembly code is very similar to the C version:

    lea rax, [rsi - 4]
    cmp rax, -3
    jae .LBB3_2
    xor eax, eax
    ret
    .LBB3_2:
    mov eax, 808464432
    xor eax, dword ptr [rdi]
    neg sil
    shl sil, 3
    mov ecx, esi
    shl eax, cl
    imul edx, eax, 6556161
    shr edx, 24
    lea ecx, [rax + 101058054]
    or ecx, eax
    test ecx, -252645136
    sete cl
    bswap eax
    cmp eax, 132358
    setb al
    and al, cl
    ret

    I haven’t run benchmarks yet. I hope this might be slightly faster than C because Option<u8> is returned in two registers, so there’s no write through a pointer.

    1. Christopher Sahnwaldt says:

      See https://github.com/jcsahnwaldt/parse_uint8_fastswar. I’ll add tests and benchmarks later.

      1. Christopher Sahnwaldt says:

        The benchmark results are somewhat inconclusive. Depending on CPU (Intel Xeon, Apple M2, Amazon Graviton) and compiler (GCC, Clang):

        • Sometimes Rust is faster than C, sometimes slower.
        • Sometimes a C function returning an option in two registers is faster than writing the result through a pointer, sometimes slower.
        • Sometimes fastswar is faster than fastswar_bob, sometimes slower.
        • Sometimes one of the fastswar_* versions is as fast as the lut version, but usually lut is the fastest.

        See the fork at https://github.com/jcsahnwaldt/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28 for details.

  18. DoesNotMatter says:

    Here is the thing. Your code sucks!

    It just causes UB, and after that your code is not C/C++. It is written for a SUPER specific case (little-endian architecture, a specific compiler which ignores the UB caused by reading more bytes than is allowed, etc.).

    And, more than that, it is absolutely unreadable. I just wrote the most stupid code I could imagine:

    static int parse_uint8_switch_case(const char *str, size_t len, uint8_t *num) {
      uint8_t hi, mid, lo;

    #define as_u8(x) ((uint8_t)((x) - '0'))

      switch (len) {
      case 1:
        *num = as_u8(str[0]);
        return *num < 10;
      case 2:
        hi = as_u8(str[0]);
        lo = as_u8(str[1]);
        *num = hi * 10 + lo;
        return (hi < 10) && (lo < 10);
      case 3:
        hi = as_u8(str[0]);
        mid = as_u8(str[1]);
        lo = as_u8(str[2]);
        *num = hi * 100 + mid * 10 + lo;
        return (hi < 10) && (mid < 10) && (lo < 10);
      default:
        return 0;
      }

    #undef as_u8
    }

    And then just launched it here

    The results on the screenshot

    So. I cannot read your code, it causes UB, and after all it is just slow)

    1. Here is the thing. Your code sucks!

      The blog post addresses your criticism, explicitly so:

      • Suppose that you can always read 4 bytes, even if the string is shorter (i.e., there is a buffer).
      • In production code where you do not know the target platform, you would want to reverse the bytes when the target is a big-endian system. Big endian systems are vanishingly rare: mostly just mainframes. Modern systems compile a byte reversal to a single fast instruction. For the code on my blog post, I assume that you do not have a big-endian system, which is 99.99% certain.

      You claim that there is UB, but you don’t explain why.

      As for code readability, that’s an important issue, of course. But then the blog post is specifically about optimizing for performance.

      For your benchmark, you use sequential numbers. The blog post explains that the approach being presented is only very beneficial when the inputs are unpredictable; it says so: “So the SWAR approach is twice as fast as the naive approach when the inputs are unpredictable.”

      If your input is predictable, then sure, a naive implementation works. That’s already covered by the blog post.

      On my MacBook (Apple M2, LLVM14):

      volume 51473 bytes
      parse_uint8_switch_case                  :   0.76 GB/s  295.6 Ma/s   3.38 ns/d
      parse_uint8_fastswar_bob                 :   1.81 GB/s  702.8 Ma/s   1.42 ns/d
      parse_uint8_fastswar                     :   1.51 GB/s  585.4 Ma/s   1.71 ns/d
      parse_uint8_lut                          :   1.81 GB/s  702.8 Ma/s   1.42 ns/d
      parse_uint8_swar                         :   1.67 GB/s  647.8 Ma/s   1.54 ns/d
      parse_uint8_fromchars                    :   0.39 GB/s  150.9 Ma/s   6.62 ns/d
      parse_uint8_naive                        :   0.76 GB/s  294.7 Ma/s   3.39 ns/d
      

      On GCC12 with an Intel Ice Lake processor.

      parse_uint8_switch_case                  :   0.54 GB/s  211.6 Ma/s   4.73 ns/d   3.19 GHz  15.09 c/d  34.07 i/d   5.86 c/b  13.24 i/b   2.26 i/c
      parse_uint8_fastswar_bob                 :   1.14 GB/s  443.0 Ma/s   2.26 ns/d   3.20 GHz   7.21 c/d  33.01 i/d   2.80 c/b  12.83 i/b   4.58 i/c
      parse_uint8_fastswar                     :   1.04 GB/s  402.6 Ma/s   2.48 ns/d   3.20 GHz   7.94 c/d  36.01 i/d   3.08 c/b  13.99 i/b   4.54 i/c
      parse_uint8_lut                          :   1.17 GB/s  456.5 Ma/s   2.19 ns/d   3.20 GHz   7.00 c/d  32.01 i/d   2.72 c/b  12.44 i/b   4.57 i/c
      parse_uint8_swar                         :   0.87 GB/s  337.5 Ma/s   2.96 ns/d   3.20 GHz   9.47 c/d  43.01 i/d   3.68 c/b  16.71 i/b   4.54 i/c
      parse_uint8_fromchars                    :   0.37 GB/s  143.4 Ma/s   6.97 ns/d   3.19 GHz  22.27 c/d  57.89 i/d   8.65 c/b  22.50 i/b   2.60 i/c
      parse_uint8_naive                        :   0.54 GB/s  210.5 Ma/s   4.75 ns/d   3.19 GHz  15.17 c/d  44.95 i/d   5.90 c/b  17.46 i/b   2.96 i/c
      

      As you can see, your code runs at about half the speed of parse_uint8_fastswar_bob and parse_uint8_lut. In fact, it has the same performance as parse_uint8_naive in my tests.

      As for the criticism that the code posted on my blog is not ready for production, please see my terms of use. It is by design. The code presented in my blog posts is not meant to be copied and pasted into your projects.

  19. Jeroen Koekkoek says:

    We can go faster if we limit dependencies(?) The idea is to ignore trash bytes and not shift them off. Given correct input, this is considerably faster on my machine. Obviously, it lacks proper error checking, but I think it’s an interesting enough approach to get input from others(?)

    __attribute__((noinline))
    static int parse_int8(const char *str, size_t len, uint8_t *num)
    {
      const uint32_t shr = ((len << 3) - 8) & 0x18;
      uint32_t dgts;

      memcpy(&dgts, str, sizeof(dgts));
      dgts &= 0x000f0f0flu;
      *num = (uint8_t)((0x640a01 * dgts) >> shr);

      return 1;
    }

    1. It is about 25% faster than the fastest approach, when using random inputs.