I think it’s important to note that we’re not doing traditional parsing (at least as I’ve seen it), which involves parsing from left to right (most significant to least). The thing that makes this work is that we “right-justify” the number in the SIMD register (put the least significant digit in the most significant byte of the SIMD register), .

This contrasts with traditional left-to-right number parsers: by knowing where the least significant digit of the number is, we know exactly how much each digit is worth (the rightmost is digit10^0, penultimate is digit10^1, etc…). We no longer need to do the conventional “read the next digit, multiply the total by 10, read the next digit and add it in,” which creates data dependency from digit to digit.

Because we support up to 20 digit numbers, we start with a 32 byte register (though we quickly shrink to a 16-byte register in step 2, since 2 digits can still fit in one byte):

And after that, we manually extract the 32-bit numbers, multiply them by 10^16, 10^8 and 10^0, and add them to get the 64-bit result (in this case, 1234567890).

One interesting bit to me: with AVX-512, we could also parse 2 numbers simultaneously without modifying the algorithm. (We could parse much more if we knew they were all 8 bytes or less!) If we want 32-bit numbers, you could parse 4 at a time!

Note that you could do this without knowing the full size of the number if you load the 32 bytes (assumes padding), look for non-digits, and use that to find the end, and then “right-justifying” all the digits with a byte shift/shuffle.

sasuke420says:

A commenter on another blog suggests using vpshldvw then vpdpbusd to do the first 2 multiply-and-add steps at once by computing 4a, 4b, c, d then mul-adding with 250, 25, 10, 1.

I have various thoughts about this problem from https://highload.fun/tasks/1 but that seems to be a very different problem (it is a lot about detecting the edge of the number and zeroing out the rest of the register, or finding a way to shuffle certain digits from several numbers of unknown lengths into fixed locations in multiple registers).

John Keisersays:Oh nice, thanks for fleshing this one out!

I think it’s important to note that we’re not doing traditional parsing (at least as I’ve seen it), which involves parsing from left to right (most significant to least). The thing that makes this work is that we “right-justify” the number in the SIMD register (put the least significant digit in the most significant byte of the SIMD register), .

This contrasts with traditional left-to-right number parsers: by knowing where the least significant digit of the number is, we know exactly how much each digit is worth (the rightmost is digit

10^0, penultimate is digit10^1, etc…). We no longer need to do the conventional “read the next digit, multiply the total by 10, read the next digit and add it in,” which creates data dependency from digit to digit.Because we support up to 20 digit numbers, we start with a 32 byte register (though we quickly shrink to a 16-byte register in step 2, since 2 digits can still fit in one byte):

`String: 1234567890`

SIMD step 1 (8-bit x 32): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 0

SIMD step 2 (8-bit x 16): 00 00 00 00 00 00 00 00 00 00 00 12 34 56 78 90

SIMD step 3 (16-bit x 8): 0000 0000 0000 0000 0000 0012 3456 7890

SIMD step 4 (32-bit x 4): 00000000 00000000 00000012 34567890

And after that, we manually extract the 32-bit numbers, multiply them by 10^16, 10^8 and 10^0, and add them to get the 64-bit result (in this case, 1234567890).

One interesting bit to me: with AVX-512, we could also parse 2 numbers simultaneously without modifying the algorithm. (We could parse much more if we knew they were all 8 bytes or less!) If we want 32-bit numbers, you could parse 4 at a time!

Note that you could do this without knowing the full size of the number if you load the 32 bytes (assumes padding), look for non-digits, and use that to

findthe end, and then “right-justifying” all the digits with a byte shift/shuffle.sasuke420says:A commenter on another blog suggests using vpshldvw then vpdpbusd to do the first 2 multiply-and-add steps at once by computing 4a, 4b, c, d then mul-adding with 250, 25, 10, 1.

I have various thoughts about this problem from https://highload.fun/tasks/1 but that seems to be a very different problem (it is a lot about detecting the edge of the number and zeroing out the rest of the register, or finding a way to shuffle certain digits from several numbers of unknown lengths into fixed locations in multiple registers).

eden segalsays:In here

https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html#comment-6285552965

The blogger talks about this problem too, which is pretty similar to your discussion, but please look at the comment by sharp-o which describes how to use vnni to simplify the byte to 32 bit reduction, after which you need a regular mul and horizontal add.

Horizontal add of u32 can be done using _mm_mpsadbw_epu8 in a new method you probably heard about, but I can’t find a link.

sasuke420says:This horizontal add technique sounds quite interesting!

Subhajit Sahusays:Thanks a lot, this gives awesome performance. I didnt need the non-digit check, so i skipped it.