Daniel Lemire's blog

, 5 min read

Parsing integers quickly with AVX-512

5 thoughts on “Parsing integers quickly with AVX-512”

  1. John Keiser says:

    Oh nice, thanks for fleshing this one out!

    I think it’s important to note that we’re not doing traditional parsing (at least as I’ve seen it), which involves parsing from left to right (most significant to least). The thing that makes this work is that we “right-justify” the number in the SIMD register (put the least significant digit in the most significant byte of the SIMD register), .

    This contrasts with traditional left-to-right number parsers: by knowing where the least significant digit of the number is, we know exactly how much each digit is worth (the rightmost is digit10^0, penultimate is digit10^1, etc…). We no longer need to do the conventional “read the next digit, multiply the total by 10, read the next digit and add it in,” which creates data dependency from digit to digit.

    Because we support up to 20 digit numbers, we start with a 32 byte register (though we quickly shrink to a 16-byte register in step 2, since 2 digits can still fit in one byte):

    String: 1234567890
    SIMD step 1 (8-bit x 32): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 0
    SIMD step 2 (8-bit x 16): 00 00 00 00 00 00 00 00 00 00 00 12 34 56 78 90
    SIMD step 3 (16-bit x 8): 0000 0000 0000 0000 0000 0012 3456 7890
    SIMD step 4 (32-bit x 4): 00000000 00000000 00000012 34567890

    And after that, we manually extract the 32-bit numbers, multiply them by 10^16, 10^8 and 10^0, and add them to get the 64-bit result (in this case, 1234567890).

    One interesting bit to me: with AVX-512, we could also parse 2 numbers simultaneously without modifying the algorithm. (We could parse much more if we knew they were all 8 bytes or less!) If we want 32-bit numbers, you could parse 4 at a time!

    Note that you could do this without knowing the full size of the number if you load the 32 bytes (assumes padding), look for non-digits, and use that to find the end, and then “right-justifying” all the digits with a byte shift/shuffle.

  2. sasuke420 says:

    A commenter on another blog suggests using vpshldvw then vpdpbusd to do the first 2 multiply-and-add steps at once by computing 4a, 4b, c, d then mul-adding with 250, 25, 10, 1.

    I have various thoughts about this problem from https://highload.fun/tasks/1 but that seems to be a very different problem (it is a lot about detecting the edge of the number and zeroing out the rest of the register, or finding a way to shuffle certain digits from several numbers of unknown lengths into fixed locations in multiple registers).

  3. eden segal says:

    In here
    The blogger talks about this problem too, which is pretty similar to your discussion, but please look at the comment by sharp-o which describes how to use vnni to simplify the byte to 32 bit reduction, after which you need a regular mul and horizontal add.

    Horizontal add of u32 can be done using _mm_mpsadbw_epu8 in a new method you probably heard about, but I can’t find a link.

    1. sasuke420 says:

      This horizontal add technique sounds quite interesting!

  4. Subhajit Sahu says:

    Thanks a lot, this gives awesome performance. I didnt need the non-digit check, so i skipped it.