Daniel Lemire's blog


Pruning spaces faster on ARM processors with Vector Table Lookups

16 thoughts on “Pruning spaces faster on ARM processors with Vector Table Lookups”

  1. Cyril Lashkevich says:

    Great work! In the future ARM Scalable Vector Extension there is a perfect instruction, ‘COMPACT’, which will “Read the active elements from the source vector and pack them into the lowest-numbered elements of the destination vector. Then set any remaining elements of the destination vector to zero.” This instruction will make shufmask unneeded. https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

    1. Great work!

      Thanks. I have not made any attempt to optimize the code, beyond writing something that I can understand and that is likely to be correct. So it seems likely we can do even better.

      This instruction will make shufmask unneeded.

      Are you sure? Some AVX-512 instruction sets have compress instructions that do something similar, but they compress 32-bit words, not bytes. So I’d be interested in verifying that the documentation refers to the application of COMPACT to bytes.

      1. Cyril Lashkevich says:

        You are right, COMPACT works with words and doublewords only. But it can still be used: expand 2 times, compact, then narrow 2 times.

        1. But it still can be used (…)

          Of course, the only way to know if it is practical is to write the code and test it out on actual hardware, but I don’t think I have any hardware for it… Do we know when that will be available?

          1. Cyril Lashkevich says:

            Yes, it would be interesting to experiment with such HW. I hope the annual iPhone update brings it to us.

        2. Sam Lee says:

          Full disclosure, I’m a graduate at ARM (and I’m not commenting on behalf of ARM in any way)

          In SVE, the new SPLICE instruction will be able to act on bytes and should cover this benchmark nicely (again performance will be implementation dependent, so we shall see how that goes):
          “Splice two vectors under predicate control. Copy the first active to last active elements (inclusive) from the first source vector to the lowest-numbered elements of the result. Then set any remaining elements of the result to a copy of the lowest-numbered elements from the second source vector. The result is placed destructively in the first source vector.”

          So in SVE this should boil down to 5 instructions per vector (interleaved as appropriate to hide latencies):
          LD1B //load contiguous vector
          CMPGT //set a predicate to 1 where non-white and 0 where whitespace
          SPLICE //group non-white characters in bottom of vector (we don’t care what happens at the top)
          ST1B //store contiguous vector
          INCP //increment pointer by number of non-white characters (using predicate)

          (You can have a look at what’s coming in more detail if you check the XML files from the zip in the link Cyril pointed to)

          1. Ah yes. So it is like Intel’s Parallel Bits Extract, except that it is for bytes.

            That would be wonderful.

          2. wmu says:

            Sam, is there any ARM emulator that works like the Intel Software Development Emulator? I mean one that can run a compiled program using a selected instruction set, so that one could at least test the correctness of an implementation for upcoming architectures.

            1. The answer is apparently positive: you can run ARM SVE through an emulator.

              Sadly, I could not find the emulator itself.

  2. Cyril Lashkevich says:

    Btw the size of the table can be reduced 2 times, because row_n+1 == row_n LSL 8. Two options:
    1. remove all even lines from shufmask, and calculate shuf like this:
    uint16_t index = neonmovemask_addv(w0);
    uint8x16_t shuf0 = vld1q_u8(shufmask + 16 * (index >> 1));
    if (index & 1) {
    shuf0 = vextq_u8(vdupq_n_u8(0), shuf0, 1);
    }

    2. remove all even lines from shufmask, replace the last unused values by zero and load shuf like this:
    uint16_t index = neonmovemask_addv(w0);
    uint8x16_t shuf0 = vld1q_u8(shufmask + 16 * (index >> 1) - (index & 1));

    In the first case there is an additional instruction and a branch; in the second, an access to unaligned memory. In fact the indexes in shufmask are 4-bit, and the table can be compressed 2 times more, but unpacking would require 1 vector multiplication and 1 vector AND.

    1. Cyril Lashkevich says:

      Seems the parser ate part of my comment 🙁 I have to use LSL for logical shift left:
      row_n+1 == row_n LSL 8
      1. remove all even lines from shufmask, and calculate shuf like this:
      uint16_t index = neonmovemask_addv(w0);
      uint8x16_t shuf0 = vld1q_u8(shufmask + 16 * (index >> 1));
      if (index & 1) {
      shuf0 = vextq_u8(vdupq_n_u8(0), shuf0, 1);
      }

      1. I think you were clear enough.

        My guess is that adding a branch to save memory might often be a negative. My current benchmark leaves us with an “easy to predict” branch, so my guess is that if we were to implement it, we would not see a performance difference… however, this could degrade on other, harder benchmarks.

        Your other change is more likely to be beneficial generally speaking. Not that it will be faster, but it will cut the size of the binary.

        We could do a lot better by replacing the 16-bit lookup with two 8-bit lookups, but it might double the number of instructions…

  3. Derek Ledbetter says:

    Here’s my attempt. Like your newest method, I construct a bit mask recording whether each of the 8 characters in a block passed or failed the test, and then use vtbl to extract the correct characters and write them with a single instruction. But I didn’t want to use a lookup table.

    I couldn’t find a simple way to construct the vtbl indices all at once, so I decided to flip the problem around. I do 16 8-character blocks at a time, and I construct the vtbl indices by doing the same operation 8 times, and then I do three rounds of zipping to put them in the correct order.

    In each of the 8 steps, I find the location of the rightmost set bit by computing popcount((b – 1) & ~b), and then I clear that bit by doing b &= b – 1.

    But it turns out to be more than twice as slow as your giant look-up table. On an iPhone 5s, in ns per operation:
    despace: 1.28
    neon_despace: 1.04
    neon_despace_branchless: 0.64
    neontbl_despace: 0.24
    neon_interleaved_despace (my function): 0.58

    I also wrote a simple test app for iOS. I posted all of this at GitHub.

    I have a new idea for computing the vtbl indices, but it probably won’t beat the look-up table.

    1. That sounds very impressive.

    2. Derek Ledbetter says:

      I found a method for taking an integer and separating alternate set bits. I use NEON’s polynomial multiplication feature and multiply the 8-bit integer by 0xFF, then AND the original with the product to get the 1st, 3rd, 5th, … set bits, and AND the original with the complement of the product to get the 2nd, 4th, 6th, … set bits. Then I do this once more, so now I have four bytes, each with at most two set bits. Then I count the leading and trailing zeroes to get the indices of the bits.

      Doing this cut the time from 0.58 to 0.49. Unrolling the loop, doing 256 bytes at once, reduces the time to 0.37, compared with 0.24 using the look-up table.

      1. Wow. I will be checking it out.