Daniel Lemire's blog


Pruning spaces faster on ARM processors with Vector Table Lookups

16 thoughts on “Pruning spaces faster on ARM processors with Vector Table Lookups”

  1. Cyril Lashkevich says:

    Great work! In the future ARM Scalable Vector Extension there is a perfect instruction, ‘COMPACT’, which will “Read the active elements from the source vector and pack them into the lowest-numbered elements of the destination vector. Then set any remaining elements of the destination vector to zero.” This instruction will make shufmask unneeded. https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

    1. Great work!

      Thanks. I have not made any attempt to optimize the code, beyond writing something that I can understand and that is likely to be correct. So it seems likely we can do even better.

      This instruction will make shufmask unneeded.

      Are you sure? Some AVX-512 instruction sets have compress instructions that do something similar, but they compress 32-bit words, not bytes. So I’d be interested in verifying that the documentation refers to the application of COMPACT to bytes.

      1. Cyril Lashkevich says:

        You are right, COMPACT works with words and doublewords only. But it can still be used: expand 2 times, compact, then narrow 2 times.

        1. But it still can be used (…)

          Of course, the only way to know if it is practical is to write the code and test it out on actual hardware, but I don’t think I have any hardware for it… Do we know when that will be available?

          1. Cyril Lashkevich says:

            Yes, it would be interesting to experiment with such HW. I hope the annual iPhone update brings it to us.

        2. Sam Lee says:

          Full disclosure, I’m a graduate at ARM (and I’m not commenting on behalf of ARM in any way)

          In SVE, the new SPLICE instruction will be able to act on bytes and should cover this benchmark nicely (again performance will be implementation dependent, so we shall see how that goes):
          “Splice two vectors under predicate control. Copy the first active to last active elements (inclusive) from the first source vector to the lowest-numbered elements of the result. Then set any remaining elements of the result to a copy of the lowest-numbered elements from the second source vector. The result is placed destructively in the first source vector.”

          So in SVE this should boil down to 5 instructions per vector (interleaved as appropriate to hide latencies):
          LD1B //load contiguous vector
          CMPGT //set a predicate to 1 where non-white and 0 where whitespace
          SPLICE //group non-white characters in bottom of vector (we don’t care what happens at the top)
          ST1B //store contiguous vector
          INCP //increment pointer by number of non-white characters (using predicate)

          (You can have a look at what’s coming in more detail if you check the XML files from the zip in the link Cyril pointed to)

          1. Ah yes. So it is like Intel’s Parallel Bits Extract, except that it is for bytes.

            That would be wonderful.

          2. wmu says:

            Sam, is there any ARM emulator that works like the Intel Software Development Emulator? I mean one that can run a compiled program using a selected instruction set, so that one could at least test the correctness of an implementation for upcoming architectures.

            1. The answer is apparently positive: you can run ARM SVE through an emulator.

              Sadly, I could not find the emulator itself.

  2. Cyril Lashkevich says:

    Btw the size of the table can be reduced 2 times, because row_n+1 == row_n LSL 8. Two options:
    1. remove all even lines from shufmask, and calculate shuf like this:
    uint16_t index = neonmovemask_addv(w0);
    uint8x16_t shuf0 = vld1q_u8(shufmask + 16 * (index >> 1));
    if (index & 1) {
    shuf0 = vextq_u8(vdupq_n_u8(0), shuf0, 1);
    }

    2. remove all even lines from shufmask, replace the last unused values by zero and load shuf like this:
    uint16_t index = neonmovemask_addv(w0);
    uint8x16_t shuf0 = vld1q_u8(shufmask + 16 * (index >> 1) - (index & 1));

    In the first case there is an additional instruction and a branch; in the second, an access to unaligned memory. In fact the indexes in shufmask are 4-bit, and the table can be compressed 2 times more, but unpacking would require 1 vector multiplication and 1 vector AND.

    1. Cyril Lashkevich says:

      Seems the parser ate part of my comment 🙁 I have to use LSL for logical shift left:
      row_n+1 == row_n LSL 8
      1. remove all even lines from shufmask, and calculate shuf like this:
      uint16_t index = neonmovemask_addv(w0);
      uint8x16_t shuf0 = vld1q_u8(shufmask + 16 * (index >> 1));
      if (index & 1) {
      shuf0 = vextq_u8(vdupq_n_u8(0), shuf0, 1);
      }

      1. I think you were clear enough.

        My guess is that adding a branch to save memory might often be a negative. My current benchmark leaves us with an “easy to predict” branch, so my guess is that if we were to implement it, we would not see a performance difference… however, this could degrade on other, harder benchmarks.

        Your other change is more likely to be beneficial generally speaking. Not that it will be faster, but it will cut the size of the binary.

        We could do a lot better by replacing the 16-bit lookup with two 8-bit lookups, but it might double the number of instructions…

  3. Derek Ledbetter says:

    Here’s my attempt. Like your newest method, I construct a bit mask recording whether each of the 8 characters in a block passed or failed the test, and then use vtbl to extract the correct characters and write them with a single instruction. But I didn’t want to use a lookup table.

    I couldn’t find a simple way to construct the vtbl indices all at once, so I decided to flip the problem around. I do 16 8-character blocks at a time, and I construct the vtbl indices by doing the same operation 8 times, and then I do three rounds of zipping to put them in the correct order.

    In each of the 8 steps, I find the location of the rightmost set bit by computing popcount((b – 1) & ~b), and then I clear that bit by doing b &= b – 1.

    But it turns out to be more than twice as slow as your giant look-up table. On an iPhone 5s, in ns per operation:
    despace: 1.28
    neon_despace: 1.04
    neon_despace_branchless: 0.64
    neontbl_despace: 0.24
    neon_interleaved_despace (my function): 0.58

    I also wrote a simple test app for iOS. I posted all of this at GitHub.

    I have a new idea for computing the vtbl indices, but it probably won’t beat the look-up table.

    1. That sounds very impressive.

    2. Derek Ledbetter says:

      I found a method for taking an integer and separating alternate set bits. I use NEON’s polynomial multiplication feature and multiply the 8-bit integer by 0xFF, then AND the original with the product to get the 1st, 3rd, 5th, … set bits, and AND the original with the complement of the product to get the 2nd, 4th, 6th, … set bits. Then I do this once more, so now I have four bytes, each with at most two set bits. Then I count the leading and trailing zeroes to get the indices of the bits.

      Doing this cut the time from 0.58 to 0.49. Unrolling the loop, doing 256 bytes at once, reduces the time to 0.37, compared with 0.24 using the look-up table.

      1. Wow. I will be checking it out.