Here’s a branchless NEON version using the VTBL1 instruction: https://gist.github.com/mmozeiko/be2e8afdf5a0a82b7dbdc8f013abfb5f
On my Raspberry Pi 3 (AArch64, Cortex-A53), a benchmark compiled with clang 4.0.1 shows that this variant is a tiny bit faster than neon_despace:
despace(buffer, N) : 6.81 ns per operation
neon_despace(buffer, N) : 4.15 ns per operation
neon_despace_branchless(buffer, N) : 4.09 ns per operation
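For readers who don’t want to click through, here is a rough sketch of the general idea behind such a table-driven kernel (my own reconstruction, not the code from the gist; the names build_table and despace_tbl_sketch are mine). It compacts 8 bytes at a time with vtbl1_u8 and a 256-entry shuffle table indexed by the whitespace mask:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t shuffle_table[256][8]; // one shuffle per 8-bit whitespace mask

// Call once at startup: entry [mask] lists the indices of the
// non-whitespace bytes, padded with 0xFF (vtbl maps out-of-range to 0).
static void build_table(void) {
    for (int mask = 0; mask < 256; mask++) {
        int pos = 0;
        for (int i = 0; i < 8; i++)
            if (!(mask & (1 << i))) // bit i set means byte i is whitespace
                shuffle_table[mask][pos++] = (uint8_t)i;
        while (pos < 8)
            shuffle_table[mask][pos++] = 0xFF;
    }
}

static size_t despace_tbl_sketch(uint8_t *bytes, size_t n) {
    static const uint8_t bit_per_lane[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    const uint8x8_t bits = vld1_u8(bit_per_lane);
    size_t i = 0, out = 0;
    for (; i + 8 <= n; i += 8) {
        uint8x8_t v = vld1_u8(bytes + i);
        // 0xFF in each lane holding a space, newline, or carriage return
        uint8x8_t ws = vorr_u8(vceq_u8(v, vdup_n_u8(' ')),
                       vorr_u8(vceq_u8(v, vdup_n_u8('\n')),
                               vceq_u8(v, vdup_n_u8('\r'))));
        uint8_t mask = vaddv_u8(vand_u8(ws, bits)); // one bit per whitespace byte
        uint8x8_t packed = vtbl1_u8(v, vld1_u8(shuffle_table[mask]));
        vst1_u8(bytes + out, packed); // overlapping stores are fine: out <= i
        out += 8 - (size_t)__builtin_popcount(mask);
    }
    for (; i < n; i++) { // scalar tail
        uint8_t c = bytes[i];
        if (c != ' ' && c != '\n' && c != '\r')
            bytes[out++] = c;
    }
    return out;
}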
Thanks. Currently, I get the following:
So it is not great.
Update: This was gcc. If I compile with clang, I get the following:
despace(buffer, N) : 1.40 ns per operation
neon_despace(buffer, N) : 1.04 ns per operation
neon_despace_branchless(buffer, N) : 1.01 ns per operation
I don’t think that the difference between 1.04 and 1.01 is actually significant, but it is still a nice result.
I updated the code to do only one VTBL1 instruction on AArch64 (based on information from Sam’s comment).
Now the code is noticeably faster than the “neon_despace” variant:
despace(buffer, N) : 6.65 ns per operation
neon_despace(buffer, N) : 4.00 ns per operation
neon_despace_branchless(buffer, N) : 3.54 ns per operation
But the result is incorrect. The table for vqtbl1q should be calculated in a different way.
Wow, impressive work; that’s blazing fast, even for an ARM. How did you figure this out?
Interesting post, and a good analysis of the number of cycles.
It would be very interesting to see a power comparison. After all, if a 95 W x86 takes 1.2 cycles to process something but a 20 W ARM takes 2.4 cycles, the x86 is eating huge amounts of power to achieve its result.
What would the table look like if we broke it down by the number of cycles and the performance per watt?
My guess is you’d find the ARM beating the x86 significantly.
There are 4.5 W Kaby Lake CPUs out there that have SSE4, so I don’t think you’re going to get much joy by normalising for power.
I think this is just an instance where x86 has an instruction that ARM is missing. It’s not really fundamental to the architecture of either CPU.
is_not_zero can be implemented in one instruction on arm64.
#include <arm_neon.h>
// uaddlv widens and sums all 16 bytes; the result is nonzero
// if and only if the vector contains a nonzero byte.
static inline uint16_t is_not_zero(uint8x16_t v) {
    return vaddlvq_u8(v);
}
Thanks. It seems that vaddlvq_u8 might be slower.
It’s useful when you need the count of 0xff bytes in a vector: it’s (vaddlvq_u8(v) >> 8) + 1.
Yes. Important observation.
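To spell out the arithmetic (my own sketch, assuming v is a comparison mask whose bytes are all 0x00 or 0xff): with c bytes set, vaddlvq_u8 returns 255c = 256c − c, so a right shift by 8 yields c − 1 for any 1 ≤ c ≤ 16, and the +1 restores c. Note that c = 0 also maps to 1, so the formula presumes at least one set byte.

#include <arm_neon.h>
// Hypothetical helper: count of 0xff bytes in a mask vector.
// 255*c = 256*c - c, so (sum >> 8) = c - 1 for 1 <= c <= 16.
static inline uint16_t count_0xff(uint8x16_t mask) {
    return (uint16_t)((vaddlvq_u8(mask) >> 8) + 1); // assumes c >= 1
}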
Wouldn’t you prefer vaddlvq_u8(vorrq_u8(v,vdupq_n_u8(1)))? It is the same number of instructions and the movi call in a tight loop could get optimized away.
Depends on context. For example, here using vaddlvq_u8 improves performance a bit: https://gist.github.com/notorca/c731fc6a916849c3be4f4a8f55b3c583
Cortex A53 (pine64):
despace(buffer, N) : 3.48 ns per operation
neon_despace(buffer, N) : 2.11 ns per operation
neon_despace_branchless(buffer, N) : 2.04 ns per operation
neon_despace_branchless_my(buffer, N) : 1.83 ns per operation
iPhone SE:
pointer alignment = 32768 bytes
memcpy(tmpbuffer,buffer,N) : 0.093 ns per operation
despace(buffer, N) : 0.79 ns per operation
neon_despace(buffer, N) : 0.64 ns per operation
neon_despace_branchless(buffer, N) : 0.43 ns per operation
neon_despace_branchless_my(buffer, N) : 0.40 ns per operation
On the processor I am using, vqmovn_u64 (uqxtn) has a latency of 4 and a throughput of 1; meanwhile, uaddlv has a latency of 8 (over 8-bit values) and the same throughput. So vqmovn_u64 (uqxtn) is preferable, I think.
Most other operations, such as a shift by an immediate (ushr) or a bitwise OR (vorr), have a latency of 3. So something like vaddlvq_u8(vshrq_n_u8(v,7)) ends up having a latency of 8 + 3 = 11 cycles.
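The uqxtn-based variant being compared here is not shown in the thread; a minimal sketch of what it presumably looks like (the function name is mine):

#include <arm_neon.h>
// Saturating-narrow trick: uqxtn squashes each 64-bit lane to 32 bits,
// so the packed 64-bit result is nonzero iff v had any nonzero byte.
static inline uint64_t is_not_zero_uqxtn(uint8x16_t v) {
    uint32x2_t narrowed = vqmovn_u64(vreinterpretq_u64_u8(v));
    return vget_lane_u64(vreinterpret_u64_u32(narrowed), 0);
}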
Or vaddlvq_u8(vshrq_n_u8(v,7))?
Slight correction: in AArch64, the TBL instruction can shuffle 16 bytes at once; this is exposed in the intrinsics as vqtbl1q.
Yes it does, e.g., vqtbl1q_u8. My mistake.
Your GitHub code has a mistake. The `vld4q_u8` intrinsic performs a de-interleaving load: for instance, the first vector will hold bytes { 0, 4, 8, …, 60 } of the 64-byte block rather than the first 16 bytes. The correct method is to use `vld1q_u8` four times, incrementing the address by 16 each time.
Good catch!
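A quick illustration of the fix (a sketch; `load_block` and `p` are my names):

#include <arm_neon.h>
// vld4q_u8 would de-interleave, giving out[0] bytes 0, 4, 8, ..., 60.
// Four plain vld1q_u8 loads return the four consecutive 16-byte vectors.
static void load_block(const uint8_t *p, uint8x16_t out[4]) {
    out[0] = vld1q_u8(p);      // bytes 0..15
    out[1] = vld1q_u8(p + 16); // bytes 16..31
    out[2] = vld1q_u8(p + 32); // bytes 32..47
    out[3] = vld1q_u8(p + 48); // bytes 48..63
}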