23rd June 2022, 3 min read

Filtering numbers quickly with SVE on Amazon Graviton 3 processors

4 thoughts on “Filtering numbers quickly with SVE on Amazon Graviton 3 processors”

KWillets says:

June 28, 2022 at 12:51 am

Now I have to ask: how big is the vector?
1. Daniel Lemire says:
  
  June 28, 2022 at 3:26 pm
  
  Ah. It is a secret.
  
  (It appears to be 32 bytes.)
  1. Daniel Lemire says:
    
    June 28, 2022 at 3:54 pm
    
    You can actually figure it out from first principles. There are 9 instructions in the main loop…
    .LBB0_1: // =>This Inner Loop Header: Depth=1
        ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
        add x8, x10, x8
        cmpge p1.s, p0/z, z0.s, #0
        compact z0.s, p1, z0.s
        cntp x11, p0, p1.s
        st1w { z0.s }, p0, [x2, x9, lsl #2]
        add x9, x11, x9
        whilelt p0.s, x8, x1
        b.ne .LBB0_1
    
    I report 1.125 instructions per 32-bit word. 1.125 instructions/word * 8 words = 9 instructions, so each pass through the loop handles 8 words.
    
    Eight 32-bit words is 8*4 = 32 bytes.
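The inference above can be checked with a few lines of C. This is only a sketch: the function name `vector_bytes_from_ipw` and its parameters are hypothetical, and the only inputs taken from the thread are the 9 instructions in the loop, the reported 1.125 instructions per word, and the 4-byte word size. It assumes each loop iteration processes exactly one vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: infer the vector width in bytes from the number of
   instructions in the loop body and the measured instructions per word.
   Assumes one loop iteration handles exactly one vector. */
size_t vector_bytes_from_ipw(unsigned loop_instructions,
                             double instructions_per_word,
                             size_t word_bytes) {
  assert(instructions_per_word > 0.0);
  /* words per vector = instructions per iteration / instructions per word */
  double words = (double)loop_instructions / instructions_per_word;
  /* Round to the nearest whole word count to absorb measurement noise. */
  return (size_t)(words + 0.5) * word_bytes;
}
```

Calling `vector_bytes_from_ipw(9, 1.125, 4)` reproduces the 32-byte figure derived above.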
Samuel Lee says:

July 14, 2022 at 1:36 am

I understand Graviton3 is based on Neoverse V1 (https://developer.arm.com/documentation/PJDOC-466751330-9685/0101/).

I’m sure there is performance on the table if you unroll. Looking at the V1 software optimization guide, I think the critical resource is the M0 pipe, where all of the predicate-handling instructions run, with cmpge having a latency of 4 cycles.

I think to maximise perf you would have a main loop where you ensure the load mask is all true for the next 4 loads, something like: https://godbolt.org/z/Mxh7sTen7
(I just checked it compiles / looks good, I have not actually tried to run it, so apologies if there is a dumb logic error!)
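The loop structure described here (a main loop that only runs while four full vectors remain, followed by a masked tail) can be sketched in portable C. This is not the code behind the Godbolt link, and it uses a scalar stand-in rather than SVE intrinsics so that it runs anywhere. The names `filter_block` and `remove_negatives_unrolled`, the 8-lane width, and the 4x unroll factor are assumptions chosen to match the numbers in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed: 8 x 32-bit lanes per vector (32 bytes) and a 4x unroll. */
enum { LANES = 8, UNROLL = 4 };

/* Scalar stand-in for one vector step (load, cmpge, compact, store):
   copy the non-negative values among the first `active` lanes of `in`
   to `out` and return how many were kept. */
size_t filter_block(const int32_t *in, size_t active, int32_t *out) {
  assert(active <= (size_t)LANES);
  size_t kept = 0;
  for (size_t k = 0; k < active; k++) {
    if (in[k] >= 0) out[kept++] = in[k];
  }
  return kept;
}

/* Keep the non-negative values of input[0..count) in output and return
   the number of values written. */
size_t remove_negatives_unrolled(const int32_t *input, size_t count,
                                 int32_t *output) {
  const size_t step = (size_t)LANES * UNROLL;
  size_t i = 0, j = 0;
  /* Main loop: all UNROLL blocks are known to be full, so the load mask
     is all-true and no per-block tail handling is needed. */
  for (; count - i >= step; i += step) {
    for (size_t u = 0; u < UNROLL; u++) {
      j += filter_block(input + i + u * LANES, LANES, output + j);
    }
  }
  /* Tail: fewer than `step` values remain; the last block may be partial. */
  while (i < count) {
    size_t active = count - i < (size_t)LANES ? count - i : (size_t)LANES;
    j += filter_block(input + i, active, output + j);
    i += active;
  }
  return j;
}
```

In an SVE version, the four `filter_block` calls in the main loop would become four independent load/compare/compact/store sequences, which is what gives the predicate pipe several compares in flight at once.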

I _think_ this should mean we can get close to saturating the M0 pipe, assuming we don’t hit a bottleneck somewhere else that I missed. We have 4x cmpge and 4x incp instructions using M0 per loop, so best-case performance would be 0.25 cycles/integer (8 cycles / 32 integers), which is about 3x faster! 🙂
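The bound above is simple arithmetic, sketched here in C. The helper name `best_case_cycles_per_integer` is hypothetical, and it follows the comment's own assumption that the M0 pipe is the only bottleneck and retires one of these instructions per cycle.

```c
#include <assert.h>

/* Hypothetical helper: lower bound on cycles per integer when a single pipe
   that issues one instruction per cycle (an assumption) is the only
   bottleneck. */
double best_case_cycles_per_integer(unsigned pipe_instructions_per_iteration,
                                    unsigned vectors_per_iteration,
                                    unsigned lanes_per_vector) {
  assert(vectors_per_iteration > 0 && lanes_per_vector > 0);
  unsigned integers = vectors_per_iteration * lanes_per_vector;
  /* cycles per iteration = instructions on the pipe, one per cycle */
  return (double)pipe_instructions_per_iteration / (double)integers;
}
```

With 8 M0 instructions (4x cmpge plus 4x incp), 4 vectors per iteration, and 8 lanes per vector, `best_case_cycles_per_integer(8, 4, 8)` gives the 0.25 cycles/integer quoted above.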