3 min read
Filtering numbers quickly with SVE on Amazon Graviton 3 processors
4 thoughts on “Filtering numbers quickly with SVE on Amazon Graviton 3 processors”
Now I have to ask: how big is the vector?
Ah. It is a secret.
(It appears to be 32 bytes.)
You can actually figure it out from first principles. There are 9 instructions in the main loop…
.LBB0_1: // =>This Inner Loop Header: Depth=1
ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
add x8, x10, x8
cmpge p1.s, p0/z, z0.s, #0
compact z0.s, p1, z0.s
cntp x11, p0, p1.s
st1w { z0.s }, p0, [x2, x9, lsl #2]
add x9, x11, x9
whilelt p0.s, x8, x1
b.ne .LBB0_1
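For readers less used to SVE assembly, here is a rough scalar model of what the loop above appears to do: keep the values that pass the `cmpge ..., #0` test (that is, the non-negative ones) and pack them contiguously into the output. The function name `filter_non_negative_scalar` is hypothetical and is not taken from the original post. The vector loop handles a whole vector per iteration, whereas this sketch handles one element at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (an assumption about the intent of the assembly, not the
   author's code): copy the non-negative values of input to output,
   preserving their order, and return how many values were kept. */
size_t filter_non_negative_scalar(const int32_t *input, size_t count,
                                  int32_t *output) {
    assert(input != NULL && output != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (input[i] >= 0) {            /* plays the role of cmpge ..., #0 */
            output[kept] = input[i];    /* plays the role of compact + st1w */
            kept++;                     /* plays the role of cntp + add */
        }
    }
    return kept;
}
```

Calling it on `{3, -1, 0, -7, 5}` would keep `3, 0, 5` and return 3.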
I report 1.125 instructions per 32-bit word. 1.125 instructions/word * 8 words = 9 instructions, so each pass through the loop handles 8 words.
Eight 32-bit words is 8*4 = 32 bytes.
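The arithmetic above can be written as a small helper. This is only a sketch: the name `inferred_vector_bytes` is hypothetical, and it assumes each loop iteration processes exactly one full vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: infer the vector width in bytes from the number of
   instructions in the loop body and the measured instructions per element.
   Assumes one full vector is processed per loop iteration. */
size_t inferred_vector_bytes(double loop_instructions,
                             double instructions_per_element,
                             size_t element_bytes) {
    assert(instructions_per_element > 0.0);
    /* elements per iteration =
       instructions per iteration / instructions per element */
    double lanes = loop_instructions / instructions_per_element;
    /* round to the nearest whole number of lanes */
    return (size_t)(lanes + 0.5) * element_bytes;
}
```

With 9 loop instructions, 1.125 instructions per word and 4-byte words, `inferred_vector_bytes(9.0, 1.125, 4)` gives 32. On a machine with SVE you could skip the inference and read the width directly with the ACLE intrinsic `svcntb()` from `<arm_sve.h>`.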
I understand Graviton3 is based on Neoverse V1 (https://developer.arm.com/documentation/PJDOC-466751330-9685/0101/).
I’m sure there is performance left on the table that unrolling would recover. Looking at the V1 software optimization guide, I think the critical resource is the M0 pipe, where all of the predicate-handling instructions run, with cmpge having a latency of 4 cycles.
I think to maximise performance you would have a main loop that ensures the load mask is all-true for the next 4 loads, something like: https://godbolt.org/z/Mxh7sTen7
(I just checked it compiles / looks good, I have not actually tried to run it, so apologies if there is a dumb logic error!)
I _think_ this should mean we can get close to saturating the M0 pipe, assuming we don’t hit a bottleneck somewhere else that I missed. We have 4x cmpge and 4x incp instructions using M0 per loop iteration. So best-case performance would be 0.25 cycles/integer (8 cycles / 32 integers), or about 3x faster! 🙂
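As a sanity check on that estimate, here is the same bound as a tiny function. It is a sketch under the assumption stated above: a single pipe is the only bottleneck and it accepts one of these instructions per cycle. The function name `best_case_cycles_per_element` is hypothetical.

```c
#include <assert.h>

/* Best-case cycles per element when one pipe is the only bottleneck.
   Assumption (from the estimate above, not measured): the pipe issues one
   of the counted instructions per cycle. */
double best_case_cycles_per_element(int pipe_ops_per_iteration,
                                    int vectors_per_iteration,
                                    int lanes_per_vector) {
    assert(vectors_per_iteration > 0 && lanes_per_vector > 0);
    int elements = vectors_per_iteration * lanes_per_vector;
    return (double)pipe_ops_per_iteration / (double)elements;
}
```

For the unrolled loop, `best_case_cycles_per_element(8, 4, 8)` counts 8 M0 instructions (4 cmpge plus 4 incp) over 4 vectors of 8 lanes, which gives 0.25 cycles per integer. Any other bottleneck (loads, stores, the compact instruction) would push the real number higher.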