Daniel Lemire's blog


Filtering numbers quickly with SVE on Amazon Graviton 3 processors

4 thoughts on “Filtering numbers quickly with SVE on Amazon Graviton 3 processors”

  1. KWillets says:

    Now I have to ask: how big is the vector?

    1. Ah. It is a secret.

      (It appears to be 32 bytes.)

      1. You can actually figure it out from first principles. There are 9 instructions in the main loop…

        .LBB0_1: // =>This Inner Loop Header: Depth=1
        ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
        add x8, x10, x8
        cmpge p1.s, p0/z, z0.s, #0
        compact z0.s, p1, z0.s
        cntp x11, p0, p1.s
        st1w { z0.s }, p0, [x2, x9, lsl #2]
        add x9, x11, x9
        whilelt p0.s, x8, x1
        b.ne .LBB0_1

        I report 1.125 instructions per 32-bit word. 1.125 instructions/word * 8 words = 9 instructions.

        Eight 32-bit words is 8*4 = 32 bytes.
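
To make that reasoning concrete, here is a minimal scalar sketch in C of what the nine-instruction loop appears to do, assuming 8 lanes per vector (the 32-byte figure derived above). The names `filter_model`, `instructions_per_word`, `LANES` and `LOOP_INSTRUCTIONS` are hypothetical and only for illustration; this is a model of the loop's behaviour, not SVE code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };             /* assumed: 32-byte vector = 8 x 32-bit words */
enum { LOOP_INSTRUCTIONS = 9 }; /* instructions in the loop body quoted above */

/* Scalar model of the loop: copies the non-negative values of in[0..n) to
 * out, in order, and returns how many were kept. out needs room for n
 * values. If iterations is non-NULL it receives the number of loop passes. */
static size_t filter_model(const int32_t *in, size_t n, int32_t *out,
                           size_t *iterations)
{
    size_t kept = 0, passes = 0;
    for (size_t i = 0; i < n; i += LANES, passes++) {
        /* whilelt: lanes past the end of the input are inactive */
        size_t active = (n - i < LANES) ? n - i : LANES;
        int32_t packed[LANES] = {0};
        size_t count = 0;
        /* ld1w + cmpge + compact: pack the non-negative lanes to the front */
        for (size_t lane = 0; lane < active; lane++) {
            if (in[i + lane] >= 0) {
                packed[count++] = in[i + lane];
            }
        }
        /* st1w under p0: every active lane is stored, not just the kept
         * ones; the next pass overwrites the unused slots */
        for (size_t lane = 0; lane < active; lane++) {
            out[kept + lane] = packed[lane];
        }
        kept += count; /* cntp + add */
    }
    if (iterations != NULL) {
        *iterations = passes;
    }
    return kept;
}

/* Instructions per input word implied by the model. */
static double instructions_per_word(size_t passes, size_t n)
{
    assert(n > 0);
    return (double)(LOOP_INSTRUCTIONS * passes) / (double)n;
}
```

With 64 input words the model takes 8 passes, so 72 instructions, which gives back the 1.125 instructions per word quoted above.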

  2. Samuel Lee says:

    I understand Graviton3 is based on Neoverse V1 (https://developer.arm.com/documentation/PJDOC-466751330-9685/0101/).

    I’m sure there is performance left on the table if you unroll. Looking at the V1 software optimization guide, I think the critical resource is the M0 pipe, where all of the predicate-handling instructions run, with cmpge having a latency of 4 cycles.

    I think to maximise perf you would have a main loop where you ensure the load mask is all true for the next 4 loads, something like: https://godbolt.org/z/Mxh7sTen7
    (I just checked it compiles / looks good, I have not actually tried to run it, so apologies if there is a dumb logic error!)
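
For readers who want the shape of that idea without SVE hardware, the following is a rough scalar sketch in C. It is not the code behind the Godbolt link. It assumes 8 lanes per vector and an unroll factor of 4, and the names `compact_block` and `filter_unrolled_model` are hypothetical. The main loop only runs while four full vectors are in bounds, which is the condition under which a real SVE version could use an all-true predicate for its loads; the tail falls back to a partial block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8, UNROLL = 4 }; /* assumed: 8 lanes, 4 vectors per pass */

/* Packs the non-negative values among in[0..active) into out and returns
 * how many were kept (stands in for cmpge + compact + store). */
static size_t compact_block(const int32_t *in, size_t active, int32_t *out)
{
    size_t count = 0;
    for (size_t lane = 0; lane < active; lane++) {
        if (in[lane] >= 0) {
            out[count++] = in[lane];
        }
    }
    return count;
}

/* Copies the non-negative values of in[0..n) to out; returns the count. */
static size_t filter_unrolled_model(const int32_t *in, size_t n, int32_t *out)
{
    assert(in != NULL && out != NULL);
    size_t i = 0, kept = 0;
    /* Main loop: UNROLL full vectors are in bounds, so no lane needs
     * masking for any of the four loads. */
    for (; i + UNROLL * LANES <= n; i += UNROLL * LANES) {
        for (size_t v = 0; v < UNROLL; v++) {
            kept += compact_block(in + i + v * LANES, LANES, out + kept);
        }
    }
    /* Tail: fewer than UNROLL * LANES values remain, so the last block
     * may be partial. */
    for (; i < n; i += LANES) {
        size_t active = (n - i < LANES) ? n - i : LANES;
        kept += compact_block(in + i, active, out + kept);
    }
    return kept;
}
```

In a real SVE version the four blocks of the main loop would be four independent load, compare and compact sequences, which is what would let them overlap in the pipeline.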

    I _think_ this should get us close to saturating the M0 pipe, assuming we don’t hit some other bottleneck that I missed. We have 4x cmpge and 4x incp instructions using M0 per loop iteration, so the best-case performance would be 0.25 cycles/integer (8 cycles / 32 integers), about 3x faster! 🙂
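
The bound in that last paragraph is simple arithmetic, spelled out here as a small C helper. The function name `pipe_bound_cycles_per_int` is hypothetical, and it assumes the bottleneck pipe issues one instruction per cycle and that each vector holds eight 32-bit integers.

```c
#include <assert.h>

/* Lower bound on cycles per integer when a single pipe is the bottleneck,
 * assuming that pipe issues one instruction per cycle. */
static double pipe_bound_cycles_per_int(unsigned pipe_ops_per_pass,
                                        unsigned vectors_per_pass,
                                        unsigned lanes_per_vector)
{
    unsigned ints_per_pass = vectors_per_pass * lanes_per_vector;
    assert(ints_per_pass > 0);
    return (double)pipe_ops_per_pass / (double)ints_per_pass;
}
```

Calling `pipe_bound_cycles_per_int(4 + 4, 4, 8)` corresponds to the 4 cmpge plus 4 incp instructions over 4 vectors of 8 integers, that is 8 cycles for 32 integers.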