Daniel Lemire's blog


Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)

7 thoughts on “Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)”

  1. -.- says:

    but once you reach speeds like 20 GB/s, it becomes difficult to go much faster without hitting memory and cache speed limits

    Cache is often faster than 20GB/s, but all that depends on the CPU, RAM and size of data. You tested on an M2, but a Cortex A55 likely gives very different results, so one shouldn’t assume general performance from just one test (particularly with ARM).

    In terms of optimisation, I suspect a loop of accum = vsraq_n_u8(accum, input, 7) to be simpler/faster. You’ll want to unroll it a bit, with multiple accumulators, to get around latency limitations of the instruction, and then will need to aggregate results every 255*unroll iterations.
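    The vsraq_n_u8 idea above can be modeled in portable scalar C. This is a sketch, not the NEON code itself: each 8-bit lane accumulates `input >> 7` (1 for bytes ≥ 0x80, which expand to two UTF-8 bytes, else 0), exactly what USRA with shift 7 does, and the accumulator is flushed before the 255-block overflow limit the comment mentions. The function name `utf8_length_latin1` is illustrative.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Scalar model of accum = vsraq_n_u8(accum, input, 7):
       (byte >> 7) is 1 for Latin-1 bytes >= 0x80 (two UTF-8 bytes)
       and 0 otherwise; per-lane 8-bit accumulators count them. */
    size_t utf8_length_latin1(const uint8_t *s, size_t n) {
        size_t answer = n; /* every byte needs at least one UTF-8 byte */
        uint8_t accum[16] = {0}; /* models one 16-lane NEON register */
        size_t i = 0, blocks = 0;
        for (; i + 16 <= n; i += 16) {
            for (int j = 0; j < 16; j++)
                accum[j] += s[i + j] >> 7; /* models vsraq_n_u8 */
            if (++blocks == 255) { /* flush before 8-bit lanes overflow */
                for (int j = 0; j < 16; j++) { answer += accum[j]; accum[j] = 0; }
                blocks = 0;
            }
        }
        for (int j = 0; j < 16; j++) answer += accum[j];
        for (; i < n; i++) answer += s[i] >> 7; /* scalar tail */
        return answer;
    }
    ```

    In the real NEON version the unrolling with several independent accumulators, as suggested above, hides the latency of the USRA instruction.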

    1. I have updated the blog post with a faster routine. 🙂

      1. -.- says:

        Well, that goes to show that 20GB/s, which you initially assumed was hard to exceed, can be beaten if you try. =)

        Interested to see what you get with a vsraq_n_u8 strategy!

        1. Point taken.

  2. -.- says:

    Your ‘simpler’ variant may not compile without adding a vreinterpretq_s8_u8 to the vaddvq_s8 line.

    Tested on a Neoverse N1:

    scalar (with autovec)
    ns/bytes 0.169661
    GB/s 5.89412
    ns/bytes 0.177977
    GB/s 5.61869
    ns/bytes 0.174651
    GB/s 5.72571

    kvakil
    ns/bytes 0.0565536
    GB/s 17.6824
    ns/bytes 0.0565536
    GB/s 17.6824
    ns/bytes 0.0582169
    GB/s 17.1771

    faster
    ns/bytes 0.0465735
    GB/s 21.4714
    ns/bytes 0.0482369
    GB/s 20.731
    ns/bytes 0.0482369
    GB/s 20.731

    shift
    ns/bytes 0.0149701
    GB/s 66.8
    ns/bytes 0.0166334
    GB/s 60.12
    ns/bytes 0.0166334
    GB/s 60.12

    Shift version: https://godbolt.org/z/azaqfP5ox

    1. Dougall Johnson says:

      Yeah, USRA is the perfect instruction for this task. On M2, with 3-cycle latency and 4-per-cycle throughput, you would want to split the work across 12 separate accumulator registers. However, you’d be memory bound: M2 has only 3-per-cycle load throughput, so it should max out at 9 separate accumulator registers. (48 bytes/cycle at 3.5GHz gives a theoretical 168GB/s, but that’s just the maximum speed for loads.)

      1. Samuel Lee says:

        I agree that to maximise throughput on Apple Firestorm, the best you can do seems to be at least 9 accumulators in the innermost loop, with 3 of them written by a USRA instruction per cycle. Dougall, I really appreciate https://dougallj.github.io/applecpu/ !

        For Neoverse N1/N2 and V1/V2 this is not quite optimal, though: only half of their ASIMD pipelines support USRA, so when using USRA alone the maximum throughput is 16B/cycle on N1/N2 and 32B/cycle on V1/V2. See the software optimization guides.

        For these microarchitectures I think it’s better to have two thirds of the accumulators written with USRA, and the remaining third written with 2 instructions: CMLT (zero) and SUB.

        If scheduled correctly, this should give a 50% throughput improvement vs. using the pure USRA approach (i.e. 24B/cycle and 48B/cycle respectively).
        You need at least 8 USRA accumulator registers to make full use of USRA on V1/V2 (4 cycles latency x 2 pipes throughput). Then you can have 4 accumulators for CMLT+SUB approach.

        As Firestorm is bound by loads (3 ASIMD loads per cycle vs. 4 ASIMD execution units), this variant should run as fast on that microarchitecture as the fully unrolled pure USRA approach.

        (Untested!) example code here:
        https://godbolt.org/z/1M8411fGf

        Note it looks like GCC does what I intended, but clang unfortunately optimizes the 2 instructions back to USRA.
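    The CMLT+SUB pair described above can also be modeled in portable scalar C. This is only a sketch of the counting trick, not the scheduled NEON code from the Godbolt link: CMLT #0 yields an all-ones byte (0xFF) for each lane with the sign bit set (i.e. each Latin-1 byte ≥ 0x80), and subtracting 0xFF is the same as adding 1 modulo 256, so the pair counts high bytes just like USRA while using pipelines that lack USRA support. The helper name `count_high_bytes` is hypothetical.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Scalar model of the CMLT #0 + SUB alternative to USRA. */
    size_t count_high_bytes(const uint8_t *s, size_t n) {
        size_t total = 0;
        uint8_t accum[16] = {0}; /* models one 16-lane NEON register */
        size_t i = 0, blocks = 0;
        for (; i + 16 <= n; i += 16) {
            for (int j = 0; j < 16; j++) {
                /* CMLT #0: 0xFF where the signed byte is negative */
                uint8_t mask = ((int8_t)s[i + j] < 0) ? 0xFF : 0x00;
                /* SUB: subtracting 0xFF adds 1 modulo 256 */
                accum[j] = (uint8_t)(accum[j] - mask);
            }
            if (++blocks == 255) { /* flush before 8-bit lanes wrap */
                for (int j = 0; j < 16; j++) { total += accum[j]; accum[j] = 0; }
                blocks = 0;
            }
        }
        for (int j = 0; j < 16; j++) total += accum[j];
        for (; i < n; i++) total += s[i] >> 7; /* scalar tail */
        return total;
    }
    ```

    As the final comment notes, a compiler may recognize this pattern and fold the two operations back into USRA, which defeats the point of spreading work across pipelines; the Godbolt link shows which compilers preserve the pair.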