Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)
7 thoughts on “Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)”
Cache is often faster than 20GB/s, but all that depends on the CPU, the RAM, and the size of the data. You tested on an M2, but a Cortex-A55 would likely give very different results, so one shouldn’t generalize performance from a single test (particularly on ARM).
In terms of optimisation, I suspect a loop of
accum = vsraq_n_u8(accum, input, 7)
to be simpler/faster. You’ll want to unroll it a bit, with multiple accumulators, to get around the latency limitations of the instruction, and then you’ll need to aggregate results every 255*unroll iterations.

I have updated the blog post with a faster routine. 🙂
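For concreteness, here is a minimal sketch of that loop (my own illustration, not code from the post; the function names are mine, and a scalar fallback is included so it also compiles off-ARM):

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical helper: count Latin-1 bytes >= 0x80 (each needs two UTF-8 bytes).
// vsraq_n_u8(acc, input, 7) adds input >> 7 (0 or 1) to each byte lane,
// so each lane must be flushed before 256 additions to avoid overflow.
size_t count_high_bytes(const uint8_t *p, size_t len) {
    size_t count = 0, i = 0;
    while (i + 16 <= len) {
        uint8x16_t acc = vdupq_n_u8(0);
        size_t block_end = i + 255 * 16; // flush before any lane can overflow
        if (block_end > len) block_end = len;
        for (; i + 16 <= block_end; i += 16) {
            acc = vsraq_n_u8(acc, vld1q_u8(p + i), 7);
        }
        count += vaddvq_u16(vpaddlq_u8(acc)); // widen to u16, then horizontal sum
    }
    for (; i < len; i++) count += p[i] >> 7; // scalar tail
    return count;
}
#else
// Scalar fallback so the sketch also compiles off-ARM.
size_t count_high_bytes(const uint8_t *p, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) count += p[i] >> 7;
    return count;
}
#endif

// UTF-8 size of a Latin-1 string: one byte per character, plus one extra
// byte for every character >= 0x80.
size_t utf8_size_from_latin1(const uint8_t *p, size_t len) {
    return len + count_high_bytes(p, len);
}
```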
Well, that goes to show that the 20GB/s you initially assumed was hard to exceed can indeed be exceeded if you try. =)
Interested to see what you get with a
vsraq_n_u8
strategy!

Point taken.
Your ‘simpler’ variant may not compile without adding a
vreinterpretq_s8_u8
to the
vaddvq_s8
line.

Tested on a Neoverse N1:
                       ns/byte     GB/s
scalar (with autovec)  0.169661    5.89412
                       0.177977    5.61869
                       0.174651    5.72571
kvakil                 0.0565536   17.6824
                       0.0565536   17.6824
                       0.0582169   17.1771
faster                 0.0465735   21.4714
                       0.0482369   20.731
                       0.0482369   20.731
shift                  0.0149701   66.8
                       0.0166334   60.12
                       0.0166334   60.12
Shift version: https://godbolt.org/z/azaqfP5ox
Yeah, USRA is the perfect instruction for this task. On M2, with 3-cycle latency and 4-per-cycle throughput, you may want to split it across 12 separate accumulator registers. However, you’d be memory bound: M2 only has 3-per-cycle load throughput, so it should max out at 9 separate accumulator registers. (48 bytes/cycle at 3.5GHz gives a theoretical 168GB/s, but that’s just the maximum speed for loads.)
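The multiple-accumulator scheme described above can be sketched as follows (my own hedged illustration with 4 accumulators rather than 9, to keep it short; tail handling is done with a scalar loop, and a scalar fallback lets it compile off-ARM):

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Sketch of the unrolled inner loop: 4 independent USRA accumulators so
// consecutive vsraq_n_u8 instructions don't wait on each other's latency.
// (Hypothetical helper name; the comment above argues for 9 accumulators on M2.)
size_t count_high_bytes_unrolled(const uint8_t *p, size_t len) {
    size_t count = 0, i = 0;
    while (i + 64 <= len) {
        uint8x16_t a0 = vdupq_n_u8(0), a1 = a0, a2 = a0, a3 = a0;
        size_t block_end = i + 255 * 64; // flush before any byte lane overflows
        if (block_end > len) block_end = len;
        for (; i + 64 <= block_end; i += 64) {
            a0 = vsraq_n_u8(a0, vld1q_u8(p + i), 7);
            a1 = vsraq_n_u8(a1, vld1q_u8(p + i + 16), 7);
            a2 = vsraq_n_u8(a2, vld1q_u8(p + i + 32), 7);
            a3 = vsraq_n_u8(a3, vld1q_u8(p + i + 48), 7);
        }
        uint16x8_t s = vpaddlq_u8(a0); // widen before summing to avoid overflow
        s = vaddq_u16(s, vpaddlq_u8(a1));
        s = vaddq_u16(s, vpaddlq_u8(a2));
        s = vaddq_u16(s, vpaddlq_u8(a3));
        count += vaddvq_u16(s);
    }
    for (; i < len; i++) count += p[i] >> 7; // scalar tail
    return count;
}
#else
// Scalar fallback so the sketch also compiles off-ARM.
size_t count_high_bytes_unrolled(const uint8_t *p, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) count += p[i] >> 7;
    return count;
}
#endif
```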
I agree that to maximise throughput on Apple Firestorm, at least 9 accumulators in the innermost loop, with 3 of them written by a USRA instruction per cycle, seems to be the best you can do. Dougall, I really appreciate https://dougallj.github.io/applecpu/ !
For Neoverse N1/N2 and V1/V2 this is not quite optimal, though; only half of their ASIMD pipelines support USRA, so for N1/N2 the maximum throughput is 16B/cycle, and for V1/V2 it is 32B/cycle when using USRA alone. See the software optimization guides.
For these microarchitectures I think it’s better to have two thirds of the accumulators written with USRA, and one third of the accumulators written with 2 instructions: CMLT (against zero) and SUB.
If scheduled correctly, this should give a 50% throughput improvement vs. the pure USRA approach (i.e. 24B/cycle and 48B/cycle respectively).
You need at least 8 USRA accumulator registers to make full use of USRA on V1/V2 (4-cycle latency x 2 pipes of throughput). Then you can have 4 accumulators for the CMLT+SUB approach.
As Firestorm is bound by loads (3 ASIMD loads per cycle vs. 4 ASIMD execution units), this variant should run as fast on that uarch as the fully unrolled pure USRA approach.
(Untested!) example code here:
https://godbolt.org/z/1M8411fGf
Note it looks like GCC does what I intended, but clang unfortunately optimizes the 2 instructions back to USRA.
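A minimal sketch of the CMLT+SUB step itself (my own illustration, not the code from the godbolt link): CMLT against zero yields an all-ones byte wherever the sign bit is set, and subtracting that mask from the accumulator adds one per high byte, the same effect as USRA but using instructions that more ASIMD pipes support. A scalar fallback is included so the sketch compiles off-ARM.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical helper: count bytes >= 0x80 using CMLT (zero) + SUB.
// The real routine would interleave blocks like this with USRA accumulators.
size_t count_high_bytes_cmlt(const uint8_t *p, size_t len) {
    size_t count = 0, i = 0;
    while (i + 16 <= len) {
        uint8x16_t acc = vdupq_n_u8(0);
        size_t block_end = i + 255 * 16; // flush before any lane can overflow
        if (block_end > len) block_end = len;
        for (; i + 16 <= block_end; i += 16) {
            int8x16_t input = vreinterpretq_s8_u8(vld1q_u8(p + i));
            // CMLT #0: 0xFF wherever the byte is negative as signed, i.e. >= 0x80.
            uint8x16_t mask = vcltzq_s8(input);
            // SUB: acc - 0xFF is acc + 1 (mod 256) in each high-byte lane.
            acc = vsubq_u8(acc, mask);
        }
        count += vaddvq_u16(vpaddlq_u8(acc)); // widen, then horizontal sum
    }
    for (; i < len; i++) count += p[i] >> 7; // scalar tail
    return count;
}
#else
// Scalar fallback so the sketch also compiles off-ARM.
size_t count_high_bytes_cmlt(const uint8_t *p, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) count += p[i] >> 7;
    return count;
}
#endif
```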