Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)
7 thoughts on “Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)”
Cache is often faster than 20GB/s, but all that depends on the CPU, the RAM, and the size of the data. You tested on an M2, but a Cortex-A55 would likely give very different results, so one shouldn’t generalize performance from a single test (particularly on ARM).
In terms of optimisation, I suspect a loop of
accum = vsraq_n_u8(accum, input, 7)
to be simpler/faster. You’ll want to unroll it a bit, with multiple accumulators, to get around the latency limitations of the instruction, and then you’ll need to aggregate results every 255*unroll iterations.

I have updated the blog post with a faster routine. 🙂
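For concreteness, here is a minimal sketch of that loop (my own illustration, not code from the post; the function names are mine, and a scalar fallback is included so it also compiles off-ARM):

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical helper: count Latin-1 bytes >= 0x80 (each needs two UTF-8 bytes).
// vsraq_n_u8(acc, input, 7) adds input >> 7 (0 or 1) to each byte lane,
// so each lane must be flushed before 256 additions to avoid overflow.
size_t count_high_bytes(const uint8_t *p, size_t len) {
    size_t count = 0, i = 0;
    while (i + 16 <= len) {
        uint8x16_t acc = vdupq_n_u8(0);
        size_t block_end = i + 255 * 16; // flush before any lane can overflow
        if (block_end > len) block_end = len;
        for (; i + 16 <= block_end; i += 16) {
            acc = vsraq_n_u8(acc, vld1q_u8(p + i), 7);
        }
        count += vaddvq_u16(vpaddlq_u8(acc)); // widen to u16, then horizontal sum
    }
    for (; i < len; i++) count += p[i] >> 7; // scalar tail
    return count;
}
#else
// Scalar fallback so the sketch also compiles off-ARM.
size_t count_high_bytes(const uint8_t *p, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) count += p[i] >> 7;
    return count;
}
#endif

// UTF-8 size of a Latin-1 string: one byte per character, plus one extra
// byte for every character >= 0x80.
size_t utf8_size_from_latin1(const uint8_t *p, size_t len) {
    return len + count_high_bytes(p, len);
}
```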
Well, that goes to show that the 20GB/s you initially assumed was hard to exceed can indeed be exceeded if you try. =)
Interested to see what you get with a
vsraq_n_u8
strategy!

Point taken.
Your ‘simpler’ variant may not compile without adding a
vreinterpretq_s8_u8
to the
vaddvq_s8
line.

Tested on a Neoverse N1:
                       ns/byte     GB/s
scalar (with autovec)  0.169661    5.89412
                       0.177977    5.61869
                       0.174651    5.72571
kvakil                 0.0565536   17.6824
                       0.0565536   17.6824
                       0.0582169   17.1771
faster                 0.0465735   21.4714
                       0.0482369   20.731
                       0.0482369   20.731
shift                  0.0149701   66.8
                       0.0166334   60.12
                       0.0166334   60.12
Shift version: https://godbolt.org/z/azaqfP5ox
Yeah, USRA is the perfect instruction for this task. On M2, with 3-cycle latency and 4-per-cycle throughput, you may want to split it across 12 separate accumulator registers. However, you’d be memory bound: M2 only has 3-per-cycle load throughput, so it should max out at 9 separate accumulator registers. (48 bytes/cycle at 3.5GHz gives a theoretical 168GB/s, but that’s just the maximum speed for loads.)
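The multiple-accumulator scheme described above can be sketched as follows (my own hedged illustration with 4 accumulators rather than 9, to keep it short; tail handling is done with a scalar loop, and a scalar fallback lets it compile off-ARM):

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Sketch of the unrolled inner loop: 4 independent USRA accumulators so
// consecutive vsraq_n_u8 instructions don't wait on each other's latency.
// (Hypothetical helper name; the comment above argues for 9 accumulators on M2.)
size_t count_high_bytes_unrolled(const uint8_t *p, size_t len) {
    size_t count = 0, i = 0;
    while (i + 64 <= len) {
        uint8x16_t a0 = vdupq_n_u8(0), a1 = a0, a2 = a0, a3 = a0;
        size_t block_end = i + 255 * 64; // flush before any byte lane overflows
        if (block_end > len) block_end = len;
        for (; i + 64 <= block_end; i += 64) {
            a0 = vsraq_n_u8(a0, vld1q_u8(p + i), 7);
            a1 = vsraq_n_u8(a1, vld1q_u8(p + i + 16), 7);
            a2 = vsraq_n_u8(a2, vld1q_u8(p + i + 32), 7);
            a3 = vsraq_n_u8(a3, vld1q_u8(p + i + 48), 7);
        }
        uint16x8_t s = vpaddlq_u8(a0); // widen before summing to avoid overflow
        s = vaddq_u16(s, vpaddlq_u8(a1));
        s = vaddq_u16(s, vpaddlq_u8(a2));
        s = vaddq_u16(s, vpaddlq_u8(a3));
        count += vaddvq_u16(s);
    }
    for (; i < len; i++) count += p[i] >> 7; // scalar tail
    return count;
}
#else
// Scalar fallback so the sketch also compiles off-ARM.
size_t count_high_bytes_unrolled(const uint8_t *p, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) count += p[i] >> 7;
    return count;
}
#endif
```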
I agree that to maximise throughput on Apple Firestorm, at least 9 accumulators in the innermost loop, with 3 of them written by a USRA instruction per cycle, seems to be the best you can do. Dougall, I really appreciate https://dougallj.github.io/applecpu/ !
For Neoverse N1/N2 and V1/V2 this is not quite optimal, though; only half of their ASIMD pipelines support USRA, so for N1/N2 the maximum throughput is 16B/cycle, and for V1/V2 it is 32B/cycle when using USRA alone. See the software optimization guides.
For these microarchitectures I think it’s better to have two thirds of the accumulators written with USRA, and one third of the accumulators written with 2 instructions: CMLT (against zero) and SUB.
If scheduled correctly, this should give a 50% throughput improvement vs. the pure USRA approach (i.e. 24B/cycle and 48B/cycle respectively).
You need at least 8 USRA accumulator registers to make full use of USRA on V1/V2 (4-cycle latency x 2 pipes of throughput). Then you can have 4 accumulators for the CMLT+SUB approach.
As Firestorm is bound by loads (3 ASIMD loads per cycle vs. 4 ASIMD execution units), this variant should run as fast on that uarch as the fully unrolled pure USRA approach.
(Untested!) example code here:
https://godbolt.org/z/1M8411fGf
Note it looks like GCC does what I intended, but clang unfortunately optimizes the 2 instructions back to USRA.
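A minimal sketch of the CMLT+SUB step itself (my own illustration, not the code from the godbolt link): CMLT against zero yields an all-ones byte wherever the sign bit is set, and subtracting that mask from the accumulator adds one per high byte, the same effect as USRA but using instructions that more ASIMD pipes support. A scalar fallback is included so the sketch compiles off-ARM.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical helper: count bytes >= 0x80 using CMLT (zero) + SUB.
// The real routine would interleave blocks like this with USRA accumulators.
size_t count_high_bytes_cmlt(const uint8_t *p, size_t len) {
    size_t count = 0, i = 0;
    while (i + 16 <= len) {
        uint8x16_t acc = vdupq_n_u8(0);
        size_t block_end = i + 255 * 16; // flush before any lane can overflow
        if (block_end > len) block_end = len;
        for (; i + 16 <= block_end; i += 16) {
            int8x16_t input = vreinterpretq_s8_u8(vld1q_u8(p + i));
            // CMLT #0: 0xFF wherever the byte is negative as signed, i.e. >= 0x80.
            uint8x16_t mask = vcltzq_s8(input);
            // SUB: acc - 0xFF is acc + 1 (mod 256) in each high-byte lane.
            acc = vsubq_u8(acc, mask);
        }
        count += vaddvq_u16(vpaddlq_u8(acc)); // widen, then horizontal sum
    }
    for (; i < len; i++) count += p[i] >> 7; // scalar tail
    return count;
}
#else
// Scalar fallback so the sketch also compiles off-ARM.
size_t count_high_bytes_cmlt(const uint8_t *p, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) count += p[i] >> 7;
    return count;
}
#endif
```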