Daniel Lemire's blog

Pruning spaces from strings quickly on ARM processors

19 thoughts on “Pruning spaces from strings quickly on ARM processors”

  1. Martins Mozeiko says:

    Here’s a branchless NEON version using the VTBL1 instruction (a sketch of the idea follows the numbers below): https://gist.github.com/mmozeiko/be2e8afdf5a0a82b7dbdc8f013abfb5f

    On my Raspberry Pi 3 (AArch64, Cortex-A53), a benchmark compiled with clang 4.0.1 shows that this variant is a tiny bit faster than neon_despace:

    despace(buffer, N) : 6.81 ns per operation
    neon_despace(buffer, N) : 4.15 ns per operation
    neon_despace_branchless(buffer, N) : 4.09 ns per operation
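
    A minimal sketch of the table-lookup idea, assuming AArch64 (for vaddv_u8) and an 8-bytes-per-step layout; the helper names and the table construction here are illustrative, not the gist’s exact code:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* shuf[m][j] lists the positions of the bytes to keep for the 8-bit
       whitespace mask m, padded with 0xFF (vtbl1 yields 0 for such
       out-of-range indices). Call build_table() once before despacing. */
    static uint8_t shuf[256][8];

    static void build_table(void) {
        for (int m = 0; m < 256; m++) {
            int j = 0;
            for (int i = 0; i < 8; i++)
                if (!(m & (1 << i))) shuf[m][j++] = (uint8_t)i;
            while (j < 8) shuf[m][j++] = 0xFF;
        }
    }

    /* Branchless inner loop: each iteration loads, shuffles and stores
       8 bytes; there are no data-dependent branches. */
    static size_t despace_tbl(char *out, const char *in, size_t n) {
        static const uint8_t bitsarr[8] = {1, 2, 4, 8, 16, 32, 64, 128};
        uint8x8_t bits = vld1_u8(bitsarr);
        char *o = out;
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            uint8x8_t x = vld1_u8((const uint8_t *)(in + i));
            uint8x8_t ws = vorr_u8(vorr_u8(vceq_u8(x, vdup_n_u8(' ')),
                                           vceq_u8(x, vdup_n_u8('\n'))),
                                   vceq_u8(x, vdup_n_u8('\r')));
            unsigned m = vaddv_u8(vand_u8(ws, bits)); /* whitespace mask */
            vst1_u8((uint8_t *)o, vtbl1_u8(x, vld1_u8(shuf[m])));
            o += 8 - __builtin_popcount(m);           /* kept bytes */
        }
        for (; i < n; i++) { /* scalar tail */
            char c = in[i];
            if (c != ' ' && c != '\n' && c != '\r') *o++ = c;
        }
        return (size_t)(o - out);
    }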

    1. Thanks. Currently, I get the following:

      despace(buffer, N)                      :  1.40 ns per operation
      neon_despace(buffer, N)                 :  1.07 ns per operation
      neon_despace_branchless(buffer, N)      :  3.81 ns per operation
      

      So it is not great.

      1. Update: This was gcc. If I compile with clang, I get the following:

        despace(buffer, N)                      :  1.40 ns per operation
        neon_despace(buffer, N)                 :  1.04 ns per operation
        neon_despace_branchless(buffer, N)      :  1.01 ns per operation
        

        I don’t think that the difference between 1.04 and 1.01 is actually significant, but it is still a nice result.

        1. Martins Mozeiko says:

          I updated the code to do only one VTBL1 instruction on AArch64 (based on information from Sam’s comment).

          Now the code is noticeably faster than the “neon_despace” variant:
          despace(buffer, N) : 6.65 ns per operation
          neon_despace(buffer, N) : 4.00 ns per operation
          neon_despace_branchless(buffer, N) : 3.54 ns per operation

          1. Cyril Lashkevich says:

            But the result is incorrect. The table for vqtbl1q should be calculated in a different way.

  2. Wow, impressive work; that’s blazing fast even if it’s an ARM. How did you figure this out?

  3. Interesting post – and good analysis on the # of cycles.

    It would be very interesting to see a power comparison. After all, if a 95 W x86 takes 1.2 cycles to process something but a 20 W ARM takes 2.4 cycles, the x86 is eating huge amounts of power to achieve its result.

    What would the table look like if we broke it down using the # of cycles and the performance per watt?

    My guess is you’d find the ARM beating the x86 significantly.

    1. matthew says:

      There are 4.5 W Kaby Lake CPUs out there that have SSE4, so I don’t think you’re going to get much joy by normalising for power.

      I think this is just an instance where x86 has an instruction that ARM is missing. It’s not really fundamental to the architecture of either CPU.

  4. Cyril Lashkevich says:

    is_not_zero can be implemented in one instruction on arm64.

    static inline uint16_t is_not_zero(uint8x16_t v) {
        // Widening sum of all 16 lanes: at most 16 * 255 = 4080, which
        // fits in uint16_t, so the sum is nonzero iff some byte is.
        return vaddlvq_u8(v);
    }

    1. Thanks. It seems that vaddlvq_u8 might be slower.

      1. Cyril Lashkevich says:

        It’s useful when you need the count of 0xff bytes in a vector: it’s (vaddlvq_u8(v) >> 8) + 1.
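
        To check the arithmetic (hypothetical helper name): each 0xff byte contributes 255 = 256 - 1 to the widening sum, so with k such bytes the sum is 256k - k, and (sum >> 8) equals k - 1 for 1 <= k <= 16, hence the + 1. Note the all-zero vector also yields 1, so this assumes at least one 0xff byte.

        #include <arm_neon.h>
        #include <stdint.h>

        /* Count lanes equal to 0xff; assumes every lane is 0x00 or 0xff
           and at least one lane is 0xff (an all-zero input returns 1). */
        static inline uint16_t count_ff(uint8x16_t mask) {
            return (uint16_t)((vaddlvq_u8(mask) >> 8) + 1);
        }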

        1. Yes. Important observation.

          Wouldn’t you prefer vaddlvq_u8(vorrq_u8(v,vdupq_n_u8(1)))? It is the same number of instructions and the movi call in a tight loop could get optimized away.

          1. Cyril Lashkevich says:

            Depends on the context. For example, here using vaddlvq_u8 improves performance a bit: https://gist.github.com/notorca/c731fc6a916849c3be4f4a8f55b3c583

            Cortex A53 (pine64):
            despace(buffer, N) : 3.48 ns per operation
            neon_despace(buffer, N) : 2.11 ns per operation
            neon_despace_branchless(buffer, N) : 2.04 ns per operation
            neon_despace_branchless_my(buffer, N) : 1.83 ns per operation

            iPhone SE:
            pointer alignment = 32768 bytes
            memcpy(tmpbuffer,buffer,N) : 0.093 ns per operation

            despace(buffer, N) : 0.79 ns per operation
            neon_despace(buffer, N) : 0.64 ns per operation
            neon_despace_branchless(buffer, N) : 0.43 ns per operation
            neon_despace_branchless_my(buffer, N) : 0.40 ns per operation

            1. On the processor I am using, vqmovn_u64 (uqxtn) has a latency of 4 and a throughput of 1; meanwhile, uaddlv has a latency of 8 (over 8-bit values) and the same throughput. So vqmovn_u64 (uqxtn) is preferable, I think.

              Most other operations, such as a shift by an immediate (ushr) or a bitwise OR (vorr), have a latency of three. So something like vaddlvq_u8(vshrq_n_u8(v,7)) ends up having a latency of 8 + 3 = 11 cycles.

          2. Or vaddlvq_u8(vshrq_n_u8(v,7))?
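
             One reason to like this variant (hypothetical helper name below): the logical shift turns each 0xff lane into 1, so the widening sum counts the set lanes directly, with no + 1 correction and a correct answer for the all-zero vector.

             #include <arm_neon.h>
             #include <stdint.h>

             /* Each 0xff lane becomes 1 after the shift, so the widening
                sum is exactly the number of 0xff lanes (0 through 16). */
             static inline uint16_t count_ff_shift(uint8x16_t mask) {
                 return vaddlvq_u8(vshrq_n_u8(mask, 7));
             }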

  5. Sam says:

    Slight correction: in AArch64 the TBL instruction can shuffle 16 bytes at once; this is exposed in the intrinsics as vqtbl1q.
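
    For concreteness, a minimal sketch of a full 16-byte shuffle with that intrinsic (the byte-reversal table is just an illustration):

    #include <arm_neon.h>

    /* vqtbl1q_u8 looks up all 16 lanes of idx in the 16-byte table x
       with a single TBL instruction; here it reverses the bytes. */
    static inline uint8x16_t reverse16(uint8x16_t x) {
        static const uint8_t rev[16] = {15, 14, 13, 12, 11, 10, 9, 8,
                                        7, 6, 5, 4, 3, 2, 1, 0};
        return vqtbl1q_u8(x, vld1q_u8(rev));
    }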

    1. Yes it does, e.g., vqtbl1q_u8. My mistake.

  6. Derek Ledbetter says:

    Your GitHub code has a mistake. The `vld4q_u8` intrinsic will interleave the words. For instance, the first vector will have characters { 0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51 }. The correct method is to use `vld1q_u8` four times, incrementing the address by 16 each time.
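
    A sketch of the fix (function and pointer names are illustrative): vld1q_u8 reads 16 consecutive bytes in order, with no de-interleaving.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Load 64 consecutive bytes as four 16-byte vectors, in order. */
    static inline void load64(const uint8_t *p, uint8x16_t v[4]) {
        v[0] = vld1q_u8(p);
        v[1] = vld1q_u8(p + 16);
        v[2] = vld1q_u8(p + 32);
        v[3] = vld1q_u8(p + 48);
    }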

    1. Good catch!