Here’s a branchless NEON version using the VTBL1 instruction: https://gist.github.com/mmozeiko/be2e8afdf5a0a82b7dbdc8f013abfb5f
On my Raspberry Pi 3 (AArch64, Cortex-A53), a benchmark compiled with clang 4.0.1 shows that this variant is a tiny bit faster than neon_despace:
despace(buffer, N) : 6.81 ns per operation
neon_despace(buffer, N) : 4.15 ns per operation
neon_despace_branchless(buffer, N) : 4.09 ns per operation
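For readers who don’t want to click through, here is a rough sketch of the general idea behind such a table-driven kernel (my own reconstruction, not the code from the gist; the names build_table and despace_tbl_sketch are mine). It compacts 8 bytes at a time with vtbl1_u8 and a 256-entry shuffle table indexed by the whitespace mask:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t shuffle_table[256][8]; // one shuffle per 8-bit whitespace mask

// Call once at startup: entry [mask] lists the indices of the
// non-whitespace bytes, padded with 0xFF (vtbl maps out-of-range to 0).
static void build_table(void) {
    for (int mask = 0; mask < 256; mask++) {
        int pos = 0;
        for (int i = 0; i < 8; i++)
            if (!(mask & (1 << i))) // bit i set means byte i is whitespace
                shuffle_table[mask][pos++] = (uint8_t)i;
        while (pos < 8)
            shuffle_table[mask][pos++] = 0xFF;
    }
}

static size_t despace_tbl_sketch(uint8_t *bytes, size_t n) {
    static const uint8_t bit_per_lane[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    const uint8x8_t bits = vld1_u8(bit_per_lane);
    size_t i = 0, out = 0;
    for (; i + 8 <= n; i += 8) {
        uint8x8_t v = vld1_u8(bytes + i);
        // 0xFF in each lane holding a space, newline, or carriage return
        uint8x8_t ws = vorr_u8(vceq_u8(v, vdup_n_u8(' ')),
                       vorr_u8(vceq_u8(v, vdup_n_u8('\n')),
                               vceq_u8(v, vdup_n_u8('\r'))));
        uint8_t mask = vaddv_u8(vand_u8(ws, bits)); // one bit per whitespace byte
        uint8x8_t packed = vtbl1_u8(v, vld1_u8(shuffle_table[mask]));
        vst1_u8(bytes + out, packed); // overlapping stores are fine: out <= i
        out += 8 - (size_t)__builtin_popcount(mask);
    }
    for (; i < n; i++) { // scalar tail
        uint8_t c = bytes[i];
        if (c != ' ' && c != '\n' && c != '\r')
            bytes[out++] = c;
    }
    return out;
}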
Thanks. Currently, I get the following:
So it is not great.
Update: This was gcc. If I compile with clang, I get the following:
despace(buffer, N) : 1.40 ns per operation
neon_despace(buffer, N) : 1.04 ns per operation
neon_despace_branchless(buffer, N) : 1.01 ns per operation
I don’t think that the difference between 1.04 and 1.01 is actually significant, but it is still a nice result.
I updated the code to do only one VTBL1 instruction on AArch64 (based on information from Sam’s comment).
Now the code is noticeably faster than the “neon_despace” variant:
despace(buffer, N) : 6.65 ns per operation
neon_despace(buffer, N) : 4.00 ns per operation
neon_despace_branchless(buffer, N) : 3.54 ns per operation
But the result is incorrect. The table for vqtbl1q should be calculated in a different way.
Wow, impressive work; that’s blazing fast, even for an ARM. How did you figure this out?
Interesting post, and a good analysis of the number of cycles.
It would be very interesting to see a power comparison. After all, if a 95 W x86 takes 1.2 cycles to process something but a 20 W ARM takes 2.4 cycles, the x86 is eating huge amounts of power to achieve its result.
What would the table look like if we broke it down by the number of cycles and the performance per watt?
My guess is you’d find the ARM beating the x86 significantly.
There are 4.5 W Kaby Lake CPUs out there that have SSE4, so I don’t think you’re going to get much joy by normalising for power.
I think this is just an instance where x86 has an instruction that ARM is missing. It’s not really fundamental to the architecture of either CPU.
is_not_zero can be implemented in one instruction on arm64.
#include <arm_neon.h>
// uaddlv widens and sums all 16 bytes; the result is nonzero
// if and only if the vector contains a nonzero byte.
static inline uint16_t is_not_zero(uint8x16_t v) {
    return vaddlvq_u8(v);
}
Thanks. It seems that vaddlvq_u8 might be slower.
It’s useful when you need the count of 0xff bytes in a vector: it’s (vaddlvq_u8(v) >> 8) + 1.
Yes. Important observation.
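To spell out the arithmetic (my own sketch, assuming v is a comparison mask whose bytes are all 0x00 or 0xff): with c bytes set, vaddlvq_u8 returns 255c = 256c − c, so a right shift by 8 yields c − 1 for any 1 ≤ c ≤ 16, and the +1 restores c. Note that c = 0 also maps to 1, so the formula presumes at least one set byte.

#include <arm_neon.h>
// Hypothetical helper: count of 0xff bytes in a mask vector.
// 255*c = 256*c - c, so (sum >> 8) = c - 1 for 1 <= c <= 16.
static inline uint16_t count_0xff(uint8x16_t mask) {
    return (uint16_t)((vaddlvq_u8(mask) >> 8) + 1); // assumes c >= 1
}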
Wouldn’t you prefer vaddlvq_u8(vorrq_u8(v,vdupq_n_u8(1)))? It is the same number of instructions and the movi call in a tight loop could get optimized away.
Depends on context. For example, here using vaddlvq_u8 improves performance a bit: https://gist.github.com/notorca/c731fc6a916849c3be4f4a8f55b3c583
Cortex A53 (pine64):
despace(buffer, N) : 3.48 ns per operation
neon_despace(buffer, N) : 2.11 ns per operation
neon_despace_branchless(buffer, N) : 2.04 ns per operation
neon_despace_branchless_my(buffer, N) : 1.83 ns per operation
iPhone SE:
pointer alignment = 32768 bytes
memcpy(tmpbuffer,buffer,N) : 0.093 ns per operation
despace(buffer, N) : 0.79 ns per operation
neon_despace(buffer, N) : 0.64 ns per operation
neon_despace_branchless(buffer, N) : 0.43 ns per operation
neon_despace_branchless_my(buffer, N) : 0.40 ns per operation
On the processor I am using, vqmovn_u64 (uqxtn) has a latency of 4 and a throughput of 1; meanwhile, uaddlv has a latency of 8 (over 8-bit values) and the same throughput. So vqmovn_u64 (uqxtn) is preferable, I think.
Most other operations, such as a shift by an immediate (ushr) or a bitwise OR (vorr), have a latency of 3. So something like vaddlvq_u8(vshrq_n_u8(v,7)) ends up having a latency of 8 + 3 = 11 cycles.
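The uqxtn-based variant being compared here is not shown in the thread; a minimal sketch of what it presumably looks like (the function name is mine):

#include <arm_neon.h>
// Saturating-narrow trick: uqxtn squashes each 64-bit lane to 32 bits,
// so the packed 64-bit result is nonzero iff v had any nonzero byte.
static inline uint64_t is_not_zero_uqxtn(uint8x16_t v) {
    uint32x2_t narrowed = vqmovn_u64(vreinterpretq_u64_u8(v));
    return vget_lane_u64(vreinterpret_u64_u32(narrowed), 0);
}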
Or vaddlvq_u8(vshrq_n_u8(v,7))?
Slight correction: in AArch64, the TBL instruction can shuffle 16 bytes at once; this is exposed in the intrinsics as vqtbl1q.
Yes it does, e.g., vqtbl1q_u8. My mistake.
Your GitHub code has a mistake. The `vld4q_u8` intrinsic performs a de-interleaving load: for instance, the first vector will hold bytes { 0, 4, 8, …, 60 } of the 64-byte block rather than the first 16 bytes. The correct method is to use `vld1q_u8` four times, incrementing the address by 16 each time.
Good catch!
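A quick illustration of the fix (a sketch; `load_block` and `p` are my names):

#include <arm_neon.h>
// vld4q_u8 would de-interleave, giving out[0] bytes 0, 4, 8, ..., 60.
// Four plain vld1q_u8 loads return the four consecutive 16-byte vectors.
static void load_block(const uint8_t *p, uint8x16_t out[4]) {
    out[0] = vld1q_u8(p);      // bytes 0..15
    out[1] = vld1q_u8(p + 16); // bytes 16..31
    out[2] = vld1q_u8(p + 32); // bytes 32..47
    out[3] = vld1q_u8(p + 48); // bytes 48..63
}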