The TBX instruction (as opposed to TBL) is designed for this purpose and saves you doing the merge operations.
I suppose you could try a bunch of VPSHUFBs on x86, but it’s not quite as efficient; it might still be enough to beat scalar code, perhaps? TBL4/TBX4 isn’t exactly fast on ARM, so the shuffles on x86 may have a chance…
Fast arbitrary 8-bit->8-bit mapping is nice, but I think only AVX512-VBMI can make it efficient.
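For the record, a full 256-entry lookup with AVX512-VBMI could look roughly like the sketch below (the function name is made up, and it assumes the 256-byte table sits contiguously in memory):

#include <immintrin.h>
#include <stdint.h>

// 256-entry byte lookup with AVX512-VBMI (sketch): VPERMI2B indexes 128 bytes
// at a time, so two of them plus a blend on the top index bit cover all 256 entries.
// Requires AVX512VBMI (plus AVX512BW for the byte mask/blend).
static __m512i lookup256_vbmi(const uint8_t lut[256], __m512i input) {
    __m512i lo0 = _mm512_loadu_si512(lut);         // entries 0x00..0x3f
    __m512i lo1 = _mm512_loadu_si512(lut + 64);    // entries 0x40..0x7f
    __m512i hi0 = _mm512_loadu_si512(lut + 128);   // entries 0x80..0xbf
    __m512i hi1 = _mm512_loadu_si512(lut + 192);   // entries 0xc0..0xff
    __m512i lo = _mm512_permutex2var_epi8(lo0, input, lo1); // uses low 7 index bits
    __m512i hi = _mm512_permutex2var_epi8(hi0, input, hi1);
    __mmask64 top = _mm512_movepi8_mask(input);    // bytes with the high bit set
    return _mm512_mask_blend_epi8(top, lo, hi);    // take hi where index >= 0x80
}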
In my tests, the TBX version is slower. Here is what I tried…
Am I misusing TBX…?
The first call should be TBL, not TBX. TBX’s destination register is read+modify, so there’d be a forced move since you still refer to ‘input’ later on.
Other than that, it looks fine, and I don’t know why it’d be slower.
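For instance, the corrected chain could look something like this (just a sketch: the function name is invented, and it assumes the 256-byte table is split into four uint8x16x4_t blocks of 64 entries each):

#include <arm_neon.h>

uint8x16_t simd_transform_chain(uint8x16x4_t * table, uint8x16_t input) {
    uint8x16_t t = vqtbl4q_u8(table[0], input);  // first lookup is TBL: no read of the destination, no forced move
    t = vqtbx4q_u8(t, table[1], veorq_u8(input, vdupq_n_u8(0x40)));  // TBX keeps t where the index is out of range
    t = vqtbx4q_u8(t, table[2], veorq_u8(input, vdupq_n_u8(0x80)));
    t = vqtbx4q_u8(t, table[3], veorq_u8(input, vdupq_n_u8(0xc0)));
    return t;
}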
I’m not familiar with the OoO behaviour on the A72. TBL and TBX are the same speed, but TBX does force a dependency chain. Usually doesn’t matter on OoO processors, because they can schedule it in parallel with the next loop iteration.
According to the ARM optimization manual, TBL4/TBX4 have a latency of 15 cycles, but no throughput info is given, so you really do want to try to get the instructions to run in parallel.
Maybe you could test with a combination of the two, e.g.:
uint8x16_t simd_transform16x2(uint8x16x4_t * table, uint8x16_t input) {
    // table[0..3] each hold 64 entries; table[k] covers indices 0x40*k .. 0x40*k + 0x3f
    uint8x16_t t1 = vqtbl4q_u8(table[0], input);                              // TBL: out-of-range indices give 0
    t1 = vqtbx4q_u8(t1, table[1], veorq_u8(input, vdupq_n_u8(0x40)));         // TBX: out-of-range indices keep t1
    uint8x16_t t2 = vqtbl4q_u8(table[2], veorq_u8(input, vdupq_n_u8(0x80)));
    t2 = vqtbx4q_u8(t2, table[3], veorq_u8(input, vdupq_n_u8(0xc0)));
    return vorrq_u8(t1, t2);                                                  // t1 covers inputs < 0x80, t2 covers inputs >= 0x80
}
At 15 cycles of latency per TBX4, that’s a full 60 cycles of latency for the whole chain. I don’t know what the OoO window is on the A72, but it is hard to imagine that you can totally hide such latency.
Even if you break it into two distinct parts… you are still left with 30 + 3 = 33 cycles of latency. That makes it really hard to go fast unless you have a crazily long instruction window…
(I’ll run more tests.)
Your new approach is better but still apparently slower than what I describe in my blog post…
On an Amazon instance, I get slightly better results (with a different compiler, however)… a 2% gain for your approach, and it appears to be a genuine gain.
This page states that the A72 has a 128-entry ROB, so a 60-cycle latency chain might be problematic? This page suggests that the throughput for TBL4 is 2 per clock (no clue on accuracy), which seems to suggest that you’d need 30(!) lookups in parallel to maximise throughput.
Your test results (thanks for posting them) seem to suggest that the latency is a problem. I suppose, since the compiler probably isn’t doing it, you could manually unroll the loop and interleave the TBL instructions to help the processor run stuff in parallel. Unrolling 4 times may be enough – it should achieve the same level of concurrency, but reduce the need for merging instructions.
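As a rough illustration (a sketch only; the driver loop and its name are assumptions, reusing the simd_transform16x2 routine from above):

#include <arm_neon.h>
#include <stddef.h>

// 4x-unrolled driver: four independent 16-byte blocks per iteration, so their
// TBL/TBX dependency chains can overlap in the out-of-order core.
// The scalar tail loop for the last (n % 64) bytes is omitted.
void transform_unrolled(uint8x16x4_t * table, uint8_t * data, size_t n) {
    for (size_t i = 0; i + 64 <= n; i += 64) {
        uint8x16_t b0 = vld1q_u8(data + i);
        uint8x16_t b1 = vld1q_u8(data + i + 16);
        uint8x16_t b2 = vld1q_u8(data + i + 32);
        uint8x16_t b3 = vld1q_u8(data + i + 48);
        vst1q_u8(data + i,      simd_transform16x2(table, b0));
        vst1q_u8(data + i + 16, simd_transform16x2(table, b1));
        vst1q_u8(data + i + 32, simd_transform16x2(table, b2));
        vst1q_u8(data + i + 48, simd_transform16x2(table, b3));
    }
}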
The version I present in my blog post with TBL4 does not have a 30-cycle latency. It can do the four TBL4s… somewhat in parallel… and then it needs to “OR” the results together (since there are four of them, this requires 3 ORs, but two of them can be done in parallel).
TBX4 can save us one bitwise OR, so it should be faster, at least in theory, because it reduces the instruction count. However, it comes at a cost: longer dependency chains.
Doing more lookups in parallel should help, but as you observe, a 60-cycle dependency chain is really hard to hide. So the question as to how useful TBX is compared to TBL remains open. The evidence so far suggests that TBX is only moderately useful.
I find it a bit hard to believe that the TBL4 implementations all have 2 per clock throughput. A TBL4 would need to read from 5 source registers (the 4 table registers and the control), so 2 per cycle means 10 vector reads per cycle, which is a huge amount – plus whatever other reads you do on other vector units at the same time. That’s more register read bandwidth than much bigger contemporary chips have.
Someone should test it…
The throughput definitely does seem weird. The official A72 optimization guide explicitly leaves those numbers out for AArch64, though it does specify 2 per clock for AArch32 operation (4x 64-bit registers)*. My guess is that the post just went with 2 per clock for all TBL/TBX instructions.
* Even if we interpret 4x 64-bit registers as 2x 128-bit, a VTBX4 in AArch32 would require 4 vector reads (2x source table + source indices + destination (I assume it needs to read the destination to blend in the bytes?)) per instruction, so doing 2x VTBX4 per clock would mean 8 reads/clock. Entirely possible that the guide is wrong, though.
Considering that ORR is a very fast operation, and that the cost of TBL4/TBX4 is so large, I don’t expect too much of a gain from TBX, but I imagine that there should be one if you can get it to parallelize well.
I think you have enough registers in NEON (64-bit) for the full 256->256 lookup: 32 128-bit registers. So your 16 table registers fit easily, and only a handful of extras are needed for temporaries, etc.
It’s slow just because it needs 2x as many tbl instructions, and those instructions dominate runtime. Indeed, it is approximately 2x as slow, as you’d expect.
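For comparison, a 128-entry (ASCII-only) table fits in two uint8x16x4_t blocks and needs only two lookups. A hypothetical sketch, where bytes >= 0x80 simply come out as zero:

#include <arm_neon.h>

uint8x16_t simd_transform_ascii128(uint8x16x4_t * table, uint8x16_t input) {
    uint8x16_t t = vqtbl4q_u8(table[0], input);   // entries 0x00..0x3f
    return vqtbx4q_u8(t, table[1],                // entries 0x40..0x7f
                      veorq_u8(input, vdupq_n_u8(0x40)));
}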
I think you are correct.
I was somehow under the impression that NEON had 16 registers, but aarch64 has 32 128-bit registers. This is more than I thought!
Yeah, both 32-bit and 64-bit ARM have 32 NEON registers, but in the 32-bit case they are only 64 bits wide.
So NEON is somewhat close to AVX in terms of total register space, at least if you just look at the ISA. Right?
That is, I can virtually glue together pairs of NEON 128-bit registers and make myself sixteen 256-bit registers “à la AMD”.
Yes, it is identical in the sense that they both have 512 bytes of register space: either 32 x 16 bytes or 16 x 32 bytes.
AVX-512 quadruples that (!!) to 2048 bytes: 32 x 64 bytes.
Well logically, yes. You can glue together any number of registers “in software” to create a longer “register”. It’s quite a bit different from what AMD did, though, in the sense that they did it in hardware. In software you need N instructions to execute one wider “meta-instruction” when you glue together N registers; in hardware, you only need 1: the expansion to N operations happens internally.
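To make the software-gluing point concrete, here is a minimal sketch (the type and function names are invented) of a 256-bit add emulated with two 128-bit NEON instructions:

#include <arm_neon.h>

// A "256-bit register" glued from two NEON 128-bit registers.
typedef struct { uint8x16_t lo, hi; } u8x32;

// One logical 256-bit add costs two real instructions in software; a hardware
// split (a la AMD) decodes a single instruction into two operations internally.
static inline u8x32 add_u8x32(u8x32 a, u8x32 b) {
    u8x32 r = { vaddq_u8(a.lo, b.lo), vaddq_u8(a.hi, b.hi) };
    return r;
}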
In many cases the hardware approach is much more efficient, since with software gluing you can run into front-end limitations from issuing so many instructions. This is a primary reason why CPU SIMD gets wider, rather than simply adding more EUs at the same width. That is, we have AVX-512 rather than simply 2x as many 256-bit units, even though 2x as many units is basically strictly more flexible; it is just too hard to keep all those units fed at the front end: the CPU would need to be very wide.
GPUs have taken the opposite approach, with ever-increasing numbers of smallish-width EUs, which now number in the 1000s on the fastest chips. They can do that because the whole execution model is quite different.
It’s worth noting that although we are calling it “à la AMD”, Intel used the same strategy for SSE and AVX: AVX-512 was the first time since MMX that they didn’t release the initial chips after a width expansion with half-width EUs.
I tried this on an M1 MacBook Air and got these values; transform and transformx2 seem to be identical most of the time, but I get a lot of variation between runs.
transform(map, values,volume) : 4083 ns total, 1.00 ns per input key
neon_transform(map, values,volume) : 1416 ns total, 0.35 ns per input key
neon_transformx(map, values,volume) : 1583 ns total, 0.39 ns per input key
neon_transformx2(map, values,volume) : 1417 ns total, 0.35 ns per input key
neon_transform_ascii(map, values,volume) : 625 ns total, 0.15 ns per input key
neon_transform_ascii64(map, values,volume) : 417 ns total, 0.10 ns per input key
neon_transform_nada(map, values,1000) : 7041 ns total, 7.04 ns per input key
neon_transform_nada(map, values,10000) : 70417 ns total, 7.04 ns per input key