Likely faster to just use the purpose-built vpshufbitqmb: https://godbolt.org/z/qssovhbcr

Blog post updated, thanks.
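For reference, a minimal C sketch of that route (untested; it requires AVX512-BITALG, and the function name and uint8_t index type are assumptions, not necessarily the signature used in the post): broadcast the word into every 64-bit lane, then vpshufbitqmb picks one bit per index byte directly into a mask register.

#include <immintrin.h>
#include <stdint.h>

// Bit i of the result is bit indexes[i] of w (indexes in 0..63).
uint64_t bit_shuffle_bitalg(uint64_t w, const uint8_t indexes[64]) {
    __m512i words = _mm512_set1_epi64((long long)w);            // w broadcast to all eight 64-bit lanes
    __m512i idx   = _mm512_loadu_si512((const void *)indexes);  // one index byte per output bit
    // vpshufbitqmb: mask bit j = bit (indexes[j] & 63) of the matching qword (all equal to w)
    return (uint64_t)_mm512_bitshuffle_epi64_mask(words, idx);
}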
Have you profiled bitshuffle? I remember trying to implement bit unpacking with it and it was MUCH slower than the AVX2 version of fastunpack.

I haven’t benchmarked it. What is certain is that we have far fewer instructions with it.
Keep in mind that this article is about arbitrarily shuffling one 64-bit integer.
If you’re handling more than one integer, and the permutation isn’t so arbitrary, there may be faster approaches.
And ARM does it in a single rbit instruction…
Who is the RISC again?
rbit only reverses bits. It doesn’t do an arbitrary bit shuffle. AArch64’s NEON would actually do a decent job (better than SSE4), but the instruction sequence would be much longer than what is achievable with AVX-512.
Having said that, an AVX2/NEON implementation would be interesting. It should be possible to right-shift the indexes by 3, shuffle the bytes into the right location, then use a TEST to amplify the relevant bits, and then extract them.
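To make that idea concrete, here is an untested AVX2 sketch along those lines (the function name and uint8_t index type are assumptions, and it substitutes an AND plus a byte compare for the TEST step):

#include <immintrin.h>
#include <stdint.h>

// Bit i of the result is bit indexes[i] of w (indexes in 0..63).
uint64_t bit_shuffle_avx2(uint64_t w, const uint8_t indexes[64]) {
    const __m256i src   = _mm256_set1_epi64x((long long)w);  // w in every 64-bit lane
    const __m256i seven = _mm256_set1_epi8(7);
    // vpshufb lookup table: position i maps to 1 << (i & 7); only slots 0..7 are ever used
    const __m256i bits  = _mm256_setr_epi8(
        1, 2, 4, 8, 16, 32, 64, (char)0x80, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 2, 4, 8, 16, 32, 64, (char)0x80, 0, 0, 0, 0, 0, 0, 0, 0);
    uint64_t result = 0;
    for (int half = 0; half < 2; half++) {
        __m256i idx     = _mm256_loadu_si256((const __m256i *)(indexes + 32 * half));
        // index >> 3 selects the byte of w that holds the wanted bit
        __m256i byteidx = _mm256_and_si256(_mm256_srli_epi16(idx, 3), seven);
        __m256i bytes   = _mm256_shuffle_epi8(src, byteidx);
        // 1 << (index & 7) singles out the wanted bit within that byte
        __m256i bitsel  = _mm256_shuffle_epi8(bits, _mm256_and_si256(idx, seven));
        __m256i hit     = _mm256_cmpeq_epi8(_mm256_and_si256(bytes, bitsel), bitsel);
        result |= (uint64_t)(uint32_t)_mm256_movemask_epi8(hit) << (32 * half);
    }
    return result;
}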
Laine Taffin Altman says:
Speaking of AArch64—how does SVE(2) fare at this? Might you be able to cleverly exploit BGRP/BDEP/BEXT for this, or just do something similar to AVX-512 using the predicate registers?

One fun issue is that SVE has variable length registers.
Ooo, wait, idea! Totally untested asm follows; should fit the same C signature as the rest:
mov x2, #0
mov d0, #0
loop:
whilelo p0.d, x2, #64
b.none loop_end
mov z1.d, p0/z, x0
ld1d z2.d, p0/z, [x1, x2]
lsr z1.d, p0/m, z2.d
and z1.d, p0/m, #1
lsl z1.d, p0/m, z2.d
orv d1, p0, z1
orr d0, d0, d1
incd x2
b loop
loop_end:
mov x0, d0
ret
Untested here as well, but my gut says that won’t work, because you’re shifting the bits back to their original position after singling them out. You could probably make it work if the lsl shifted the bits based on index (which needs to be incremented each loop cycle).
More problematic is that you’re only processing one bit per 64, which doesn’t exactly scream fast (not to mention that horizontal reductions like orv typically aren’t highly performant either).
The challenge with SVE2 is that the vector width isn’t fixed, whilst this is a fixed-width problem (a single 64-bit value).
You could take a similar approach to an SSE/NEON implementation, using a TBL plus a shift and then extracting the bits via the predicate. For vector widths >= 512 bits, you could also use a similar approach to what’s described in the first post (skipping the need to shift), though you’d need to implement two code paths with this technique.
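For the plain-NEON variant of that TBL-plus-shift idea (not the SVE predicate extraction), an untested C sketch might look like the following; the function name and uint8_t index type are assumptions, and the final pack uses an add-across-lanes trick since NEON has no movemask:

#include <arm_neon.h>
#include <stdint.h>

// Bit i of the result is bit indexes[i] of w (indexes in 0..63).
uint64_t bit_shuffle_neon(uint64_t w, const uint8_t indexes[64]) {
    static const uint8_t lut[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                    1, 2, 4, 8, 16, 32, 64, 128};
    const uint8x16_t src     = vreinterpretq_u8_u64(vdupq_n_u64(w)); // w in both 64-bit halves
    const uint8x16_t bit_lut = vld1q_u8(lut);                        // position i maps to 1 << (i & 7)
    uint64_t result = 0;
    for (int chunk = 0; chunk < 4; chunk++) {
        uint8x16_t idx     = vld1q_u8(indexes + 16 * chunk);
        uint8x16_t byteidx = vshrq_n_u8(idx, 3);                     // which byte of w holds the bit
        uint8x16_t bytes   = vqtbl1q_u8(src, byteidx);               // TBL gathers those bytes
        uint8x16_t bitsel  = vqtbl1q_u8(bit_lut, vandq_u8(idx, vdupq_n_u8(7)));
        uint8x16_t hit     = vtstq_u8(bytes, bitsel);                // 0xFF where the wanted bit is set
        // keep one distinct power of two per byte, then add across each 8-byte half to pack 16 bits
        uint8x16_t packed  = vandq_u8(hit, bit_lut);
        uint64_t lo = vaddv_u8(vget_low_u8(packed));
        uint64_t hi = vaddv_u8(vget_high_u8(packed));
        result |= (lo | (hi << 8)) << (16 * chunk);
    }
    return result;
}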
camel-cdr says:
I had a go at an RVV implementation. I’m not able to test it right now, but I think it’s roughly correct. Probably not optimal, though:
vsetivli x0, 1, e64, m1, ma, ta
vle8.v v0, (a0) # load uint64_t
vsetivli x0, 64, e16, m8, ma, ta
vle8.v v8, (a1) # load uint16_t[64]
vsetivli x0, 64, e8, m4, ma, ta
vmv.v.i v4, 1
vmv.v.i v4, 0, v0 # mask to 0/1 bytes
vrgathere16.vv v4, v4, v8 # gather does the shuffle
vmseq.vi v0, v4, 0 # 0/1 bytes to mask
vsetivli x0, 1, e64, m1, ma, ta
vse8.v v0, (a1) # store mask
camel-cdr says:
I forgot to mention that the above should work for any implementation with VLEN >= 128.
camel-cdr says:
Edit: It should’ve been vmerge.vim instead of the second vmv.v.i, because that one isn’t maskable.
MajorTom says:
I don’t know when, or why, but you just saved me hours upon hours of research in the future. Thanks in advance!