, 5 min read
Quickly pruning elements in SIMD vectors using the simdprune library
9 thoughts on “Quickly pruning elements in SIMD vectors using the simdprune library”
In the example in your project's README, shouldn't the zero vector 0,0,0,0,0,0,0,0 be a vector of ones, 1,…,1?
Yes. Fixed. Thank you.
There was a thread on Stack Overflow about packing left from a mask, and they recommended using PEXT to pull out the bits. Would that work here?
Yes. It can be made to work. It might be very useful for pruning bytes because the current solution, with a large table, is not ideal.
How to make it all come together for high efficiency is the tricky part.
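The table-based approach can be sketched in scalar C to show why the table grows so quickly. This is an illustrative assumption about how such lookup tables work, not simdprune's actual code. With 8 lanes, the mask has 8 bits, so the table needs 2^8 = 256 rows of 8 lane indices. The same scheme for 16 byte lanes would need 2^16 = 65,536 rows. At 16 bytes per row (one shuffle mask for a 16-byte vector), that is 1 MiB. The names `prune_table`, `build_prune_table` and `prune8_table` are hypothetical. The sketch also assumes that bit i set in the mask means "prune lane i".

```c
#include <stdint.h>

/* Hypothetical lookup table: row m lists, in order, the indices of the lanes
   that survive when bit i of m means "prune lane i" (assumed convention). */
static uint8_t prune_table[256][8];

/* Fill the table once at startup. */
static void build_prune_table(void) {
    for (int m = 0; m < 256; m++) {
        int k = 0;
        for (int lane = 0; lane < 8; lane++) {
            if (!(m & (1 << lane))) {
                prune_table[m][k++] = (uint8_t)lane;
            }
        }
        while (k < 8) {
            prune_table[m][k++] = 0; /* filler index for the unused trailing lanes */
        }
    }
}

/* Remove the lanes flagged in prune_mask and pack the survivors to the left.
   The loop over idx stands in for a single vector permute instruction.
   Returns the number of surviving lanes; trailing out[] lanes hold in[0]. */
static int prune8_table(const uint32_t in[8], uint8_t prune_mask, uint32_t out[8]) {
    const uint8_t *idx = prune_table[prune_mask];
    int kept = 0;
    for (int i = 0; i < 8; i++) {
        out[i] = in[idx[i]];
        if (!(prune_mask & (1u << i))) kept++;
    }
    return kept;
}
```

For example, with `in = {10, ..., 17}` and `prune_mask = 0xA5` (lanes 0, 2, 5 and 7 removed), the first four outputs are 11, 13, 14, 16.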
Here’s the thread: http://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask
They also mention VCOMPRESSPS for 32-bit values under AVX-512.
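The PEXT trick from that thread can be sketched portably. On real hardware, the two bit-manipulation steps are the BMI2 intrinsics `_pdep_u64` and `_pext_u64` (from `<immintrin.h>`), and the final step is a single AVX2 permute such as `_mm256_permutevar8x32_ps`. Here all three are emulated in plain C so the logic can be checked on any machine. The helper names are hypothetical. The sketch also assumes that bit i set means "prune lane i", so the mask is inverted to get the lanes to keep; the Stack Overflow version packs left the lanes whose mask bit is set.

```c
#include <stdint.h>

/* Scalar stand-in for BMI2 PEXT (_pext_u64): collect the bits of src at the
   positions set in mask and pack them into the low bits of the result. */
static uint64_t pext64_sw(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    int out_bit = 0;
    for (int bit = 0; bit < 64; bit++) {
        if (mask & (1ULL << bit)) {
            if (src & (1ULL << bit)) result |= 1ULL << out_bit;
            out_bit++;
        }
    }
    return result;
}

/* Scalar stand-in for BMI2 PDEP (_pdep_u64): scatter the low bits of src to
   the positions set in mask. */
static uint64_t pdep64_sw(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    int in_bit = 0;
    for (int bit = 0; bit < 64; bit++) {
        if (mask & (1ULL << bit)) {
            if (src & (1ULL << in_bit)) result |= 1ULL << bit;
            in_bit++;
        }
    }
    return result;
}

/* Byte i of the result is the index of the i-th lane whose bit is set in keep_mask. */
static uint64_t pack_left_indices(uint8_t keep_mask) {
    uint64_t expanded = pdep64_sw(keep_mask, 0x0101010101010101ULL); /* bit i -> bit 8*i */
    expanded *= 0xFF;                                /* bit 8*i -> byte i becomes 0xFF */
    const uint64_t identity = 0x0706050403020100ULL; /* byte i holds the value i */
    return pext64_sw(identity, expanded);            /* keep only the selected bytes */
}

/* Prune the lanes flagged in prune_mask (assumed convention) and pack the rest left.
   out[i] = in[index byte i] mimics an AVX2 permute; trailing lanes get in[0]. */
static int prune8_pext(const uint32_t in[8], uint8_t prune_mask, uint32_t out[8]) {
    uint8_t keep_mask = (uint8_t)~prune_mask;
    uint64_t indices = pack_left_indices(keep_mask);
    int kept = 0;
    for (int i = 0; i < 8; i++) {
        out[i] = in[(indices >> (8 * i)) & 0xFF];
        if (keep_mask & (1u << i)) kept++;
    }
    return kept;
}
```

No table is needed here. However, the steps form a serial chain (PDEP, then a multiply, then PEXT, then the permute), so each one waits on the previous result.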
I am aware of VCOMPRESS, and it is mentioned in the library's README. It is not very useful yet because none of us has access to hardware that supports it.
The BMI code is cool.
I have added the BMI approach for benchmarking purposes and, in my tests, it is slower. The BMI instructions can be nice, but they often have high latency, so if you chain them with data dependencies, the result is not always very efficient.
The Ryzen instruction tables just came out on Agner Fog's site, and PEXT/PDEP have a reciprocal throughput of 18 cycles. 🙁
@KWillets
That sounds bad. On the other hand, I am not sure that PEXT/PDEP are common in software, or even that they will become common.