AVX512_VBMI2 offers VPCOMPRESSB, so one can directly compress a 512-bit vector of packed bytes holding the values 0-63 under the control of a 64-bit mask. This can replace the unrolled instruction sequence above.
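To make the idea concrete, here is a minimal sketch in C, not the code from the post or from any library. The function name `decode_bitset64` and the lookup table `lane_ids` are made up for illustration. It assumes a GCC- or Clang-style compiler: the SIMD path is only compiled when AVX512_VBMI2 is enabled (e.g. `-mavx512vbmi2`), and otherwise a plain scalar loop gives the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__) && defined(__AVX512VBMI2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: writes the index of every set bit of `bits` to `out`
   in ascending order and returns the number of indexes written.
   `out` must have room for 64 bytes, because the SIMD path always stores a
   full vector; bytes past the returned count should be ignored. */
static size_t decode_bitset64(uint64_t bits, uint8_t out[64]) {
    assert(out != NULL);
#if defined(__AVX512F__) && defined(__AVX512VBMI2__)
    /* Byte lane i holds the value i. */
    static const uint8_t lane_ids[64] = {
         0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
        48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63};
    const __m512i indexes = _mm512_loadu_si512(lane_ids);
    /* VPCOMPRESSB: keep the lanes selected by the mask, packed to the front;
       the remaining lanes are zeroed. */
    const __m512i packed = _mm512_maskz_compress_epi8((__mmask64)bits, indexes);
    _mm512_storeu_si512(out, packed);
    /* GCC/Clang builtin (an assumption about the compiler). */
    return (size_t)__builtin_popcountll(bits);
#else
    /* Scalar fallback with the same contract. */
    size_t n = 0;
    for (unsigned i = 0; i < 64; i++) {
        if ((bits >> i) & 1) {
            out[n++] = (uint8_t)i;
        }
    }
    return n;
#endif
}
```

Called with the bitset 0b111010 (0x3A), it returns 4 and the first four bytes of `out` are 1, 3, 4, 5. The mask is the bitset itself, so no per-bit loop is needed on the SIMD path.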
Kim Walisch says:
I have implemented a modified version of the AVX512_VBMI2 bitset decoding algorithm in my primesieve project; it was partially inspired by Daniel’s previous blog posts on the same topic. The great thing about VPCOMPRESSB is that it significantly improves performance for sparse bit streams (with set bits distributed relatively evenly). For example, if at most 16 bits are set in the uint64_t bits variable, an algorithm using VPCOMPRESSB executes only about 1/4 of the instructions of the algorithm from this blog post. Here is a link to my AVX512_VBMI2 algorithm: https://github.com/kimwalisch/primesieve/blob/9e4e5773f122f71520a9561282e41a78948e6c89/src/PrimeGenerator.cpp#L422
I think “the bitset 0b111010, you would generate the output 1,3,4,6.” should be “… 1,3,4,5”.
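This is easy to check with the usual scalar decoding loop. The sketch below is only an illustration: `decode_scalar` is a made-up name, and `__builtin_ctzll` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the positions of the set bits of `bits` to
   `out` (lowest first) and returns how many positions were written.
   `out` needs room for up to 64 bytes. */
static size_t decode_scalar(uint64_t bits, uint8_t *out) {
    assert(out != NULL);
    size_t n = 0;
    while (bits != 0) {
        out[n++] = (uint8_t)__builtin_ctzll(bits); /* index of lowest set bit */
        bits &= bits - 1;                          /* clear that bit */
    }
    return n;
}
```

For 0b111010 (0x3A) the set bits are at positions 1, 3, 4 and 5; bit 6 would correspond to the value 64, which is not part of 0b111010 = 58.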
Very interesting as always 👍