congratulation! your lookup tables are quite large for AVX2.
In TurboPFor:Integer Compression I’m using a using a 2k table instead of a 8k table and converting the mask with: _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(*(uint64_t *)(vecDecodeTable[byteA]))).

I’m also using popcount instead of a length table for advanceA/B.

This comes up often enough. For example, I expect we could speed up Roaring bitmaps.

Erling Andersensays:

An application.

Assume you want to find the union of some index sets. Also assume the union should be sorted. This relevant when doing a socalled symbolic Cholesky factorization.

Most likely you will have a bit vector that marks whether an element is in the union. If you have the bit vector it might be more efficient to generated the sorted union from that than doing a quicksort.

I know this is an older post and I hope no-one minds the late comment. First of all: thank you (and powturbo) so much for posting this! I was in the process of optimising a genetic algorithm and identified a part of it as the bottleneck, where I was iterating over the set bits of a lot of 64-bit integers. My original approach was pretty much what you described in your previous post (taking advantage of tzcnt). Replacing that with a version similar to what you outlined here single-handedly brought down the runtime for evaluating a single generation from 3.5 s to 0.6 s. I had to run multiple tests, comparing the results against my previous implementation, because I couldn’t believe how fast it was all of a sudden. So again: thank you very much!

Having looked at the code you posted on GitHub, I have one (probably naive) question: It seems like there, you’re explicitly aligning the lookup tables to 16-bytes and then use aligned loads to read 256-bit vectors from the table. I was under the impression that unaligned loads required 32-byte alignments but I’m really not that experienced in that area and was curious where my misunderstanding lies.

powturbosays:congratulation! your lookup tables are quite large for AVX2.

In TurboPFor:Integer Compression I’m using a using a 2k table instead of a 8k table and converting the mask with: _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(*(uint64_t *)(vecDecodeTable[byteA]))).

I’m also using popcount instead of a length table for advanceA/B.

Daniel Lemiresays:For comparison purposes, I have added “turbo” versions of the functions to my repository which follow your description.

wmusays:It’d be great if you checked performance for higher densities, like 0.75, 0.80, 0.99.

Daniel Lemiresays:I have added density 0.9 to the blog post. The results are very positive.

wmusays:Thanks, that’s really great.

Jeff Vienneausays:Thanks for sharing I love these types of articles.

I’m assuming the number of bitmap passes is large enough justfy to pulling the 8kB lookup table in cache for the use case?

What would this be used for?

Daniel Lemiresays:What would this be used for?This comes up often enough. For example, I expect we could speed up Roaring bitmaps.

Erling Andersensays:An application.

Assume you want to find the union of some index sets. Also assume the union should be sorted. This relevant when doing a socalled symbolic Cholesky factorization.

Most likely you will have a bit vector that marks whether an element is in the union. If you have the bit vector it might be more efficient to generated the sorted union from that than doing a quicksort.

Daniel Lemiresays:You are correct.

Kay Werndlisays:I know this is an older post and I hope no-one minds the late comment. First of all: thank you (and powturbo) so much for posting this! I was in the process of optimising a genetic algorithm and identified a part of it as the bottleneck, where I was iterating over the set bits of a lot of 64-bit integers. My original approach was pretty much what you described in your previous post (taking advantage of tzcnt). Replacing that with a version similar to what you outlined here single-handedly brought down the runtime for evaluating a single generation from 3.5 s to 0.6 s. I had to run multiple tests, comparing the results against my previous implementation, because I couldn’t believe how fast it was all of a sudden. So again: thank you very much!

Having looked at the code you posted on GitHub, I have one (probably naive) question: It seems like there, you’re explicitly aligning the lookup tables to 16-bytes and then use aligned loads to read 256-bit vectors from the table. I was under the impression that unaligned loads required 32-byte alignments but I’m really not that experienced in that area and was curious where my misunderstanding lies.

Daniel Lemiresays:You are correct. The alignment should match the vector width.

dist1llsays:Thanks for the great article, and sorry for necroposting.

Do I read correctly that the SIMD version’s output contains additional zeroes, unlike the compact representation of the naive implementation?

Daniel Lemiresays:The SIMD version writes the result out in blocks, you are correct.