Instead of
uint64_t t = bitset & -bitset;
bitset ^= t;
it should also be possible to write
bitset -= ((uint64_t)1) << r;
I should admit that while there are only two operations here, the performance may depend on the particular platform/compiler and is yet to be tested… 🙂
I think bitset ^= ((uint64_t)1) << r is better than -=, but otherwise, that’s how it is implemented in tantivy.
Just benched it against the x & -x trick. For a density of 1% over a bitset of 1M capacity, they are very close (around 15 ns per element on my laptop).
(bitset ^= 1u64 << r) was maybe a notch faster.
Travis Downs says:
Using the variable shift probably generates worse code in all cases on x86, although you might get a tie in benchmarking due to the overhead of callback() dominating. Before BMI (Haswell) variable shifts are somewhat slow (needing 3 uops), although they are better with BMI (shlx and friends) – but with BMI the whole two lines:
uint64_t t = bitset & -bitset;
bitset ^= t;
end up as a single BLSR instruction on compilers that recognize it. It would have been simpler to write the two lines above as the single line:
bitset = bitset & (bitset - 1);
Which is the usual idiom for “clearing the lowest bit” – although gcc was smart enough to compile both to the same form (and you don’t need the temporary and all that).
Paul Masurel says:
Thanks for the explanation!
I implemented both in Rust, and I am unsure whether the generated code ended up using a BLSR instruction. I’ll investigate what happened there.
The work done in my bench is small and inlinable so I am not too worried about the callback cost.
Paul Masurel says:
I had a look at the assembly code. Both versions compile to the same assembly. BLSR is indeed used.
Link to godbolt:
https://godbolt.org/g/sT24rm
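As a point of reference, here is a minimal sketch of the kind of loop being discussed, written with the bitset &= bitset - 1 idiom that compilers lower to BLSR on BMI targets. The function name and the callback are placeholders for illustration, not code from the post or from tantivy.

#include <cstdint>

// Sketch only: visit the positions of the set bits in a 64-bit word.
// 'callback' stands in for whatever per-index work the caller does.
template <typename Callback>
void visit_set_bits(uint64_t bitset, uint64_t base, Callback callback) {
  while (bitset != 0) {
    int r = __builtin_ctzll(bitset);  // index of the lowest set bit
    callback(base + r);
    bitset &= bitset - 1;             // clear the lowest set bit (BLSR)
  }
}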
Thanks for this. Whether it is faster or not, it’s still worth mentioning this alternative because I believe in some languages, like C#, while you can do
long t = bitset & -bitset;
it won’t let you do
ulong t = bitset & -bitset;
I would be surprised if
could not be achieved with something
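As an aside on the negation issue: by two’s-complement arithmetic -x == ~x + 1, so the lowest-set-bit mask can be written without applying unary minus to an unsigned value. A tiny C++ check of the identity (illustrative only; whether this is the best workaround in C# is another matter):

#include <cassert>
#include <cstdint>

int main() {
  uint64_t bitset = 0xF0u;
  // x & -x isolates the lowest set bit; ~x + 1 equals -x in two's complement.
  assert((bitset & (~bitset + 1)) == (bitset & (0 - bitset)));
  assert((bitset & (~bitset + 1)) == 0x10u);
  return 0;
}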
Depends on your use case. Some sets of data tend to group so that whatever the bits represent is all the same.
So you may find that there are situations where comparing to 0 or 0xFFFFFFFF tells you right away that the next 64 bits are all the same.
Can’t wait for 128-bit processors; this will work even faster.
There are optimizations to be had everywhere.
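If the data really does cluster like that, a per-word pre-check is cheap to add. A rough sketch under that assumption (the function and its name are hypothetical, and the all-ones fast path here simply enumerates the 64 indices):

#include <cstddef>
#include <cstdint>

// Sketch: skip 64-bit words that are uniformly 0, fast-path words that are
// uniformly 1, and fall back to the ctz loop for mixed words.
template <typename Callback>
void visit_set_bits_clustered(const uint64_t* words, size_t nwords, Callback callback) {
  for (size_t i = 0; i < nwords; ++i) {
    uint64_t w = words[i];
    if (w == 0) continue;                       // the next 64 bits are all clear
    if (w == ~UINT64_C(0)) {                    // the next 64 bits are all set
      for (int b = 0; b < 64; ++b) callback(i * 64 + b);
      continue;
    }
    while (w != 0) {                            // mixed word: usual ctz loop
      callback(i * 64 + __builtin_ctzll(w));
      w &= w - 1;
    }
  }
}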
Vitaly says:
Did you start blocking Russian IPs by any chance, or is it me moving to Russia from the Netherlands? I know for a fact LinkedIn is blocked in Russia, but I wonder why anyone would want to block this blog :/
I am not blocking anyone, but it is possible that my blog is blocked in some countries. My blog was blocked in China and South Korea some years ago.
I am unsure why anyone would block my blog. I am just a computer scientist.
My exact thoughts, just wanted to confirm. Thank you for your reply.
The difference is so big that it’s hard to believe those results. Did you test those routines functionally?
There is one other way to write this loop:
What happens in your code when zeros is 63? You could add a branch to check if bitset is equal to 1<<63. Since this will almost never happen, the branch is going to be free (and you only need one branch per word).
The difference is so big that it’s hard to believe those results. Did you test those routines functionally?
The fast version (ctz) has been benchmarked thoroughly and is in production code.
The other version was proposed by Wojciech. It is my first time benchmarking it.
I offer my source code; it is quite short, so you should be able to check whether I made a mistake or whether my results are robust. I encourage re-examination of my results. Please prove me wrong!
This code is sensitive to branch prediction so your results will vary depending on the data. I used random data with uniform density. You should expect slower processing with realistic data having non-uniform density.
Roman Leventov says:
Or, it could be split into two instructions: bitset >>= zeros; bitset >>= 1. I think the previous three operations, especially the callback (if it’s something non-trivial), have enough data parallelism to perform this extra operation with the same overall throughput.
Another possibility is that the inferior alternative performs worse than it could.
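For concreteness, here is a sketch of the shifting variant this sub-thread is about, with the two-step shift applied; it is a reconstruction for illustration, not the post’s exact code, and the callback is a placeholder. The single branch on bitset == (1ULL << 63) suggested above would be the alternative fix.

#include <cstdint>

// Sketch: shifting-based iteration over set bits. Splitting the final shift
// into two avoids the undefined 64-bit shift that a single
// 'bitset >>= zeros + 1' would perform when zeros == 63.
template <typename Callback>
void visit_set_bits_shifting(uint64_t bitset, uint64_t base, Callback callback) {
  uint64_t pos = base;
  while (bitset != 0) {
    int zeros = __builtin_ctzll(bitset);  // zeros below the next set bit
    pos += zeros;
    callback(pos);
    ++pos;
    bitset >>= zeros;
    bitset >>= 1;
  }
}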
Travis Downs says:
These results are not at all hard to believe, and in fact you can ballpark them from first principles: consider that the naive approach will take a branch misprediction on at least every set bit, which bounds its performance from above at about 20 cycles per set bit. When the density gets lower it also wastes a lot of time examining every unset bit.
The ctz approach never examines unset bits and just jumps immediately to the next set bit, and only involves about 4 instructions per bit (plus the callback overhead). For really dense bitmaps you can get close to 1 cycle per int!
The main cost as density gets lower is branch prediction on the inner (bitmap != 0) loop: you’ll almost always mispredict the exit iteration and you can pretty much model the performance exactly as 1 mispredict divided by the average number of set bits in a 64-bit bitmap (until density gets so low that most bitmaps are zero: then you approach 1 mispredict per set bit).
You can play some tricks to avoid the mispredicts, such as conditional moves rather than nested loops – but it’s all highly dependent on the bit frequency and typical patterns. I settled on using various strategies, branching and not, depending on the frequency.
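To put rough numbers on that model, using the ~20-cycle misprediction cost mentioned above (the exact penalty varies by microarchitecture): at 1% density a 64-bit word holds 0.64 set bits on average, so one mispredicted loop exit per word works out to roughly 20 / 0.64 ≈ 31 cycles per set bit; at 25% density (16 set bits per word) the same penalty amortizes to about 20 / 16 ≈ 1.25 cycles per set bit, in the same ballpark as the “close to 1 cycle per int” figure for dense bitmaps.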
The fastest way is to use the expand instructions (e.g. _mm256_maskz_expand_epi32) on AVX-512, or a shuffle/permute table for SSE/AVX2, and process multiple (4/8/16) integers at a time with SIMD. This technique is used in the TurboPFor (https://github.com/powturbo/TurboPFor) package.
Thanks for your comment. I got so-so results in the past from similar vectorized code (some of which was written by Nathan Kurz), but I think I will make a blog post out of it. Crediting you of course, as needed.
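For readers curious what the SIMD route looks like, below is a rough sketch using the AVX-512 compress instruction (vpcompressd) to materialize set-bit positions 16 lanes at a time; it illustrates the general mask-driven approach rather than the expand-based technique the comment describes or TurboPFor’s actual code, and the function name and buffer-slack convention are my own.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sketch (AVX-512F): write the positions of set bits in 'words' into 'out'.
// Caller must leave at least 15 entries of slack after the last real output,
// because each store writes a full 64-byte vector. Returns the count written.
size_t bitmap_to_positions_avx512(const uint64_t* words, size_t nwords, uint32_t* out) {
  uint32_t* out_start = out;
  const __m512i iota = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                        7, 6, 5, 4, 3, 2, 1, 0);
  for (size_t i = 0; i < nwords; ++i) {
    uint64_t w = words[i];
    for (int chunk = 0; chunk < 4; ++chunk) {  // four 16-bit mask chunks per 64-bit word
      __mmask16 m = (__mmask16)(w >> (16 * chunk));
      __m512i idx = _mm512_add_epi32(_mm512_set1_epi32((int)(i * 64 + chunk * 16)), iota);
      // vpcompressd packs the masked-in lanes (the set-bit positions) to the front.
      _mm512_storeu_si512(out, _mm512_maskz_compress_epi32(m, idx));
      out += _mm_popcnt_u32((uint32_t)m);
    }
  }
  return (size_t)(out - out_start);
}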
Christopher Chang says:
I have hundreds of loops of the following form in my code:
uint32_t variant_uidx = 0;
// Iterate over all variant_ct set bits in variant_include, which is a bitmap of type const uintptr_t*.
for (uint32_t variant_idx = 0; variant_idx < variant_ct; ++variant_idx, ++variant_uidx) {
  if (!IsSet(variant_include, variant_uidx)) {
    variant_uidx = FindFirst1BitFrom(variant_include, variant_uidx);
  }
  // ...do stuff...
}
IsSet() and FindFirst1BitFrom() are currently defined on Linux and OS X as follows:
inline uintptr_t IsSet(const uintptr_t* bitarr, uintptr_t idx) {
  return (bitarr[idx / kBitsPerWord] >> (idx % kBitsPerWord)) & 1;
}

uintptr_t FindFirst1BitFrom(const uintptr_t* bitarr, uintptr_t loc) {
  const uintptr_t* bitarr_iter = &(bitarr[loc / kBitsPerWord]);
  uintptr_t ulii = (*bitarr_iter) >> (loc % kBitsPerWord);
  if (ulii) {
    return loc + __builtin_ctzl(ulii);
  }
  // Scan forward one word at a time until a nonzero word is found.
  do {
    ulii = *(++bitarr_iter);
  } while (!ulii);
  return static_cast<uintptr_t>(bitarr_iter - bitarr) * kBitsPerWord + __builtin_ctzl(ulii);
}
I’ve been pretty happy with the performance of this idiom in both the common all-bits-are-1 case and the sparse case, but if you can find a better approach I’m all ears.