KWillets says:

For the lab, why not just use a table for the rightmost 8 bits or so? Something like

while (val) {
    lsb = table[val & 255];
    if (lsb < 9) {
        output lsb - 1;
        val >>= lsb;
    } else {
        val >>= 8;
    }
}

Basically, the table stores 1 + the position of the lowest set bit in the byte, with 9 for the 0x00 edge case (maybe some other test would work better for that). You only waste a loop iteration 1/256th of the time.
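In Java, for consistency with the rest of the thread, the byte-table idea might be sketched as follows. This is my own sketch, not KWillets's exact code; the class and method names are made up, and I track the shift offset explicitly:

```java
public class ByteTable {
    // TABLE[b] = 1 + index of the lowest set bit of byte b; 9 for b == 0
    static final int[] TABLE = new int[256];
    static {
        TABLE[0] = 9;
        for (int b = 1; b < 256; b++) {
            TABLE[b] = 1 + Integer.numberOfTrailingZeros(b);
        }
    }

    // Index of the lowest set bit of v, or -1 if v == 0.
    // Only one byte value in 256 is all zero, so the "wasted" branch is rare.
    static int lowestSetBit(long v) {
        int shift = 0;
        while (v != 0) {
            int lsb = TABLE[(int) (v & 255)];
            if (lsb < 9) {
                return shift + lsb - 1;
            }
            v >>>= 8;   // wasted iteration: the whole byte was zero
            shift += 8;
        }
        return -1;
    }
}
```

To enumerate all set bits rather than just the first, one would shift past the reported bit and repeat, as in the C sketch above.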
@KWillets
I understand that you are trying to improve the identification of the least significant bit. This is equivalent to improving the implementation of Java’s Long.numberOfTrailingZeros. I think if you can do this, you could become semi-famous.
I am suspicious about your memoization approach because a table made of 256 integers is quite large. Accessing such a table is likely slow compared to basic integer operations.
KWillets says:

Most library implementations stay away from table-based approaches for obvious reasons: the tables are large, and the first call will stall while the cache lines are read in. However, if you’ve determined that this function is going to be called heavily, then some judicious use of tables or enumerative switch statements is called for.
The actual table size is tiny since the values are 0-7 or so (on second thought I would handle 0x00 in code and not as an extra table value). A 16-way table would fit into 32 bits.
Another (long known) method I’ve used in C is simply to convert to float and mask/shift the exponent to get the high bit. I’m not sure how far you’re willing to go with this (SSE?), but hacky solutions are many.
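That float trick can be sketched as follows (my adaptation in Java rather than C). The isolation step v & -v matters: an isolated bit is a power of two and converts to float exactly, whereas a general 32-bit value may round and push the exponent up by one:

```java
public class FloatExponent {
    // Index of the lowest set bit of v, for v != 0, read off the
    // float exponent field (bits 23-30 of the IEEE 754 encoding).
    static int lowestSetBit(int v) {
        float f = (float) (v & -v);           // isolate the lowest set bit
        int bits = Float.floatToRawIntBits(f);
        return ((bits >>> 23) & 0xff) - 127;  // mask off the sign, unbias the exponent
    }
}
```

The exponent mask is needed because v & -v is negative for v = Integer.MIN_VALUE, which sets the sign bit of the float encoding.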
@KWillets
Can you spell out your approach again for me? I just don’t understand what you have in mind. Note that we work with 64-bit words. I suppose you can take your 64-bit words and iterate over them as 8 x 8-bit words but I fear this will introduce quite a bit of expensive branching.
If you have a look at my code (see https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2013/12/23), you will find a table-based approach that was reported to be faster than the bitCount approach but, at least on my Core i7, is slower. It is a tiny table, and the only catch is a 64-bit multiplication. Even so, it is slower than calling bitCount.
The smarter our processors get, the less memoization is useful. You are often better off computing cheap things from scratch rather than look them up… at least as long as there is no expensive branching involved.
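For reference, a tiny-table-plus-64-bit-multiplication scheme of the kind alluded to above is typically a De Bruijn lookup. This is my own sketch, not the code from the linked repository:

```java
public class DeBruijn {
    // A 64-bit De Bruijn constant: its top 6 bits are zero, and the top
    // 6 bits of (DEBRUIJN << i) differ for every i in 0..63, so they
    // index a 64-entry table that maps back to i.
    static final long DEBRUIJN = 0x022fdd63cc95386dL;
    static final int[] TABLE = new int[64];
    static {
        for (int i = 0; i < 64; i++) {
            TABLE[(int) ((DEBRUIJN << i) >>> 58)] = i;
        }
    }

    // Trailing-zero count: isolate the lowest bit (a power of two), so the
    // multiplication is just a left shift of the constant, then take the
    // top 6 bits as a table index.
    static int trailingZeros(long v) {
        if (v == 0) return 64;
        return TABLE[(int) (((v & -v) * DEBRUIJN) >>> 58)];
    }
}
```

The table occupies 64 ints, small enough to live in a single pair of cache lines, which is why this variant competes with bitCount at all.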
KWillets says:

This is the 4-bit-suffix version in C. The basic idea is to look up the rightmost 4 bits in a table to find the location of the least significant bit, then shift past it and iterate. At lower densities it will definitely suffer.

Also, I forgot a crucial variable in my earlier sketch :(.
void bits(unsigned v, int *out, int *nout) { /* unsigned so >> is a logical shift */
    /* lowest-order 1 positions for the 16 4-bit values, 2 bits each */
    int lsbs = (0<<0)  + (0<<2)  + (1<<4)  + (0<<6)
             + (2<<8)  + (0<<10) + (1<<12) + (0<<14)
             + (3<<16) + (0<<18) + (1<<20) + (0<<22)
             + (2<<24) + (0<<26) + (1<<28) + (0<<30);
    int shift = 0;
    /* caller initializes *nout to 0 */
    while (v) {
        int mask = v & 15;
        int lsb;
        if (mask) {
            lsb = (lsbs >> (mask << 1)) & 3;
            out[*nout] = shift + lsb;
            *nout += 1;
        } else {
            lsb = 3; /* all four bits zero: skip the whole nibble */
        }
        v >>= (lsb + 1);
        shift += (lsb + 1);
    }
}
I totally agree with Daniel. In fact, reading even from the L1 cache takes a couple of CPU cycles. At the same time, you can probably execute a couple of bit-counting intrinsics in a single CPU cycle.

I posted another well-known memoization approach online, and its performance is even more pathetic: it is up to 8 times slower than a naive approach in C++.

BTW, with the -march=native flag (again on a Core i7), the naive approach apparently outperforms everything else by a good margin (up to 30%):
https://github.com/searchivarius/BlogCode/tree/master/2013/12/StupidMemoryBitscan
PS: Sorry, it wasn’t the naive version, it was bitscan1; I looked at the wrong column in the output.
KWillets says:

I agree that the intrinsic is definitely faster; it’s just that the last time I dealt with this, I don’t think there was any consistent instruction support for it. So mine is just a suggestion for bare-bones processors.
@KWillets
I think that instructions such as x86’s BSR have been around for a long time (as far back as the Intel 386). Still, it is likely that if you go that far back, BSR was probably slow and the lack of smart superscalar execution and branch prediction might have made memoization worth it.
Jeff Plaisance says:

What JVM version did you use for your benchmark? I’m seeing that numberOfTrailingZeros is consistently faster than bitCount on Java 6u33, which is surprising to me since bitCount is branchless.
@Plaisance
The benchmarks were with Java 7 on a recent Intel Core i7 processor.
Thanks. The problem was that I wrote my own benchmark and was only testing the lowest bit in each long. After fixing it to test all the bits, I’m getting consistent results on both Java 6 and Java 7 that agree with yours.
Claus Larsen says:

I realize that this post is a few years old, but I stumbled upon it in my quest for a faster nextSetBit(int fromIndex) on the Java BitSet.
Actually, my quest was for a faster way to do:
for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
And I wanted to share my new way of doing this, which is about 4-5 times faster than the for loop above on a dense set, and about 3 times faster on sets with about 50% density. Results may vary; I have only done some simple testing.
interface BitSetCallback {
void nextSetBit(int i);
}
public void nextSetBitCallBack(BitSetCallback cb) {
    long[] words = this.words;
    int wordsInUse = words.length;
    int u = 0;
    while (u < wordsInUse) {
        long word = words[u];
        // realign to this word's base bit index; otherwise the counter
        // drifts whenever a word's high-order bits are all zero
        int indexCounter = u << 6;
        while (word != 0) {
            long idx = word & 1;
            while (idx == 0) {
                word >>>= 1;
                ++indexCounter;
                idx = word & 1;
            }
            cb.nextSetBit(indexCounter);
            word >>>= 1;
            ++indexCounter;
        }
        ++u;
    }
}
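For comparison, a common alternative (my own sketch, not Claus's code) clears the lowest set bit each iteration with word & (word - 1) and uses numberOfTrailingZeros to find its index, so the inner loop runs once per set bit rather than once per bit position:

```java
import java.util.function.IntConsumer;

public class WordScan {
    // Visit every set bit index in words[], in increasing order.
    static void forEachSetBit(long[] words, IntConsumer cb) {
        for (int u = 0; u < words.length; u++) {
            long word = words[u];
            while (word != 0) {
                cb.accept((u << 6) + Long.numberOfTrailingZeros(word));
                word &= word - 1;  // clear the lowest set bit
            }
        }
    }
}
```

Since numberOfTrailingZeros is intrinsified to a single instruction on modern JVMs, this tends to do well at low densities, where the shift-by-one loops above pay per bit position rather than per set bit.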