Mask and add is probably going to win in this situation, but aqrit’s comment (on GitHub) made me realize that the codes (and their sums) are small enough to fit into nybbles, so a single shift, an OR, and a mask (&) would get them into place for a multiply to add them all up. He uses an initial multiply to do that, but (x << 10 | x) & 0x33033 would do the same.
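A minimal sketch of that shift/OR/mask-then-multiply idea, assuming four 2-bit codes packed into one byte (which is what the 0x33033 mask implies); the method name and the 0x11011 multiplier are my own illustration, not aqrit’s exact constants:

// Sum the four 2-bit fields of a byte with one shift/OR/mask and one multiply.
static int sumFourCodes(int x) {
    // Spread the fields to nibble offsets 0, 1, 3 and 4 so each has room to grow.
    int spread = ((x << 10) | x) & 0x33033;
    // The multiply aligns one copy of every field at nibble offset 4 (bits 16-19);
    // the largest possible sum is 12, so no carries leak between nibbles.
    return ((spread * 0x11011) >>> 16) & 0xF;
}
// Example: sumFourCodes(0b11100100) == 3 + 2 + 1 + 0 == 6.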
Multiplies are good when the number of desired shifts and bitwise ORs/adds is high, but individual shifts should be considered otherwise.
The other fun thing about multiply is that it can give a prefix sum for regularly-spaced small operands. For example, if we have numbers in byte spacing, like 0x04030201, multiplying by 0x01010101 gives 0x0A060301, which is the (little-endian) prefix sum of the bytes. Again, this all relies on zero-padding to keep one byte from carrying into the next, but for small values it’s very handy.
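A small sketch of that prefix-sum multiply (the method name is mine):

// Prefix sums of the four bytes of a word via one multiply.
// x * 0x01010101 == x + (x << 8) + (x << 16) + (x << 24) (mod 2^32),
// so each output byte accumulates every input byte at or below it.
// Only valid while every running sum stays under 256 (no carries).
static int bytePrefixSums(int x) {
    return x * 0x01010101;
}
// Example: bytePrefixSums(0x04030201) == 0x0A060301, i.e. the running
// sums 1, 1+2, 1+2+3, 1+2+3+4 in little-endian byte order.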
Thomas Mueller says:
I’m missing the comparison with the various popcnt options, for example the 64-bit variant:
v = Long.bitCount(v & 0x5555555555555555L) +
(Long.bitCount(v & 0xaaaaaaaaaaaaaaaaL) << 1);
or the 8-bit variant with one popcnt:
v = Integer.bitCount(((v << 8) | v) & 0xaaff);
Yes. This is nice. I have added it (with credit to you) to the code.
Sorry… the 64-bit variant can be simplified to
v = Long.bitCount(v) + Long.bitCount(v & 0xaaaaaaaaaaaaaaaaL)
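A quick sketch of why the simplification holds: Long.bitCount(v) splits as bitCount(v & 0x5555…) + bitCount(v & 0xAAAA…), so adding bitCount(v & 0xAAAA…) once more gives the weight-2 bits their second count; the same weighting argument explains the one-popcnt 8-bit variant above. The test harness below is my own illustration:

// Illustrative check that the two 64-bit variants always agree.
public class CheckVariants {
    public static void main(String[] args) {
        java.util.Random r = new java.util.Random();
        for (int i = 0; i < 1_000_000; i++) {
            long v = r.nextLong();
            long weighted = Long.bitCount(v & 0x5555555555555555L)
                          + (Long.bitCount(v & 0xaaaaaaaaaaaaaaaaL) << 1);
            long simplified = Long.bitCount(v) + Long.bitCount(v & 0xaaaaaaaaaaaaaaaaL);
            if (weighted != simplified) throw new AssertionError(Long.toHexString(v));
        }
    }
}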
Nathan Kurz says:
I’m more positive about the “lookup” approach than you seem to be. You’re right that the “batch” approach is going to scale better in situations where there is a simple algorithmic transformation from key to value, but a table gives you the flexibility of arbitrary (or changing) mappings with surprisingly good performance.
I think the limitations of the compiled code conceal just how fast the lookups can be. On Skylake, you can sustain 2 lookups per cycle, so if we are loading one byte of input and then doing one table lookup per byte, the “actual” speed for a table lookup should be closer to 1.0 cycles per byte.
But since we can “batch” the reading of the input by reading more than one byte at a time and then extracting the individual keys, we can actually get well under 1 cycle per byte (2 table lookups per cycle, plus some smaller overhead for reading and extracting input).
Here’s what I see on Skylake for a 4096-byte input (cycles per byte, best and average):
bitsum_c_loop(input, count): 1.342 (best) 1.346 (avg)
bitsum_asm_loop(input, count): 1.007 (best) 1.011 (avg)
bitsum_asm_rotate(input, count): 0.827 (best) 0.839 (avg)
bitsum_asm_gather(input, count): 0.635 (best) 0.637 (avg)
To be fair, for this case where easy transformations from bit pattern key to lookup value exist, an “in-register” approach is going to be even faster. If you add “-march=native” to your compilation options, you can see the start of the effect of vectorization. The 64-bit popcnt() variant should be able to get down to 0.25 cycles per byte, and the AVX2 popcnt() analogue can go even lower (although I haven’t tested these two).
So yes, for this particular problem (and any problem where the transformation is easily vectorized), lookups are probably a dead end (unless you can make the key substantially wider than 8 bits?). But I’m generally more excited by the universality of the gather-based lookup. It’s quite fast, easy to understand, and far more flexible.
Here’s a thought exercise: in practice, is the L1 cache significantly different from an enormous vector register split into cacheline-sized lanes? Some of the slower AVX operations are already slower than the 4-cycle latency to L1, so I don’t think it’s just a matter of speed.
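For concreteness, a minimal sketch of the table-lookup approach Nathan describes, applied to the same task (summing the 2-bit fields of every input byte); the table and method names are illustrative, not his benchmarked code:

// Byte-indexed table: each entry is the sum of the four 2-bit fields
// of that byte value.
static final int[] PAIR_SUMS = new int[256];
static {
    for (int b = 0; b < 256; b++)
        PAIR_SUMS[b] = (b & 3) + ((b >>> 2) & 3) + ((b >>> 4) & 3) + ((b >>> 6) & 3);
}

static int bitsumLookup(byte[] input) {
    int total = 0;
    for (byte b : input)
        total += PAIR_SUMS[b & 0xFF]; // one table load per input byte
    return total;
}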
aqrit says:
minor improvements:
// batch
v = ((v >> 2) & 0x33333333) + (v & 0x33333333);
v = (v + (v >> 4)) & 0x0F0F0F0F;
// Thomas Mueller 8-bit variant
x = Integer.bitCount(((x << 8) | (x & 0xAA)));
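A sketch of how those two “batch” lines might slot into a complete routine, with a final multiply (per the prefix-sum observation above) folding the byte sums together; the method name and the final reduction are my own illustration:

// Sum all sixteen 2-bit fields of a 32-bit word.
static int bitsumBatch(int v) {
    v = ((v >> 2) & 0x33333333) + (v & 0x33333333); // pairs of 2-bit fields -> nibbles (max 6)
    v = (v + (v >> 4)) & 0x0F0F0F0F;                // pairs of nibbles -> bytes (max 12)
    return (v * 0x01010101) >>> 24;                 // the product's top byte holds the total
}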
Mack says:
Hi aqrit, do you have a solution that works on each 3 bits? For example:
0b101100101101 = 0b101 + 0b100 + 0b101 + 0b101 = 19
How would you calculate that?
Mack says:
Hi, how about summing each 3 bits? Example:
0b101100101101 = 0b101 + 0b100 + 0b101 + 0b101 = 19
Anyone know how to do it? Thanks
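For what it’s worth, the mask-and-add approach recommended in the first comment above extends directly to 3-bit fields. A minimal sketch for Mack’s 12-bit example (the method name and constants are mine):

// Sum the four 3-bit fields of a 12-bit value.
static int sumThreeBitFields(int v) {
    // Add adjacent 3-bit fields into 6-bit slots; the largest pair sum is 14,
    // which fits in 6 bits, so no carries cross slot boundaries.
    int pairs = (v & 0b000111000111) + ((v >> 3) & 0b000111000111);
    // Fold the two 6-bit slots together.
    return (pairs & 0x3F) + (pairs >> 6);
}
// Example: sumThreeBitFields(0b101100101101) == 5 + 4 + 5 + 5 == 19.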