Daniel Lemire's blog

Bitset decoding on Apple’s A12

13 thoughts on “Bitset decoding on Apple’s A12”

  1. aqrit says:

    “you have to reverse the bit order and use a ‘leading zero’ instruction”

    Would it be possible to isolate the lowest set bit? x & (-x)
    The lzcnt of a “1-hot” should be just as good as a tzcnt?
    One could then xor/sub the isolated bit with the source word to clear that bit.
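
    In rough C, the sequence described above might look like the following sketch (the function name and output layout are placeholders, and __builtin_clzll is assumed for the leading-zero count):

        #include <stdint.h>

        /* Sketch only: isolate the lowest set bit with x & -x, derive its
           position from a leading-zero count of the one-hot value, then clear
           the bit by subtracting it. */
        static int decode_naive(uint64_t bits, uint32_t *out) {
            int n = 0;
            while (bits != 0) {
                uint64_t lowest = bits & -bits;          // one-hot: only the lowest set bit survives
                out[n++] = 63 - __builtin_clzll(lowest); // clz of a one-hot value gives 63 - position
                bits -= lowest;                          // clear that bit; xor also works since it is set
            }
            return n;
        }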

    1. True: you do not need to reverse the bits; you can skin this cat some other way, but it is harder to do while saving instructions.

    2. Wilco says:

      Using tmp = (bits - tmp) & (tmp - bits); bits = bits - tmp; for finding and clearing the next set bit should be faster (2 cycle latency) than the most obvious sequence.

      1. This seems to generate three instructions…

         lowest = (bits - lowest) & (lowest - bits);
        

        So that’s 3 cycles of latency, no?

        Update: as pointed out by Travis, this has a total of 2 cycles of latency due to parallelism… but if we have anything else in the loop that updates bits, then we get to three cycles.

        1. Travis Downs says:

          The two subtractions are independent, so they can execute in parallel: total latency 2 cycles.

          1. But it is followed by bits = bits - lowest (at least as Wilco described it), and that depends on lowest.

            1. I guess the idea is that the compiler should be able to merge all of this into 3 instructions in total?

              1. Travis Downs says:

                I don’t think Wilco is imagining any merging (although x86 BMI does have a BLSI which does the x & -x in one instruction, latency 1 on Intel, but 2 on AMD).

                Yes, the chain from bits as input to bits as output is 3 cycles here (assuming no merging):

                lowest = (bits - lowest) & (lowest - bits);
                bits = bits - lowest;

                However the unrolled code in question is something like:

                lowest = (bits - lowest) & (lowest - bits);
                result[i] = tz(lowest);
                bits2 = bits - lowest;
                lowest = (bits2) & (lowest - bits);
                result[i+1] = tz(lowest);
                bits = bits2 - lowest;
                lowest = (bits) & (lowest - bits2);
                ...

                The dependency chain is only 2 cycles for each result. Essentially the bits = bits - lowest is both the end of one result and the first part of the next.

                1. That works. I think you can code it like so…

                    lowest = 0
                    for (...) {
                        // we make a 'copy' of lowest, but it should not be compiled as a copy
                        uint64_t tmp = lowest;
                        // the next two lines can execute at the same time
                        lowest = (lowest - bits);
                        bits = (bits - tmp);
                        // then we finish updating 'lowest', in a second cycle
                        lowest &= bits;
                        ... then use lowest to identify the bit location (with clz)
                    }
                  

                  It works, but at least on my Apple M1, it is not particularly fast.
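
                  Filled in, a self-contained version of that sketch might look as follows; the popcount-driven trip count and the 63 - clz position step are assumptions and may differ from the code that was actually benchmarked:

                      #include <stdint.h>

                      static int decode_chain(uint64_t bits, uint32_t *out) {
                          int n = __builtin_popcountll(bits); // one output per set bit (assumed driver)
                          uint64_t lowest = 0;
                          for (int i = 0; i < n; i++) {
                              uint64_t tmp = lowest;          // previous isolated bit
                              lowest = lowest - bits;         // these two subtractions are independent,
                              bits = bits - tmp;              // so they can issue in the same cycle
                              lowest &= bits;                 // lowest set bit of the updated word
                              out[i] = 63 - __builtin_clzll(lowest); // position via a leading-zero count
                          }
                          return n;
                      }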

                  1. Travis Downs says:

                    Something like that. I am surprised it performs poorly. You may have to check the assembly to ensure the generated code is as you expect.

                    1. I did not write that it performed poorly.

                      I posted the numbers at https://github.com/simdjson/simdjson/pull/1546

                      It seems to exactly match the “rbit/clz for every bit” routine in terms of performance.

                      I have an isolated benchmark with instrumentation at…
                      https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/05/03

                      It is the same instruction count (roughly). But they all seem to max out at 4 instructions/cycle on the M1/clang 12.
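
                      For reference, the “rbit/clz for every bit” baseline is roughly the classic loop below (a reconstruction, not the exact benchmark code); on AArch64, clang lowers __builtin_ctzll to rbit followed by clz:

                          #include <stdint.h>

                          static int decode_rbit_clz(uint64_t bits, uint32_t *out) {
                              int n = 0;
                              while (bits != 0) {
                                  out[n++] = (uint32_t)__builtin_ctzll(bits); // rbit + clz on AArch64
                                  bits &= bits - 1;                           // clear the lowest set bit
                              }
                              return n;
                          }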

  • Travis Downs says:

    I think at this point it is generally accepted that A12 has higher IPC than Skylake on most scalar code. High end Skylake models may still have higher overall single threaded performance than any A12 because they run at a higher frequency than the fastest A12s.

    Processor architects will tell you (even if you didn’t ask) that you can’t compare designs on the basis of IPC alone, because higher frequency designs may sacrifice IPC to get there, so you can’t really talk about an “A12 running at 5 GHz” because such a thing may not be possible (it would require a different design).

    To be certain, I would need to be able to measure the number of cycles elapsed directly in my code.

    A typical approach is to calibrate based on a measurement with a known timing. I usually use a series of dependent adds or a series of stores. Dependent adds take 1 cycle each, so by timing a long string of those you can calculate the CPU frequency. Stores are similar: they run at one per cycle (in L1) on recent Intel CPUs, but I think the A12 can do 2 per cycle! In any case, adds are probably easier: you’ll immediately see whether you get a reasonable result or not. In my experience this technique, done carefully, gives you a very accurate calibration (i.e., far better than 1%), able even to measure temperature-related drift in clock timing!
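
    As a rough illustration of that calibration idea (a sketch only, not an actual harness; the iteration count is arbitrary and the inline assembly is AArch64-specific, used solely to keep the compiler from collapsing the dependent chain):

        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>

        int main(void) {
            const uint64_t iterations = 100000000; // 8 dependent adds per iteration
            uint64_t x = 0;
            struct timespec start, end;
            clock_gettime(CLOCK_MONOTONIC, &start);
            for (uint64_t i = 0; i < iterations; i++) {
                // eight serially dependent adds; the loop counter and branch run in
                // parallel on a wide out-of-order core, so they do not stretch the chain
                __asm__ volatile(
                    "add %0, %0, #1\n\t" "add %0, %0, #1\n\t"
                    "add %0, %0, #1\n\t" "add %0, %0, #1\n\t"
                    "add %0, %0, #1\n\t" "add %0, %0, #1\n\t"
                    "add %0, %0, #1\n\t" "add %0, %0, #1\n\t"
                    : "+r"(x));
            }
            clock_gettime(CLOCK_MONOTONIC, &end);
            double seconds = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
            printf("chain result %llu\n", (unsigned long long)x); // keep x live
            // at roughly one add per cycle, adds per second approximates the clock frequency
            printf("estimated frequency: %.3f GHz\n", 8.0 * iterations / seconds * 1e-9);
            return 0;
        }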

    An iPad is a better target than an iPhone since it suffers less thermal throttling. Maybe put it in the freezer first :).

    Of course, I could get myself a top-notch Qualcomm or Samsung processor and install Linux on it, and I may end up doing just that.

    It would also be an interesting exercise, but my impression is that the A12 is way ahead of the pack here.

    1. Processor architects will tell you (even if you didn’t ask) that you can’t compare designs on the basis of IPC alone, because higher frequency designs may sacrifice IPC to get there, so you can’t really talk about an “A12 running at 5 GHz” because such a thing may not be possible (it would require a different design).

      Well… I wasn’t planning on building servers out of A12 chips… yet.

      A typical approach is to calibrate based on a measurement with a known timing.

      Yes. I’ll do so in a later revision of this test.

      An iPad is a better target than an iPhone since it suffers less thermal throttling. Maybe put it in the freezer first :).

      I doubt that my benchmark is intensive enough to run into thermal trouble to the point where a freezer is warranted. It lasts a very short time (but long enough to trigger the big cores).