Daniel Lemire's blog

Bitset decoding on Apple’s A12

13 thoughts on “Bitset decoding on Apple’s A12”

  1. aqrit says:

    “you have to reverse the bit order and use a ‘leading zero’ instruction”

    Would it be possible to isolate the lowest set bit? x & (-x)
    The lzcnt of a “1-hot” should be just as good as a tzcnt?
    One could then xor/sub the isolated bit with the source word to clear that bit.
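
    In rough C, the sequence described above might look like the following sketch (the function name and output layout are placeholders, and __builtin_clzll is assumed for the leading-zero count):

        #include <stdint.h>

        /* Sketch only: isolate the lowest set bit with x & -x, derive its
           position from a leading-zero count of the one-hot value, then clear
           the bit by subtracting it. */
        static int decode_naive(uint64_t bits, uint32_t *out) {
            int n = 0;
            while (bits != 0) {
                uint64_t lowest = bits & -bits;          // one-hot: only the lowest set bit survives
                out[n++] = 63 - __builtin_clzll(lowest); // clz of a one-hot value gives 63 - position
                bits -= lowest;                          // clear that bit; xor also works since it is set
            }
            return n;
        }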

    1. True: you do not need to reverse the bits; you can skin this cat some other way, but it is harder to do while saving instructions.

    2. Wilco says:

      Using tmp = (bits - tmp) & (tmp - bits); bits = bits - tmp; for finding and clearing the next set bit should be faster (2 cycle latency) than the most obvious sequence.

      1. This seems to generate three instructions…

         lowest = (bits - lowest) & (lowest - bits);
        

        So that’s 3 cycles of latency, no?

        Update: as pointed out by Travis, this has a total of 2 cycles of latency due to parallelism… but if we have anything else in the loop that updates bits, then we get to three cycles.

        1. Travis Downs says:

          The two subtractions are independent, so they can execute in parallel: total latency 2 cycles.

          1. But it is followed by bits = bits - lowest (at least as Wilco described it), and that depends on lowest.

            1. I guess the idea is that the compiler should be able to merge all of this into 3 instructions in total?

              1. Travis Downs says:

                I don’t think Wilco is imagining any merging (although x86 BMI does have a BLSI which does the x & -x in one instruction, latency 1 on Intel, but 2 on AMD).

                Yes, the chain from bits as input to bits as output is 3 cycles here (assuming no merging):

                lowest = (bits - lowest) & (lowest - bits);
                bits = bits - lowest;

                However the unrolled code in question is something like:

                lowest = (bits - lowest) & (lowest - bits);
                result[i] = tz(lowest);
                bits2 = bits - lowest;
                lowest = (bits2) & (lowest - bits);
                result[i+1] = tz(lowest);
                bits = bits2 - lowest;
                lowest = (bits) & (lowest - bits2);
                ...

                The dependency chain is only 2 cycles for each result. Essentially the bits = bits - lowest is both the end of one result and the first part of the next.

                1. That works. I think you can code it like so…

                    lowest = 0
                    for (...) {
                        // we make a 'copy' of lowest, but it should not be compiled as a copy
                        uint64_t tmp = lowest;
                        // the next two lines can execute at the same time
                        lowest = (lowest - bits);
                        bits = (bits - tmp);
                        // then we finish updating 'lowest', in a second cycle
                        lowest &= bits;
                        ... then use lowest to identify the bit location (with clz)
                    }
                  

                  It works, but at least on my Apple M1, it is not particularly fast.
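
                  Filled in, a self-contained version of that sketch might look as follows; the popcount-driven trip count and the 63 - clz position step are assumptions and may differ from the code that was actually benchmarked:

                      #include <stdint.h>

                      static int decode_chain(uint64_t bits, uint32_t *out) {
                          int n = __builtin_popcountll(bits); // one output per set bit (assumed driver)
                          uint64_t lowest = 0;
                          for (int i = 0; i < n; i++) {
                              uint64_t tmp = lowest;          // previous isolated bit
                              lowest = lowest - bits;         // these two subtractions are independent,
                              bits = bits - tmp;              // so they can issue in the same cycle
                              lowest &= bits;                 // lowest set bit of the updated word
                              out[i] = 63 - __builtin_clzll(lowest); // position via a leading-zero count
                          }
                          return n;
                      }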

                  1. Travis Downs says:

                    Something like that. I am surprised it performs poorly. You may have to check the assembly to ensure the generated code is as you expect.

                    1. I did not write that it performed poorly.

                      I posted the numbers at https://github.com/simdjson/simdjson/pull/1546

                      It seems to exactly match the “rbit/clz for every bit” routine in terms of performance.

                      I have an isolated benchmark with instrumentation at…
                      https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/05/03

                      It is the same instruction count (roughly). But they all seem to max out at 4 instructions/cycle on the M1/clang 12.
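
                      For reference, the “rbit/clz for every bit” baseline is roughly the classic loop below (a reconstruction, not the exact benchmark code); on AArch64, clang lowers __builtin_ctzll to rbit followed by clz:

                          #include <stdint.h>

                          static int decode_rbit_clz(uint64_t bits, uint32_t *out) {
                              int n = 0;
                              while (bits != 0) {
                                  out[n++] = (uint32_t)__builtin_ctzll(bits); // rbit + clz on AArch64
                                  bits &= bits - 1;                           // clear the lowest set bit
                              }
                              return n;
                          }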

  • Travis Downs says:

    I think at this point it is generally accepted that A12 has higher IPC than Skylake on most scalar code. High end Skylake models may still have higher overall single threaded performance than any A12 because they run at a higher frequency than the fastest A12s.

    Processor architects will tell you (even if you didn’t ask) that you can’t compare designs on the basis of IPC alone, because higher frequency designs may sacrifice IPC to get there, so you can’t really talk about an “A12 running at 5 GHz” because such a thing may not be possible (it would require a different design).

    To be certain, I would need to be able to measure the number of cycles elapsed directly in my code.

    A typical approach is to calibrate based on a measurement with a known timing. I usually use a series of dependent adds or a series of stores. Dependent adds take 1 cycle each, so by timing a long string of those you can calculate the CPU frequency. Stores are similar: they run at one per cycle (in L1) on recent Intel CPUs, but I think the A12 can do 2 per cycle! In any case, adds are probably easier: you’ll immediately see whether you get a reasonable result or not. In my experience this technique, done carefully, gives you a very accurate calibration (i.e., far better than 1%), able even to measure temperature-related drift in clock timing!
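
    As a rough illustration of that calibration idea (a sketch only, not an actual harness; the iteration count is arbitrary and the inline assembly is AArch64-specific, used solely to keep the compiler from collapsing the dependent chain):

        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>

        int main(void) {
            const uint64_t iterations = 100000000; // 8 dependent adds per iteration
            uint64_t x = 0;
            struct timespec start, end;
            clock_gettime(CLOCK_MONOTONIC, &start);
            for (uint64_t i = 0; i < iterations; i++) {
                // eight serially dependent adds; the loop counter and branch run in
                // parallel on a wide out-of-order core, so they do not stretch the chain
                __asm__ volatile(
                    "add %0, %0, #1\n\t" "add %0, %0, #1\n\t"
                    "add %0, %0, #1\n\t" "add %0, %0, #1\n\t"
                    "add %0, %0, #1\n\t" "add %0, %0, #1\n\t"
                    "add %0, %0, #1\n\t" "add %0, %0, #1\n\t"
                    : "+r"(x));
            }
            clock_gettime(CLOCK_MONOTONIC, &end);
            double seconds = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
            printf("chain result %llu\n", (unsigned long long)x); // keep x live
            // at roughly one add per cycle, adds per second approximates the clock frequency
            printf("estimated frequency: %.3f GHz\n", 8.0 * iterations / seconds * 1e-9);
            return 0;
        }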

    An iPad is a better target than an iPhone since it suffers less thermal throttling. Maybe put it in the freezer first :).

    Of course, I could get myself a top-notch Qualcomm or Samsung processor and install Linux on it, and I may end up doing just that.

    It would also be an interesting exercise, but my impression is that the A12 is way ahead of the pack here.

    1. Processor architects will tell you (even if you didn’t ask) that you can’t compare designs on the basis of IPC alone, because higher frequency designs may sacrifice IPC to get there, so you can’t really talk about an “A12 running at 5 GHz” because such a thing may not be possible (it would require a different design).

      Well… I wasn’t planning on building servers out of A12 chips… yet.

      A typical approach is to calibrate based on a measurement with a known timing.

      Yes. I’ll do so in a later revision of this test.

      An iPad is a better target than an iPhone since it suffers less thermal throttling. Maybe put it in the freezer first :).

      I doubt that my benchmark is intensive enough to run into thermal trouble to the point where a freezer is warranted. It lasts a very short time (but long enough to trigger the big cores).