Daniel Lemire's blog


Counting cycles and instructions on the Apple M1 processor

21 thoughts on “Counting cycles and instructions on the Apple M1 processor”

  1. Mohit says:

    I just tried it on my side on a MacBook Air M1, and I am getting way lower results for instructions/float (not sure what it means). I am running the latest version of macOS.

    parsing random numbers

    model: generate random numbers uniformly in the interval [0.000000,1.000000]
    volume: 10000 floats
    volume = 0.0762939 MB
    strtod    376.04 instructions/float (+/- 0.0 %)
               75.53 cycles/float (+/- 0.0 %)
                4.98 instructions/cycle
               88.95 branches/float (+/- 0.0 %)
                0.6005 mis. branches/float
    fastfloat 162.01 instructions/float (+/- 0.0 %)
               22.01 cycles/float (+/- 0.0 %)
                7.36 instructions/cycle
               38.00 branches/float (+/- 0.0 %)
                0.0001 mis. branches/float

    Thanks a lot for the post. Very interesting.

    1. I updated my blog post, my new numbers match your numbers. I had used a printout from an earlier version of my program.
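The instructions/cycle lines in the printout above can be rechecked from the two per-float figures. The snippet below is only a sanity-check sketch: the numbers are copied from the comment, while the dictionary layout and function name are illustrative and not part of the benchmark program.

```python
# Per-float figures copied from the benchmark printout in the comment above.
results = {
    "strtod": {"instructions": 376.04, "cycles": 75.53},
    "fastfloat": {"instructions": 162.01, "cycles": 22.01},
}


def instructions_per_cycle(instructions: float, cycles: float) -> float:
    """IPC is instructions per float divided by cycles per float."""
    return instructions / cycles


ipc = {
    name: instructions_per_cycle(r["instructions"], r["cycles"])
    for name, r in results.items()
}

for name, value in ipc.items():
    print(f"{name:10s} {value:.2f} instructions/cycle")

# How many times more cycles per float strtod needs compared with fastfloat.
cycle_ratio = results["strtod"]["cycles"] / results["fastfloat"]["cycles"]
print(f"strtod/fastfloat cycle ratio: {cycle_ratio:.2f}")
```

Rounded to two decimals, this reproduces the 4.98 and 7.36 instructions/cycle reported in the printout, and shows that strtod needs about 3.43 times as many cycles per float on this run.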

  2. Marshall Ward says:

    Do you know if the perf Linux tool works on the M1s (or any Mac)? It’s very easy to inspect performance monitors with perf.

    1. The perf Linux tools are tied to the Linux kernel as far as I know, so I would not expect them to work when running directly under macOS.

  3. Frank Astier says:

    A blog post I wrote some time back on CPU frequency scaling, but that was for a server: https://medium.com/@ferd/cpu-frequency-scaling-658ed502cba3.

    1. Frank Astier says:

      Showing effects of thermals.

  4. Pierre B. says:

    I find it strange that you characterize 7.36 instructions per cycle as “close to 8”. Maybe you forgot to change this sentence when you updated your numbers?

    (There is also a typo in “then the time elapsed in often not ideal”: I believe the “in” should be an “is”. Also, earlier, “it is right measure” seems to be missing a “the”.)

    1. Dougall Johnson says:

      For context, 8 is the absolute maximum possible number for any combination of instructions. Sure, 7.36 is closer to seven, but 92% is really amazingly and surprisingly close to 100% of possible IPC for any real-world code.
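The 92% figure follows directly from the measured IPC and the maximum of 8 stated in the comment above. As a quick check (the symbol IPC_max is just notation introduced here for that stated maximum):

```latex
\[
\frac{\mathrm{IPC}_{\text{fastfloat}}}{\mathrm{IPC}_{\max}} = \frac{7.36}{8} = 0.92,
\qquad
\frac{\mathrm{IPC}_{\text{strtod}}}{\mathrm{IPC}_{\max}} = \frac{4.98}{8} \approx 0.62
\]
```

By the same measure, strtod reaches roughly 62% of the peak on this benchmark.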

      1. Maynard Handley says:

        Also worth noting that what’s characterized as the “number of instructions” is, as far as I can tell, the number of DECODED instructions.
        This is not exactly the same thing as the number of RETIRED instructions because of misspeculation. (I haven’t done enough testing to be certain, but I am pretty sure that the counter setting (8c) increments at Decode, while the counter that’s locked as counter[1] is the number of Retired instructions.)

        Even putting speculation aside, the M1 does a fascinating job of splitting instructions for some purposes (primarily resource allocation where two registers are required like ldp, or a load or store with a pre/post increment) and then joining them again.
        So for example LDP will count as
        1 for Decode
        2 for Map/Rename (allocate two registers)
        1 for Execute
        2 for Retire (have to deallocate the two registers)

        Surprisingly many instructions can be performed at Map time (zero cycle moves, zero cycle immediates). A number of instructions that look like they would split (like ADDS) don’t because of a clever way of handling flags. A number of instructions that have to perform two tasks (like ADD(extend)) split into two ops, but only require one register allocation because the temporary that’s generated is snarfed off the bypass bus, and never written out.
        etc etc

        The community is still figuring out all the details, but like so much else in computing, the simple models people have of “number of instructions executed” are not appropriate when you look closely; you have to be much more careful about exactly what you are asking, and for what purpose.
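The LDP accounting in the comment above can be turned into a toy tally to show how the "instruction count" of one stream depends on the pipeline stage where it is counted. This is only an illustrative sketch: the LDP row comes from the comment, while the "simple" instruction (one count per stage), the stage key names, and the example stream are assumptions made for the illustration. It ignores speculation, zero-cycle moves, and every other effect mentioned in the comment.

```python
from collections import Counter

STAGES = ("decode", "map_rename", "execute", "retire")

# Per-stage counts for LDP, as listed in the comment above.
LDP = {"decode": 1, "map_rename": 2, "execute": 1, "retire": 2}

# ASSUMPTION (not from the comment): a hypothetical "simple" instruction
# that counts exactly once at every stage.
SIMPLE = {stage: 1 for stage in STAGES}

COSTS = {"ldp": LDP, "simple": SIMPLE}


def count_by_stage(stream):
    """Sum the per-stage counts over a sequence of instruction names."""
    totals = Counter()
    for op in stream:
        for stage in STAGES:
            totals[stage] += COSTS[op][stage]
    return dict(totals)


# Hypothetical stream: two simple instructions and two LDPs.
stream = ["simple", "ldp", "simple", "ldp"]
totals = count_by_stage(stream)

for stage in STAGES:
    print(f"{stage:10s} {totals[stage]}")
```

For this four-instruction stream the tally is 4 at decode and execute but 6 at map/rename and retire, so the same code yields different "instruction counts" depending on which stage a counter observes.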

        1. Thanks. I am aware that the number of instructions is not a precise phrase, especially if you have speculative execution and fused/split instructions.

          In my particular case, there is not much branch misprediction so it is not a good benchmark to test that effect.

          Accessing counter[1] seems to give me the same numbers (or very close).

  5. Dougall Johnson says:

    Great post – glad some of that code has been useful!

    If it’s of interest, these performance events (and the whitelist for this API) are described by Apple at https://github.com/apple/darwin-xnu/blob/main/osfmk/arm64/kpc.c

    Counters.app is the official way to access performance counters. I believe it can use a few more (non-whitelisted) events, which are described in /usr/share/kpep/a14.plist

    (And, for my own measurements, I use a kernel module to bypass the whitelist, which is even more likely to blow up the computer, and definitely not recommended: https://github.com/dougallj/applecpu/tree/main/timer-hacks )

    1. Laurent says:

      I’m surprised by the event numbers, they don’t match what the Arm Architecture Reference Manual lists (section D7.10).

      Are they doing some internal remapping (perhaps to match Intel numbers)?

  6. ibireme says:

    I’ve done some reverse engineering work on Xcode, kperf, and kperfdata, and wrapped the kpc APIs into some simple functions: https://github.com/ibireme/yybench/blob/master/src/yybench_perf.h

  7. Ignacio Castano says:

    This is pretty cool, but it doesn’t seem to work on the M1 Pro. Any idea what needs to be done to make it work? My MacBook returns 8 and 6 from kpc_get_counter_count and kpc_get_config_count respectively, but simply fixing those constants still causes kpc_get_thread_counters to fail (even with sudo).

    1. Maynard Handley says:

      Ignacio Castano, if you want you can look at my code at
      https://github.com/name99-org/AArch64-Explore
      and copy out the stuff that has to do with both wall-time recording and performance monitors. It definitely works on an MBA M1 and the most recent macOS.

      Conceivably details may have changed for the Pro, Max, and Ultra? But there’s been no chatter about that on Twitter and such.

  8. Mike Battaglia says:

    As of 2023, I can’t seem to get this to work on a 2021 M1 MBP with an M1 Max in it. I get the following (with sudo):

    wrong fixed counters count
    # parsing random numbers
    model: generate random numbers uniformly in the interval [0.000000,1.000000]
    volume: 10000 floats
    volume = 0.0762939 MB
    strtod    0.00 instructions/float (+/- nan %)
              0.05 cycles/float (+/- 95.9 %)
              0.00 instructions/cycle
              0.00 branches/float (+/- nan %)
              0.0000 mis. branches/float

    fastfloat 0.00 instructions/float (+/- nan %)
              0.04 cycles/float (+/- 64.5 %)
              0.00 instructions/cycle
              0.00 branches/float (+/- nan %)
              0.0000 mis. branches/float

    Note the “wrong fixed counters count” message. Is anyone else also getting this, and what is the cause?

    1. The code in this older blog post is only valid for the M1 processors.

  9. Mike Battaglia says:

    This blog post was written in 2021, and as I said above, this is a 2021 MacBook Pro with an M1 Max in it.

    When you say “M1 Processors”, does that not include M1 Max?

    1. The M1 Max was made available at the end of October 2021. The blog post you are responding to was published in March 2021. I am pretty sure that when this blog post was written, the existence of the M1 Max wasn’t known outside of Apple.

      We now have more complete code used in different projects. I will try to write a blog post about it.

      1. Mike Battaglia says:

        Ok, apologies for the confusion – if you do write a blog post, I would love to see how to get this up and running on M1 Max!