I just tried it on my side on a MacBook Air M1, and I am getting much lower results for instructions/float (I am not sure what that means). I am running the latest version of macOS.
parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
strtod     376.04 instructions/float (+/- 0.0 %)
           75.53 cycles/float (+/- 0.0 %)
           4.98 instructions/cycle
           88.95 branches/float (+/- 0.0 %)
           0.6005 mis. branches/float
fastfloat  162.01 instructions/float (+/- 0.0 %)
           22.01 cycles/float (+/- 0.0 %)
           7.36 instructions/cycle
           38.00 branches/float (+/- 0.0 %)
           0.0001 mis. branches/float
Thanks a lot for the post. Very interesting.
I updated my blog post; my new numbers now match yours. I had used a printout from an earlier version of my program.
Do you know if the perf Linux tool works on the M1s (or any Mac)? It’s very easy to inspect performance monitors with perf.
The perf Linux tools are tied to the Linux kernel as far as I know, so I would not expect them to work when running directly under macOS.
A blog post I wrote some time back on CPU frequency scaling, though that was for a server: https://medium.com/@ferd/cpu-frequency-scaling-658ed502cba3.
Showing effects of thermals.
I found it strange that you characterize 7.36 instructions per cycle as “close to 8”. Maybe you forgot to change this sentence when you updated your numbers?
(There is also a typo in “then the time elapsed in often not ideal”: I believe the “in” should be “is”. Also, earlier, “it is right measure” seems to be missing a “the”.)
Dougall Johnson says:
For context, 8 is the absolute maximum possible number for any combination of instructions. Sure, 7.36 is closer to seven, but 92% is really amazingly and surprisingly close to 100% of possible IPC for any real-world code.
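To make the arithmetic behind these figures explicit, here is a minimal sketch that recomputes instructions per cycle from the per-float averages printed earlier in this thread, and then expresses the result as a fraction of the 8 instructions/cycle maximum mentioned above. The function names and the PEAK_IPC constant are illustrative names chosen for this sketch; they do not come from the benchmark code.

```python
# Maximum instructions per cycle cited for the M1 in this thread.
PEAK_IPC = 8.0


def ipc(instructions_per_float: float, cycles_per_float: float) -> float:
    """Instructions per cycle, derived from the two per-float averages."""
    return instructions_per_float / cycles_per_float


def fraction_of_peak(measured_ipc: float, peak: float = PEAK_IPC) -> float:
    """Measured IPC as a fraction of the stated peak."""
    return measured_ipc / peak


# (instructions/float, cycles/float) copied from the output posted above.
results = {
    "strtod": (376.04, 75.53),
    "fastfloat": (162.01, 22.01),
}

for name, (instructions, cycles) in results.items():
    value = ipc(instructions, cycles)
    print(f"{name}: {value:.2f} instructions/cycle, "
          f"{fraction_of_peak(value):.0%} of peak")
```

With the numbers above this prints 4.98 instructions/cycle (62% of peak) for strtod and 7.36 instructions/cycle (92% of peak) for fastfloat, which is where the 92% figure comes from.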
Maynard Handley says:
Also worth noting that what’s characterized as the “number of instructions” is, as far as I can tell, the number of DECODED instructions.
This is not exactly the same thing as the number of RETIRED instructions because of misspeculation. (I haven’t done enough testing to be certain, but I am pretty sure that counter setting (8c) increments in Decode, while the counter that’s locked as counter[1] is the number of Retired instructions.)
Even putting speculation aside, the M1 does a fascinating job of splitting instructions for some purposes (primarily resource allocation where two registers are required like ldp, or a load or store with a pre/post increment) and then joining them again.
So for example LDP will count as
1 for Decode
2 for Map/Rename (allocate two registers)
1 for Execute
2 for Retire (have to deallocate the two registers)
Surprisingly many instructions can be performed at Map time (zero cycle moves, zero cycle immediates). A number of instructions that look like they would split (like ADDS) don’t because of a clever way of handling flags. A number of instructions that have to perform two tasks (like ADD(extend)) split into two ops, but only require one register allocation because the temporary that’s generated is snarfed off the bypass bus and never written out.
etc etc
The community is still figuring out all the details, but like so much else in computing, the simple models people have of “number of instructions executed” are not appropriate when you look closely; you have to be much more careful about exactly what you are asking, and for what purpose.
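As a rough illustration of why “number of instructions” depends on where you count, the sketch below sums per-stage counts over a toy instruction stream. Only the LDP row comes from the comment above. The ADD row (one count at every stage) and the instruction mix are assumptions made for the example, and real counters also include speculative work that this toy model ignores.

```python
from collections import Counter

STAGES = ("decode", "map", "execute", "retire")

# Per-stage counts. The LDP entry follows the comment above; the ADD entry
# is an assumed "plain" instruction that counts once at every stage.
STAGE_COUNTS = {
    "LDP": {"decode": 1, "map": 2, "execute": 1, "retire": 2},
    "ADD": {"decode": 1, "map": 1, "execute": 1, "retire": 1},  # assumed
}


def count_by_stage(stream):
    """Sum the per-stage counts over a sequence of mnemonics."""
    totals = Counter()
    for mnemonic in stream:
        for stage in STAGES:
            totals[stage] += STAGE_COUNTS[mnemonic][stage]
    return dict(totals)


# Hypothetical instruction mix, chosen only for illustration.
toy_stream = ["LDP", "ADD", "LDP", "ADD", "ADD"]
totals = count_by_stage(toy_stream)
for stage in STAGES:
    print(f"{stage:8s} {totals[stage]}")
```

For this five-instruction stream the model reports 5 at Decode and Execute but 7 at Map and Retire, so the same code yields a different “instruction count” depending on which stage the counter observes.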
Thanks. I am aware that “the number of instructions” is not a precise phrase, especially if you have speculative execution and fused/split instructions.
In my particular case, there is not much branch misprediction so it is not a good benchmark to test that effect.
Accessing counter[1] seems to give me the same numbers (or very close).
Dougall Johnson says:
Great post – glad some of that code has been useful!
If it’s of interest, these performance events (and the whitelist for this API) are described by Apple at https://github.com/apple/darwin-xnu/blob/main/osfmk/arm64/kpc.c
Counters.app is the official way to access performance counters. I believe it can use a few more (non-whitelisted) events, which are described in /usr/share/kpep/a14.plist
(And, for my own measurements, I use a kernel module to bypass the whitelist, which is even more likely to blow up the computer, and definitely not recommended: https://github.com/dougallj/applecpu/tree/main/timer-hacks )
I’m surprised by the event numbers; they don’t match what the Arm Architecture Reference Manual lists (section D7.10).
Are they doing some internal remapping (perhaps to match Intel numbers)?
I’ve done some reverse-engineering work on Xcode, kperf, and kperfdata, and wrapped the kpc APIs into some simple functions: https://github.com/ibireme/yybench/blob/master/src/yybench_perf.h
This is pretty cool, but doesn’t seem to work on the M1 Pro. Any idea what needs to be done to make it work? My macbook returns 8 and 6 from kpc_get_counter_count and kpc_get_config_count respectively, but simply fixing those constants still causes kpc_get_thread_counters to fail (even with sudo).
Maynard Handley says:
Ignacio Castano, if you want, you can look at my code at https://github.com/name99-org/AArch64-Explore and copy out the stuff that has to do with both wall-time recording and performance monitors. It definitely works on an MBA M1 and the most recent macOS.
Conceivably details may have changed for the Pro, Max, and Ultra? But there’s been no chatter about that on Twitter and such.
Mike Battaglia says:
As of 2023, I can’t seem to get this to work on a 2021 M1 MBP with an M1 Max in it. I get the following (with sudo):
wrong fixed counters count
# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
strtod     0.00 instructions/float (+/- nan %)
           0.05 cycles/float (+/- 95.9 %)
           0.00 instructions/cycle
           0.00 branches/float (+/- nan %)
           0.0000 mis. branches/float
fastfloat  0.00 instructions/float (+/- nan %)
           0.04 cycles/float (+/- 64.5 %)
           0.00 instructions/cycle
           0.00 branches/float (+/- nan %)
           0.0000 mis. branches/float
Note the “wrong fixed counters count” message. Is anyone else also getting this, and what is the cause?
The code in this older blog post is only valid for the M1 processors.
This blog post was written in 2021, and as I said above, this is a 2021 MacBook Pro with an M1 Max in it.
When you say “M1 Processors”, does that not include M1 Max?
The M1 Max was made available at the end of October 2021. The blog post you are responding to was published in March 2021. I am pretty sure that when this blog post was written, the existence of the M1 Max wasn’t known outside of Apple.
We now have more complete code used in different projects. I will try to write a blog post about it.
Mike Battaglia says:
Ok, apologies for the confusion – if you do write a blog post, I would love to see how to get this up and running on M1 Max!
I did: https://lemire.me/blog/2023/03/21/counting-cycles-and-instructions-on-arm-based-apple-systems/