, 2 min read
Counting cycles and instructions on ARM-based Apple systems
In my blog post Counting cycles and instructions on the Apple M1 processor, I showed how we could have access to “performance counters” to count how many cycles and instructions a given piece of code took on ARM-based mac systems. At the time, we only had access to one Apple processor, the remarkable M1. Shortly after, Apple came out with other ARM-based processors and my current laptop runs on the M2 processor. Sadly, my original code only works for the M1 processor.
Thanks to the reverse engineering work of ibireme, a software engineer, we can generalize the approach. We have further extended my original code so that it works under both Linux and on ARM-based macs. The code has benefited from contributions from Wojciech Muła and John Keiser.
For the most part, you setup a global event_collector instance, and then you surround the code you want to benchmark by collector.start() and collector.end(), pushing the results into an event_aggregate:
#include "performancecounters/event_counter.h"
event_collector collector;
void f() {
event_aggregate aggregate{};
for (size_t i = 0; i < repeat; i++) {
collector.start();
function(); // benchmark this function
event_count allocate_count = collector.end();
aggregate << allocate_count;
}
}
And then you can query the aggregate to get the average or best performance counters:
aggregate.elapsed_ns() // average time in ns
aggregate.instructions() // average # of instructions
aggregate.cycles() // average # of cycles
aggregate.best.elapsed_ns() // time in ns (best round)
aggregate.best.instructions() // # of instructions (best round)
aggregate.best.cycles() // # of cycles (best round)
I updated my original benchmark which records the cost of parsing floating-point numbers, comparing the fast_float library against the C function strtod:
# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
strtod 33.10 ns/float 428.06 instructions/float
75.32 cycles/float
5.68 instructions/cycle
fastfloat 9.53 ns/float 193.78 instructions/float
27.24 cycles/float
7.11 instructions/cycle