Daniel Lemire's blog


Measuring the system clock frequency using loops (Intel and ARM)

24 thoughts on “Measuring the system clock frequency using loops (Intel and ARM)”

  1. Travis Downs says:

    I don’t think the 1 vs 2 cycle throughput is related to fusion, but rather a limitation on taken branches per cycle. Probably those CPUs cannot do a branch more than once every two cycles or some similar limitation.

    You could test this by unrolling the loop, including the branch instruction, but with the branch not taken for all but the last copy (make it beq). I think you would get 1 cycle per sub/bne pair even on those CPUs once the loop no longer needs a taken branch every cycle.

    Note that Intel CPUs fuse such pairs but also cannot do a taken branch every cycle except in special cases of very small displacement backwards branches, so fusion is no guarantee of one branch per cycle.

    1. I verified your assertion and it appears correct. The limitation might be in the number of taken branches per cycle.
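A minimal sketch of the experiment suggested above, assuming GCC or Clang inline assembly on AArch64 or x86-64 (other targets fall back to a plain loop that does not test anything interesting). The helper name `time_unrolled_countdown` and the unroll factor of 8 are illustrative choices, not taken from the post. Seven of the eight decrement/branch pairs use a branch that is only taken when the counter reaches zero, so the loop has a single taken branch per eight pairs.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__aarch64__)
/* subs + b.eq: the branch is not taken until the counter hits zero. */
#define STEP_NOT_TAKEN "subs %0, %0, #1\n\tb.eq 2f\n\t"
/* subs + b.ne: the only branch that is taken on a normal iteration. */
#define STEP_TAKEN     "subs %0, %0, #1\n\tb.ne 1b\n\t"
#elif defined(__x86_64__)
#define STEP_NOT_TAKEN "sub $1, %0\n\tje 2f\n\t"
#define STEP_TAKEN     "sub $1, %0\n\tjne 1b\n\t"
#endif

/* Counts n down to zero with an 8x unrolled loop and returns the elapsed
   time in seconds. n must be positive. */
static double time_unrolled_countdown(uint64_t n) {
    assert(n > 0);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
#if defined(STEP_NOT_TAKEN)
    __asm__ volatile(
        "1:\n\t"
        STEP_NOT_TAKEN STEP_NOT_TAKEN STEP_NOT_TAKEN STEP_NOT_TAKEN
        STEP_NOT_TAKEN STEP_NOT_TAKEN STEP_NOT_TAKEN
        STEP_TAKEN
        "2:\n\t"
        : "+r"(n)
        :
        : "cc");
#else
    /* Fallback for other targets: an ordinary countdown loop. */
    for (volatile uint64_t i = n; i > 0; i--) { }
    n = 0;
#endif
    clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(n == 0); /* every exit path leaves the counter at zero */
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Dividing the initial count by the returned time gives decrement/branch pairs per second. If the hypothesis in this thread holds, that number should approach the clock frequency on cores where the fully rolled loop only manages one pair every two cycles.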

  2. DL says:

    Another way is macros. You can generate a pretty big block of assembly while maintaining readable code. I’ve got versions with and without the bne. Sketch of the no bne approach:

    #define TEN(x) x x x x x x x x x x
    #define HUNDRED(x) TEN(TEN(x))
    #define THOUSAND(x) TEN(HUNDRED(x))

    getFrequency() {
    //start timer here
    for (i=0; i<1000000; i++) {
    //<prolog here>
    THOUSAND(asm("add x0, x0, x0"););
    }
    //stop timer here
    }
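For reference, here is one way the macro sketch above might be fleshed out into something that compiles. It is only a sketch under stated assumptions: GCC or Clang inline assembly, a POSIX `clock_gettime`, and a core that retires one dependent add per cycle. The names `DEP_ADD`, `now_seconds` and `estimate_frequency_hz` are made up for this example, and the dependent add is written against a compiler-chosen register rather than `x0`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define TEN(x) x x x x x x x x x x
#define HUNDRED(x) TEN(TEN(x))
#define THOUSAND(x) TEN(HUNDRED(x))

/* One add that depends on the previous one, so the chain cannot overlap. */
#if defined(__aarch64__)
#define DEP_ADD(v) __asm__ volatile("add %0, %0, %0" : "+r"(v));
#elif defined(__x86_64__)
#define DEP_ADD(v) __asm__ volatile("add %0, %0" : "+r"(v));
#else
/* Fallback: plain C, which the compiler is free to simplify. */
#define DEP_ADD(v) v += v;
#endif

static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs iterations * 1000 dependent adds and returns adds per second,
   which approximates the clock frequency in Hz if each add takes one cycle. */
static double estimate_frequency_hz(uint64_t iterations) {
    assert(iterations > 0);
    uint64_t v = 1;
    double start = now_seconds();
    for (uint64_t i = 0; i < iterations; i++) {
        THOUSAND(DEP_ADD(v))
    }
    double elapsed = now_seconds() - start;
    (void)v;
    return (double)iterations * 1000.0 / elapsed;
}
```

Calling `estimate_frequency_hz(1000000)` runs a billion adds. The loop overhead is amortized over a thousand adds per iteration, so the taken-branch limit discussed in the first comment should not matter much here.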

  3. Saagar Jha says:

    Instruments provides access to a set of performance counters, including ones such as INST_A64 and FIXED_CYCLES: perhaps these could be useful?

    1. Does it run in a simulator or on the device? If it can run on the device, then I want to know more.

      1. Saagar Jha says:

        This is for a physical device (I tried this on my iPad, which has an A10 “Fusion” processor, but I don’t see why it wouldn’t work on an A12); out of the events I tried, I was able to pull information out of FIXED_INSTRUCTIONS and FIXED_CYCLES. To use this, you can connect your device to a Mac and launch Instruments, selecting the “Counters” template and your device (and profiling app). Then go to File > Recording Options and click the + in the “Events and Formulas” section, and pick the events you want to measure from there. You should then be able to record your app: in mine, for example, I set it to run a three-instruction loop a billion times and I ended up with a little bit over 3 billion instructions executed total. I’m sure it’s possible to get more accurate results, but I was having issues getting the recording to work correctly if I didn’t call UIApplicationMain, which added overhead. Maybe you can rig up something better?

        1. That’s interesting, but these two metrics are not very helpful if you are looking for the frequency, since I expect that neither of them tells you much about the actual frequency. Fixed cycle, I would guess, is just a measure of time elapsed. The instruction count is just… well, the instruction count.

          Now, if I had the real number of cycles, that’d be great. Combined with the instruction count, that gives me something useful. I need to look into it.

          1. Saagar Jha says:

            FIXED_CYCLES seems to vary with time in a similar manner to FIXED_INSTRUCTIONS, for what it’s worth, and telling Instruments to plot cycles per instruction seems to give something that looks reasonable.

          2. Travis Downs says:

            I think FIXED_CYCLES is CPU cycles, not a fixed real-time counter.

            Here “FIXED” refers to the fact that the event can be counted by a dedicated fixed-function counter, rather than a programmable one, and not that the cycle period (measured in time) is “fixed” or anything like that.

            1. Yeah. I am going to try to test it out.

              1. Ok. So FIXED_CYCLES varies over time and it seems to be highly correlated, visually, with the number of instructions per unit of time.

                Anyhow, FIXED_INSTRUCTIONS goes up to 29346274130 whereas FIXED_CYCLES is 9408002514, so that is 3.12 instructions per cycle. That’s for the whole program. It is much higher than on x64, where the highest you reach for part of the benchmark is 2.6 instructions per cycle.
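The arithmetic behind that figure can be written down as a small sketch. The two counter values are the ones quoted above; the helper names are made up for this example, and `average_frequency_hz` assumes you also have the elapsed wall-clock time for the same interval, which is not reported in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Instructions retired per core cycle over the measured interval. */
static double instructions_per_cycle(uint64_t instructions, uint64_t cycles) {
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}

/* Average clock frequency in Hz, given a cycle count and the wall-clock
   duration of the same interval (hypothetical input, not from the thread). */
static double average_frequency_hz(uint64_t cycles, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return (double)cycles / elapsed_seconds;
}
```

With the quoted counters, `instructions_per_cycle(29346274130ULL, 9408002514ULL)` comes out at roughly 3.12. If FIXED_CYCLES really counts core cycles, dividing it by the elapsed time would give the average frequency directly.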

                1. Saagar Jha says:

                  You may have found this already, but I thought I’d mention it anyway: you can create your “formulas” in the view where you add the event counters by clicking on the gear instead of the + button. You can just type in Instructions / Cycles if that’s what you were trying to measure, which lets the computer do the work for you instead of you having to do the calculation manually 🙂

                  1. What I’d like even better is a handy way to export the data to a spreadsheet. 🙂

                    1. Saagar Jha says:

                      Uh, I think you can copy/paste things out of Instruments and get tab-separated data. As for getting the raw data out, I’m not sure: you might be able to write a tool that links against some of the frameworks inside of the Instruments app bundle to extract these.

  • For future reference, your instructions work, but, importantly, you have to indicate that you want to sample by time. Otherwise, if you sample by event, where event is 1,000,000 cycles, then you just record nothing (which I am sure makes sense, but is not explained anywhere).

    1. Saagar Jha says:

      Yeah, I saw that too where sampling by event didn’t seem to produce anything. Mine defaulted to time though so I forgot to mention it.

  • On Intel processors, there is the TSC register and instruction … which is pretty damned important. It is not clear that current ARM CPUs have an equivalent. Worth spending a bit of thought on how the TSC can be (very) useful.

    1. Wilco says:

      AArch64 has cntvct_el0, a fixed-frequency counter (typically 50-100MHz) that is useful for accurate timing. Counters that vary with the rapidly changing clock frequency are less useful to software.

      1. TSC also runs at a fixed frequency on modern Intel CPUs.

        1. I can verify this. Used the TSC to collect ultra-precise timing measurements from a custom Linux device driver. The tick rate is CPU specific (1600 MHz on the target box), and very steady. Has proved extremely useful.
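As a rough illustration of the fixed-frequency counters mentioned in this thread, the sketch below reads cntvct_el0 on AArch64 or the TSC on x86-64 and estimates its tick rate against `CLOCK_MONOTONIC`. It assumes GCC or Clang and a POSIX system. The function names are made up for this example, and on other targets the "counter" is simply the monotonic clock in nanoseconds.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#endif

/* Reads a counter that ticks at a fixed rate, independent of core clock. */
static uint64_t read_fixed_counter(void) {
#if defined(__aarch64__)
    uint64_t v;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(v));
    return v;
#elif defined(__x86_64__)
    return __rdtsc();
#else
    /* Fallback: nanoseconds from the monotonic clock. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

static double monotonic_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Spins for about interval_seconds and returns counter ticks per second. */
static double calibrate_ticks_per_second(double interval_seconds) {
    assert(interval_seconds > 0.0);
    double t0 = monotonic_seconds();
    uint64_t c0 = read_fixed_counter();
    double t1;
    do {
        t1 = monotonic_seconds();
    } while (t1 - t0 < interval_seconds);
    uint64_t c1 = read_fixed_counter();
    return (double)(c1 - c0) / (t1 - t0);
}
```

On a machine like the one described above, `calibrate_ticks_per_second(0.05)` would be expected to land near the TSC tick rate. Note that this measures the counter's rate, not the core clock frequency.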

      2. Steve Canon says:

        That depends a lot on what you’re actually measuring, and for what purpose. If you’re benchmarking an inner compute kernel for tuning with no dependencies on L2 or beyond, you want to isolate that measurement from thermal variation, frequency transients, memory/cache contention, etc. Cycles is great for that purpose.

        For tuning or comparing bigger systems, wall clock time (like these fixed-frequency counters provide) is often more meaningful (but beware coherency of such measurements between cores if a process migrates or has multiple threads; what you can count on differs across platforms).

        1. Wilco says:

          Indeed, cycles are useful when you’re optimizing small kernels. I typically use performance counters to get more detail when trying to figure out what is limiting performance.

          But at the end of the day the goal is to reduce total time taken of a complete application.

    2. Travis Downs says:

      Yeah, but the TSC (and ARM equivalents, I think) all count in “real time”, not “CPU clock cycles”. That makes them directly useful for what most people want (real time measurement, time-stamping, etc.), but not directly useful for counting cycles. Still, they can serve as the real-time clock part of the calibration.

      I usually don’t bother because things like the std::chrono clocks and clock_gettime tend to use rdtsc under the covers so I just use the portable alternatives and get most of the rdtsc advantage.

      If you want to measure CPU clock cycles directly, you can on Intel, but it takes an rdpmc and you have to program the performance counters, so it’s definitely a level of difficulty up, and less portable (e.g., on Windows I still haven’t seen a way to access the performance counters without a kernel driver).

      1. And I am still waiting for someone to teach me how to set up performance counters on my iPhone. (For all I know, it is possible but kept secret…)