Great article.
In large part, the “stability” of results is only an issue if your test is long enough to experience the various external things that affect the results, such as frequency transitions, interrupts, context switches, another process being scheduled on the sibling hyperthread, etc.
Then the suggested “solution” is to run your test so many times that all this noise “averages out”. Much better is to remove these sources of influence as much as possible. Very short tests are a start: you keep each run short enough that it doesn’t (usually) receive any interrupt (and if it does, you detect it and throw out the result).
Here are the other things which reduce the noise:
1) Disable hyperthreading: unless you are specifically testing how some algorithm behaves when running on both hyperthreads of a core, disabling hyperthreading reduced noise by an order of magnitude or more for me (for longer tests): the OS will sometimes schedule something on the sibling thread of the one running the benchmark, which will start competing for all sorts of resources.
2) Disable turboboost. This is the biggest one and reduced test noise by two orders of magnitude. CPUs have max turbo ratios that depend on the number of cores currently running. This means that any activity on *any* core during your benchmark will suddenly change the speed of your benchmark: both because the frequency changes and because there is a short “dead time” where all cores stop running entirely while the voltage/current stabilizes across the CPU. This one is really important because it’s the main way that other activity on your system can interfere with what appears to be a “core local” benchmark.
3) Disable DVFS, aka “frequency scaling”. Even if TB is off, you’ll usually get sub-nominal frequency scaling to save power with the default settings on most OSes and CPUs. This one doesn’t have the same effect as TB, since there is no cross-core “max turbo ratio” effect, and in fact you can ignore it with a sufficiently long warmup (since the CPU will reach max frequency and stay there), but it removes another variable and avoids the need for a lengthy warmup period. If you ever see a weird performance graph where the performance of a function seems to steadily increase the more you run it, it’s often frequency scaling messing up the results!
4) Use performance counters with the rdpmc instruction to measure true cycles rather than wall time. This has the advantage of mostly being frequency independent, so even if you have some kind of frequency scaling you’ll get about the same number. It also lets you record only cycles spent in user-space if you want, avoiding counting system interrupts (although interrupts still perturb your results in other ways). It lets you get results in cycles directly, rather than trying to convert based on what you guess the CPU frequency is, and it avoids the timing routines, which may be more complex than necessary and may not be constant time (and their overhead is highly OS dependent). With this approach you can get reproducible results down to a single cycle, without looping your test at all! (A minimal sketch of counting cycles this way appears after this list.)
5) Various other things to “isolate” your CPU from the scheduler, such as the isolcpus kernel parameter, realtime priority, or tickless kernels (the user-space side of this is sketched after the list). The idea here is to prevent the scheduler from interrupting your process. I did this, but haven’t actually found it necessary in practice for most benchmarks, since as long as your test is very short the scheduler usually won’t interrupt it anyway (e.g., if your scheduler tick frequency is 250 Hz, a test that runs for 1 us is almost never interrupted).
6) Use “delta” measurement. The simple way to do this, if your test has a measurement loop and times the entire loop, is to run it for N iterations and for 2N iterations, and then use (run2 – run1)/N as the time. This cancels out all the fixed overhead, such as the clock/rdpmc call, the call to your benchmark method, and any loop setup overhead. It doesn’t cancel out the actual loop overhead though (e.g., the increment and end-of-loop check), even though that’s often small or zero. If you want to cancel that too, you actually need two separate loops with different numbers of calls (or inlined copies) of the code under test, and then apply the delta. It’s kind of risky to do that due to alignment effects, so the loop approach is usually fine. (A toy sketch of the delta approach also follows the list.)
7) Look at the “min” and “median” of repeated results. If you did everything right the min and median will usually be identical (or off by a few parts per million) +/- one cycle sometimes. Or just look at a histogram as Daniel suggests. It’s super obvious if your results are stable or not.
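To make point 4 concrete, here is a minimal sketch (my own illustration, not code from uarch-bench) that counts user-space CPU cycles around a region using Linux’s perf_event_open. For simplicity it reads the counter with read() rather than rdpmc, but the idea is the same: count cycles instead of wall time, and exclude kernel time.

    /* Sketch: count user-space CPU cycles around a region via perf_event_open.
     * May require a permissive perf_event_paranoid setting. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static int open_cycle_counter(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_kernel = 1;   /* count user-space cycles only */
        attr.exclude_hv = 1;
        /* pid = 0, cpu = -1: measure this thread on any CPU */
        return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void) {
        int fd = open_cycle_counter();
        if (fd < 0) { perror("perf_event_open"); return 1; }

        uint64_t before, after;
        read(fd, &before, sizeof(before));
        /* ... code under test goes here ... */
        read(fd, &after, sizeof(after));

        printf("cycles: %llu\n", (unsigned long long)(after - before));
        close(fd);
        return 0;
    }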
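The software side of point 5 can be as simple as pinning the benchmark thread to one core and asking for realtime priority (isolcpus and tickless settings are kernel boot parameters, so they don’t show up in code). A rough sketch, with core 2 as an arbitrary choice; the SCHED_FIFO call needs root or CAP_SYS_NICE:

    /* Sketch: pin this process to one core and request realtime priority so the
     * scheduler is less likely to migrate or preempt it. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                        /* arbitrary: pin to core 2 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        struct sched_param param = { .sched_priority = 1 };
        if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
            perror("sched_setscheduler");        /* needs privileges */

        /* ... run the benchmark here ... */
        return 0;
    }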
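And a toy version of the delta measurement from point 6, with a hypothetical code_under_test() standing in for whatever is being measured:

    /* Sketch: time the whole loop at N and 2N iterations; the difference divided
     * by N cancels fixed overhead such as the timer calls and call/setup cost. */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static uint64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    static void code_under_test(void) {
        __asm__ volatile("" ::: "memory");   /* barrier so the loop isn't removed */
    }

    static uint64_t time_loop(uint64_t iters) {
        uint64_t start = now_ns();
        for (uint64_t i = 0; i < iters; i++)
            code_under_test();
        return now_ns() - start;
    }

    int main(void) {
        const uint64_t N = 10 * 1000 * 1000;
        uint64_t run1 = time_loop(N);        /* N iterations  */
        uint64_t run2 = time_loop(2 * N);    /* 2N iterations */
        printf("%.3f ns per iteration\n", (double)(run2 - run1) / (double)N);
        return 0;
    }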
I use these approaches in uarch-bench:
https://github.com/travisdowns/uarch-bench
which is intended for short, very accurate tests of things that might reveal interesting architectural details (although I haven’t added a ton of tests yet) – but it can also just be used as a micro-benchmark framework to throw random tests into.
This is brilliant.
Hello Travis,
I agree with most of your suggestions — indeed Krun does 1–5 (inclusive). For very small benchmarks, which I think Daniel is also most interested in, I can well believe that this approach does give quite stable numbers. uarch-bench looks very interesting!
That said, we’re looking at benchmarks which, whether we want them to be or not, are too big to hope that we can isolate them from things like syscalls. That opens us up to a new world of pain and instability. For example, we can’t, in my opinion, use performance counters for two reasons. First, because users don’t perceive performance in terms of such counters: they can only perceive wall clock time. Second, because performance counters only measure userspace costs, we might provide a perverse incentive to move benchmark overhead into the kernel by abusing various features.
An open question in my mind is: what are all the potential sources of performance non-determinism in a system? I think we have something of a handle on the software sources (though I’m sure there are things I haven’t thought of), but, beyond a few obvious things (e.g. frequency scaling, caches), I have very little insight into what performance non-determinism modern processors might introduce.
Laurie
“That said, we’re looking at benchmarks which, whether we want them to be or not, are too big to hope that we can isolate them from things like syscalls. That opens us up to a new world of pain and instability.”
I think we need proper terminology. There are many reasons to “benchmark” and somehow, we end up with a single word encompassing various different activities. In my view, a microbenchmark is something you design so as to better understand software behavior. A microbenchmark, to me, is a “pure” and ideal benchmark, with as much of the complexity taken out as possible. If whatever I describe does not sound like a “microbenchmark” to you, then I am more than willing to use another term.
Hello Daniel,
Unfortunately microbenchmark is a heavily overloaded term already :/ I once tried to get some colleagues to use “picobenchmark” for very small benchmarks (i.e. a for loop with one statement inside), but failed. In retrospect, I’m not sure that name would have been hugely helpful; but, there again, I still don’t have any better ideas…
Laurie
Yeah, it’s horses for courses. Performance counters are a tool for (a) repeatedly measuring performance very exactly and (b) diagnosing performance problems.
The (b) use is the common one: you want to know what your bottlenecks are and performance counters can tell you. I don’t think that use is controversial, and it’s useful to have them in a benchmarking framework since people tend to stick stuff in there they want to make faster.
I guess we are mostly talking about (a) though. You are right that users don’t perceive performance in terms of performance counters, but in terms of wall-clock time. Sometimes, however, you just use performance counters as a high-precision proxy for wall-clock time. E.g., when tuning a very small loop you might want to immediately get cycle-accurate feedback on a single change. Doing this with a heavy wall-clock timer might introduce enough noise to make that difficult, but a cycle-accurate performance counter makes it easy.
Of course, you always need to go back and confirm that the performance counter was in fact a good proxy for wall-clock time! It would be unfortunate if you were busy optimizing your perf-counter-based metric and it turned out wall-clock time was going in a different direction. Luckily, metrics like cycles have a direct relationship to wall-clock time, so there aren’t many gotchas there.
Note that performance counters measure kernel time just fine (it’s configurable: you can measure U only, K only or both). Measuring only user time can be a “feature” here too: if you know your code is user-only you can remove some noise from interrupts by excluding K time, although getting rid of the interrupts would be even better.
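For instance (a rough sketch, again via Linux perf_event_open rather than raw rdpmc, with a getpid loop as a stand-in to force kernel entries): open the same cycle event twice, once excluding kernel time, and the gap between the two counts is the kernel-side cost.

    /* Sketch: compare user-only vs. user+kernel cycle counts around syscalls.
     * Counting kernel cycles may need privileges or a low perf_event_paranoid. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static int open_cycles(int exclude_kernel) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_kernel = exclude_kernel;   /* 1 = U only, 0 = U + K */
        return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    static uint64_t count(int fd) {
        uint64_t v = 0;
        read(fd, &v, sizeof(v));
        return v;
    }

    int main(void) {
        int u_only = open_cycles(1), u_and_k = open_cycles(0);
        uint64_t u0 = count(u_only), uk0 = count(u_and_k);
        for (int i = 0; i < 100000; i++)
            syscall(SYS_getpid);                /* force real kernel entries */
        uint64_t u1 = count(u_only), uk1 = count(u_and_k);
        printf("user-only: %llu cycles, user+kernel: %llu cycles\n",
               (unsigned long long)(u1 - u0), (unsigned long long)(uk1 - uk0));
        return 0;
    }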
About your last question, there aren’t _too_ many sources of non-determinism inside the CPUs themselves, at least in my experience, although what you consider inside vs. outside is up for debate (e.g., most interrupts are externally generated, but what about SMIs?). Many things seem non-deterministic only because they are incompletely modeled (e.g., you often see variance in caching performance from run to run, but this can just be caused by the varying virtual-to-physical mapping you get for your pages making L2 caching more or less effective – deterministic from the CPU’s point of view, but not from that of a process running under the OS).
So I think a practical list would probably have things that are truly non-deterministic, and also ones that might be deterministic if you could put your hardware into an exact initial state and avoid all external perturbation; but since you can’t do that (butterfly effect and all that), they end up looking non-deterministic.
In my experience the list above still isn’t too bad. The big ones are solved if you have consistent code and data alignment (including physical addresses, not just virtual), avoid hyperthreading and any contention for the core, keep your benchmark core-local, don’t experience any throttling conditions, and don’t get close to the ragged edge of any uarch resource (e.g., physical regs, ROB entries, LSD size, store buffers, etc, etc). Things then mostly work according to a reasonable model of processor performance.
Hi Professor Lemire.
Your benchmarking strategy (6) is very useful. It successfully stabilized my benchmark test results and it is so easy to implement. Thanks for your brilliant idea and posting it online.
Strategy 6 should be credited to Travis. Thanks for the good words.