Daniel Lemire's blog


Are your memory-bound benchmarking timings normally distributed?

14 thoughts on “Are your memory-bound benchmarking timings normally distributed?”

  1. Yes, this is why I generally use nonparametric statistical tests for my end-to-end performance benchmarks – e.g., Kolmogorov–Smirnov, Wasserstein Distance, etc.
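
    For illustration, here is a minimal sketch (Python with SciPy, on synthetic stand-in timings) of the kind of nonparametric comparison mentioned above; the sample sizes and distributions are made up.

    ```python
    # Compare two benchmark timing samples with nonparametric statistics.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    baseline = rng.lognormal(mean=np.log(100), sigma=0.05, size=1000)   # ns/op, synthetic
    candidate = rng.lognormal(mean=np.log(103), sigma=0.05, size=1000)  # ~3% slower, synthetic

    # Two-sample Kolmogorov-Smirnov test: are the two timing distributions different?
    ks = stats.ks_2samp(baseline, candidate)

    # Wasserstein (earth mover's) distance: how far apart are they, in timing units?
    w_dist = stats.wasserstein_distance(baseline, candidate)

    print(f"KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.3g}")
    print(f"Wasserstein distance = {w_dist:.2f} ns")
    ```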

  2. Hi Daniel – thanks for raising this issue and proposing a better metric. One way of thinking about this is that the normal distribution applies when the random errors are additive/subtractive, whereas for computer response times the errors are multiplicative. E.g., the system is slower, so everything takes 10% longer rather than a fixed extra 100 ms (or whatever), and that is the log normal model. I recently blogged about this – https://adrianco.medium.com/percentiles-dont-work-analyzing-the-distribution-of-response-times-for-web-services-ace36a6a2a19
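
    As a toy illustration of that multiplicative-error argument (all numbers made up): many small, independent slowdown factors multiply together, and the resulting timings are right-skewed while their logarithms are roughly symmetric.

    ```python
    # Multiplicative slowdowns produce approximately lognormal response times.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    base_ms = 100.0  # hypothetical "clean" response time

    # Each sample multiplies the base time by 30 independent slowdown factors
    # (each adding 0% to 30%); a product of many such factors is ~lognormal.
    factors = 1.0 + rng.uniform(0.0, 0.3, size=(10_000, 30))
    timings = base_ms * factors.prod(axis=1)

    # Raw timings are strongly right-skewed; their logs are nearly symmetric,
    # the signature of a lognormal rather than a normal distribution.
    print("skewness of raw timings:", stats.skew(timings))
    print("skewness of log timings:", stats.skew(np.log(timings)))
    ```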

  3. It’s odd that the min doesn’t seem to be commonly used as a summary statistic, e.g. for performance regression reports. It’s intuitively an asymptote for any real workload: on a given machine an operation can get arbitrarily slow for any number of reasons (scheduling, contention for resources, hot CPU, etc.), but it can only get so fast…

    At Hasura we get a histogram view of latencies, and we also do something I think is quite useful: we overlay a histogram built from just the first half of the samples taken (but with the counts doubled) on the histogram of the full set of samples. This lets you see whether the distribution is drifting over time, or suggests whether you should take more samples.
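
    A rough sketch of that overlay (the input file, bin count, and plotting details are placeholders):

    ```python
    # Overlay a histogram of the first half of the samples (counts doubled)
    # on a histogram of all samples, to spot drift while sampling.
    import numpy as np
    import matplotlib.pyplot as plt

    latencies = np.loadtxt("latencies.txt")  # hypothetical file of per-request timings
    half = latencies[: len(latencies) // 2]

    bins = np.histogram_bin_edges(latencies, bins=50)
    full_counts, _ = np.histogram(latencies, bins=bins)
    half_counts, _ = np.histogram(half, bins=bins)

    plt.stairs(full_counts, bins, fill=True, alpha=0.5, label="all samples")
    plt.stairs(2 * half_counts, bins, label="first half, counts doubled")
    plt.xlabel("latency")
    plt.ylabel("count")
    plt.legend()
    plt.show()
    # If the two outlines diverge, the distribution drifted during the run
    # (or you simply need more samples).
    ```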

  4. I once heard in a talk that you should use the median. (I think the talk was by Andrei Alexandrescu)

    With enough samples the median approaches the minimum, but it’s more robust to outliers.

    Why would the minimum be an outlier? Because CPUs are good at predicting things, so you want randomness to undo the prediction. But if you randomly get a fast run, or the CPU randomly predicts everything right, you get an unrealistic measurement. (See Emery Berger’s talk “performance matters” on why you want randomness)

    Why not use the average? Because you never actually measured that number. The talk “How not to measure latency” by Gil Tene gives lots of reasons for why you only want numbers that actually happened.

    1. Because CPUs are good at predicting things, so you want randomness to undo the prediction.

      Though I am sure it happens, I have never seen, in my work, the minimum be an outlier.

      If you run something in a loop, and the branch predictor does well in one run, it is likely to do well in other runs.

      1. I agree that the branch predictor is likely to do well in many runs if it does well in the minimum run. So the minimum won’t be an outlier because of the branch predictor.

        But I’ve had microbenchmarks that run fast, and when I make innocent changes and recompile, they run measurably slower. Why? Because they had randomly gotten a good layout, just like Emery Berger mentions in “performance matters”. It’s not a theoretical concern; it really has happened to me, and I have wasted lots of time trying to understand why seemingly similar things have measurable differences.

        So at some point I turned on ASLR and changed my benchmarking code to run the executable from scratch for every N iterations of the benchmark. And immediately the differences from before disappeared. The code that I expected to have the same performance actually did have the same performance. But this only works if I use the median measured time (which I had already done before). If I use the min instead, I’d find the one random run where ASLR gives an unusually good layout, and then I’d be confused again about why two similar things have different performance.
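
        A minimal sketch of that workflow, assuming a hypothetical benchmark binary that prints a single time (in ms) per invocation; each fresh process gets a fresh ASLR layout, and the median is the summary statistic:

        ```python
        # Re-run the benchmark from scratch several times and take the median,
        # so no single lucky memory layout dominates the result.
        import statistics
        import subprocess

        RUNS = 31
        times_ms = []
        for _ in range(RUNS):
            # Each invocation is a fresh process, hence a fresh address-space layout.
            out = subprocess.run(["./my_benchmark"], capture_output=True,
                                 text=True, check=True)
            times_ms.append(float(out.stdout.strip()))  # assumes one number on stdout

        print("median:", statistics.median(times_ms), "ms")
        print("min   :", min(times_ms), "ms  (may reflect one lucky layout)")
        ```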

  5. Mihai Preda says:

    A distribution that has a similar shape is the Gumbel distribution https://en.wikipedia.org/wiki/Gumbel_distribution
    which models the distribution of the maximum (or minimum) of a set of samples with some underlying distribution.

    IMO this “extreme value distribution” may naturally arise in some benchmarks — e.g. when the “task” that is measured lasts the duration of the longest of a number of subtasks, the duration of the subtasks being randomly distributed [according to some underlying distribution].

    Just wondering how you decided it’s a lognormal you have vs. a Gumbel, because if it’s by eye, they both look pretty similar.

    1. Just wondering how you decided it’s a lognormal

      My blog post says that in some cases, the log normal is a better model than the normal distribution (at a high level). My argument and methods do not rely on the exact nature of the distribution.
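
      For what it is worth, here is one way such a check could be sketched with SciPy, on synthetic stand-in timings: fit both candidate distributions and compare their log-likelihoods. As noted above, the post’s argument does not hinge on which one wins.

      ```python
      # Compare a lognormal fit against a Gumbel fit on a set of timings.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      timings = rng.lognormal(mean=np.log(100), sigma=0.1, size=2000)  # stand-in data

      lognorm_params = stats.lognorm.fit(timings, floc=0)
      gumbel_params = stats.gumbel_r.fit(timings)

      ll_lognorm = np.sum(stats.lognorm.logpdf(timings, *lognorm_params))
      ll_gumbel = np.sum(stats.gumbel_r.logpdf(timings, *gumbel_params))

      print("log-likelihood, lognormal:", ll_lognorm)
      print("log-likelihood, Gumbel   :", ll_gumbel)
      ```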

  6. Antoine says:

    The slide deck is very interesting.

    Question: your definition of a compute-bound benchmark is that the data fits in the CPU cache. When designing a micro-benchmark, would you distinguish between fitting in the L3 cache (which is typically shared between cores and may add benchmark noise) and fitting in the L1/L2 caches? Would you actively try to avoid the L3 (or, more generally, the last-level cache)?

    1. It is possible to design a routine that is bound by the speed of the L3 cache, while not memory-bound per se, but I don’t expect that to be a common occurrence.
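
      One way to explore the question above is to sweep the working-set size across the cache levels and time random accesses. This NumPy sketch is only illustrative (interpreter and gather overhead are significant, and the cache sizes are whatever your machine happens to have), but the transition out of cache usually still shows up as a jump in the per-access cost.

      ```python
      # Time random gathers over working sets of increasing size.
      import time
      import numpy as np

      rng = np.random.default_rng(0)
      for size_kib in (16, 64, 256, 1024, 4096, 16384, 65536):
          n = size_kib * 1024 // 8                  # number of 8-byte elements
          data = np.arange(n, dtype=np.int64)
          idx = rng.integers(0, n, size=1_000_000)  # random access pattern
          start = time.perf_counter()
          total = int(data[idx].sum())              # dominated by cache/memory behavior at large sizes
          elapsed = time.perf_counter() - start
          print(f"{size_kib:>6} KiB working set: {elapsed * 1e9 / len(idx):6.2f} ns/access")
      ```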

  7. Lars Clausen says:

    Why not use the median?

  8. I’ve also always learned that “the minimum is more sensitive to outliers”. However, I suppose there’s a difference between a microbenchmark and a larger benchmark: in a microbenchmark there’s naturally less variation, so the minimum is more “stable”, while in a larger benchmark you’re less likely to find the true minimum after a limited number of runs.

    I’d also add that I’ve always considered the goal of statistics to be reducing many numbers (e.g. thousands of individual measurements) to a few. However, trying to summarize performance results in just one or two numbers – whether it’s the minimum, median, or average & std dev – will always be too simplistic, especially because distributions are often asymmetric, so an average and a standard deviation don’t tell you much. Therefore I think it’s practically a necessity to use something like a box plot or violin plot (see the sketch after this thread). Then you automatically get the minimum, median, etc., and an overview of the underlying distribution, which is much more informative.

    1. I’ve also always learned that “the minimum is more sensitive to outliers”.

      Do you have an example in the context of a benchmark, where you are measuring the running time of a function and the minimum ends up being an outlier in the data?

      1. A simple example would be to benchmark random access in an unordered map. If you are really lucky, the random accesses will look up consecutive elements (or even the same element).
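
        A sketch of that scenario, with a plain Python dict standing in for the unordered map (sizes and counts are arbitrary): one lucky run whose random keys collapse onto a tiny, cache-resident set drags the minimum down, while the median is unaffected.

        ```python
        # One "lucky" run of mostly-hot keys makes the minimum an outlier.
        import random
        import statistics
        import time

        table = {i: 2 * i for i in range(1_000_000)}

        def one_run(keys):
            start = time.perf_counter()
            s = 0
            for k in keys:
                s += table[k]
            return time.perf_counter() - start

        runs = []
        for r in range(30):
            if r == 7:
                # The lucky run: keys drawn from a tiny range, so everything stays in cache.
                keys = [random.randrange(100) for _ in range(100_000)]
            else:
                keys = [random.randrange(1_000_000) for _ in range(100_000)]
            runs.append(one_run(keys))

        print("min   :", min(runs))                 # dominated by the lucky run
        print("median:", statistics.median(runs))   # robust to it
        ```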
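
    And, as referenced in the comment above, a sketch of the box-plot/violin-plot summary, using synthetic stand-in timings for two hypothetical variants:

    ```python
    # Show whole timing distributions per variant instead of one summary number.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    results = {  # synthetic stand-in timings, in ms
        "variant A": rng.lognormal(np.log(10.0), 0.08, size=500),
        "variant B": rng.lognormal(np.log(10.5), 0.15, size=500),
    }

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.boxplot(list(results.values()))
    ax1.set_xticklabels(results.keys())
    ax1.set_title("box plot")
    ax1.set_ylabel("time (ms)")
    ax2.violinplot(list(results.values()), showmedians=True)
    ax2.set_title("violin plot")
    plt.tight_layout()
    plt.show()
    ```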