Daniel Lemire's blog


Are your memory-bound benchmarking timings normally distributed?

14 thoughts on “Are your memory-bound benchmarking timings normally distributed?”

  1. Yes, this is why I generally use nonparametric statistical tests for my end-to-end performance benchmarks – e.g., Kolmogorov–Smirnov, Wasserstein Distance, etc.
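
    For illustration, here is a minimal sketch (Python with SciPy, on synthetic stand-in timings) of the kind of nonparametric comparison mentioned above; the sample sizes and distributions are made up.

    ```python
    # Compare two benchmark timing samples with nonparametric statistics.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    baseline = rng.lognormal(mean=np.log(100), sigma=0.05, size=1000)   # ns/op, synthetic
    candidate = rng.lognormal(mean=np.log(103), sigma=0.05, size=1000)  # ~3% slower, synthetic

    # Two-sample Kolmogorov-Smirnov test: are the two timing distributions different?
    ks = stats.ks_2samp(baseline, candidate)

    # Wasserstein (earth mover's) distance: how far apart are they, in timing units?
    w_dist = stats.wasserstein_distance(baseline, candidate)

    print(f"KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.3g}")
    print(f"Wasserstein distance = {w_dist:.2f} ns")
    ```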

  2. Hi Daniel – thanks for raising this issue and proposing a better metric. One way of thinking about this is that the normal distribution applies when the random errors are additive/subtractive, whereas for computer response times the errors are multiplicative. E.g., the system is slower, so everything takes 10% longer rather than a fixed extra 100 ms (or whatever), and that is the log normal model. I recently blogged about this – https://adrianco.medium.com/percentiles-dont-work-analyzing-the-distribution-of-response-times-for-web-services-ace36a6a2a19
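
    As a toy illustration of that multiplicative-error argument (all numbers made up): many small, independent slowdown factors multiply together, and the resulting timings are right-skewed while their logarithms are roughly symmetric.

    ```python
    # Multiplicative slowdowns produce approximately lognormal response times.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    base_ms = 100.0  # hypothetical "clean" response time

    # Each sample multiplies the base time by 30 independent slowdown factors
    # (each adding 0% to 30%); a product of many such factors is ~lognormal.
    factors = 1.0 + rng.uniform(0.0, 0.3, size=(10_000, 30))
    timings = base_ms * factors.prod(axis=1)

    # Raw timings are strongly right-skewed; their logs are nearly symmetric,
    # the signature of a lognormal rather than a normal distribution.
    print("skewness of raw timings:", stats.skew(timings))
    print("skewness of log timings:", stats.skew(np.log(timings)))
    ```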

  3. It’s odd that the min doesn’t seem to be commonly used as a summary statistic, e.g. for performance regression reports. It’s intuitively an asymptote for any real workload: on a given machine an operation can get arbitrarily slow for any number of reasons (scheduling, contention for resources, hot CPU, etc.), but it can only get so fast…

    At Hasura we get a histogram view of latencies, and we also do something I think is quite useful: we overlay a histogram built from just the first half of the samples taken (but with the counts doubled) on the histogram of the full set of samples. This lets you see whether the distribution is drifting over time, or suggests whether you should take more samples.
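
    A rough sketch of that overlay (the input file, bin count, and plotting details are placeholders):

    ```python
    # Overlay a histogram of the first half of the samples (counts doubled)
    # on a histogram of all samples, to spot drift while sampling.
    import numpy as np
    import matplotlib.pyplot as plt

    latencies = np.loadtxt("latencies.txt")  # hypothetical file of per-request timings
    half = latencies[: len(latencies) // 2]

    bins = np.histogram_bin_edges(latencies, bins=50)
    full_counts, _ = np.histogram(latencies, bins=bins)
    half_counts, _ = np.histogram(half, bins=bins)

    plt.stairs(full_counts, bins, fill=True, alpha=0.5, label="all samples")
    plt.stairs(2 * half_counts, bins, label="first half, counts doubled")
    plt.xlabel("latency")
    plt.ylabel("count")
    plt.legend()
    plt.show()
    # If the two outlines diverge, the distribution drifted during the run
    # (or you simply need more samples).
    ```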

  4. I once heard in a talk that you should use the median. (I think the talk was by Andrei Alexandrescu)

    With enough samples the median approaches the minimum, but it’s more robust to outliers.

    Why would the minimum be an outlier? Because CPUs are good at predicting things, so you want randomness to undo the prediction. But if you randomly get a fast run, or the CPU randomly predicts everything right, you get an unrealistic measurement. (See Emery Berger’s talk “performance matters” on why you want randomness)

    Why not use the average? Because you never actually measured that number. The talk “How not to measure latency” by Gil Tene gives lots of reasons for why you only want numbers that actually happened.

    1. Because CPUs are good at predicting things, so you want randomness to undo the prediction.

      Though I am sure it happens, I have never seen, in my work, the minimum be an outlier.

      If you run something in a loop, and the branch predictor does well in one run, it is likely to do well in other runs.

      1. I agree that the branch predictor is likely to do well in many runs if it does well in the minimum run. So the minimum won’t be an outlier because of the branch predictor.

        But I’ve had microbenchmarks that run fast, and when I make innocent changes and recompile, they run measurably slower. Why? Because they had randomly gotten a good layout, just like Emery Berger mentions in “performance matters”. It’s not a theoretical concern; it really has happened to me, and I have wasted lots of time trying to understand why seemingly similar things have measurable differences.

        So at some point I turned on ASLR and changed my benchmarking code to run the executable from scratch for every N iterations of the benchmark. And immediately the differences from before disappeared. The code that I expected to have the same performance actually did have the same performance. But this only works if I use the median measured time (which I had already done before). If I use the min instead, I’d find the one random run where ASLR gives an unusually good layout, and then I’d be confused again about why two similar things have different performance.
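
        A minimal sketch of that workflow, assuming a hypothetical benchmark binary that prints a single time (in ms) per invocation; each fresh process gets a fresh ASLR layout, and the median is the summary statistic:

        ```python
        # Re-run the benchmark from scratch several times and take the median,
        # so no single lucky memory layout dominates the result.
        import statistics
        import subprocess

        RUNS = 31
        times_ms = []
        for _ in range(RUNS):
            # Each invocation is a fresh process, hence a fresh address-space layout.
            out = subprocess.run(["./my_benchmark"], capture_output=True,
                                 text=True, check=True)
            times_ms.append(float(out.stdout.strip()))  # assumes one number on stdout

        print("median:", statistics.median(times_ms), "ms")
        print("min   :", min(times_ms), "ms  (may reflect one lucky layout)")
        ```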

  5. Mihai Preda says:

    A distribution that has a similar shape is the Gumbel distribution https://en.wikipedia.org/wiki/Gumbel_distribution
    which models the distribution of the maximum (or minimum) of a set of samples with some underlying distribution.

    IMO this “extreme value distribution” may naturally arise in some benchmarks — e.g. when the “task” that is measured lasts the duration of the longest of a number of subtasks, the duration of the subtasks being randomly distributed [according to some underlying distribution].

    Just wondering how you decided it’s a lognormal you have vs. a Gumbel, because if it’s by eye, they both look pretty similar.

    1. Just wondering how you decided it’s a lognormal

      My blog post says that in some cases, the log normal is a better model than the normal distribution (at a high level). My argument and methods do not rely on the exact nature of the distribution.
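
      For what it is worth, here is one way such a check could be sketched with SciPy, on synthetic stand-in timings: fit both candidate distributions and compare their log-likelihoods. As noted above, the post’s argument does not hinge on which one wins.

      ```python
      # Compare a lognormal fit against a Gumbel fit on a set of timings.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      timings = rng.lognormal(mean=np.log(100), sigma=0.1, size=2000)  # stand-in data

      lognorm_params = stats.lognorm.fit(timings, floc=0)
      gumbel_params = stats.gumbel_r.fit(timings)

      ll_lognorm = np.sum(stats.lognorm.logpdf(timings, *lognorm_params))
      ll_gumbel = np.sum(stats.gumbel_r.logpdf(timings, *gumbel_params))

      print("log-likelihood, lognormal:", ll_lognorm)
      print("log-likelihood, Gumbel   :", ll_gumbel)
      ```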

  6. Antoine says:

    The slide deck is very interesting.

    Question: your definition of a compute-bound benchmark is that the data fits in the CPU cache. When designing a micro-benchmark, would you distinguish between fitting in the L3 cache (which is typically shared between cores and may add benchmark noise) and fitting in the L1/L2 caches? Would you actively try to avoid the L3 (or, more generally, the last-level cache)?

    1. It is possible to design a routine that is bound by the speed of the L3 cache, while not memory-bound per se, but I don’t expect that to be a common occurrence.
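
      One way to explore the question above is to sweep the working-set size across the cache levels and time random accesses. This NumPy sketch is only illustrative (interpreter and gather overhead are significant, and the cache sizes are whatever your machine happens to have), but the transition out of cache usually still shows up as a jump in the per-access cost.

      ```python
      # Time random gathers over working sets of increasing size.
      import time
      import numpy as np

      rng = np.random.default_rng(0)
      for size_kib in (16, 64, 256, 1024, 4096, 16384, 65536):
          n = size_kib * 1024 // 8                  # number of 8-byte elements
          data = np.arange(n, dtype=np.int64)
          idx = rng.integers(0, n, size=1_000_000)  # random access pattern
          start = time.perf_counter()
          total = int(data[idx].sum())              # dominated by cache/memory behavior at large sizes
          elapsed = time.perf_counter() - start
          print(f"{size_kib:>6} KiB working set: {elapsed * 1e9 / len(idx):6.2f} ns/access")
      ```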

  7. Lars Clausen says:

    Why not use the median?

  8. I’ve also always learned that “the minimum is more sensitive to outliers”. However, I suppose there’s a difference between a microbenchmark and a larger benchmark: in a microbenchmark there’s naturally less variation, so the minimum is more “stable”, while in a larger benchmark you’re less likely to find the true minimum after a limited number of runs.

    I’d also add that I’ve always considered the goal of statistics to be reducing many numbers (e.g. thousands of individual measurements) to a few. However, trying to summarize performance results in just one or two numbers – whether it’s the minimum, median, or average & std dev – will always be too simplistic, especially because distributions are often asymmetric, so an average and a standard deviation don’t tell you much. Therefore I think it’s practically a necessity to use something like a box plot or violin plot (see the sketch after this thread). Then you automatically get the minimum, median, etc., and an overview of the underlying distribution, which is much more informative.

    1. I’ve also always learned that “the minimum is more sensitive to outliers”.

      Do you have an example in the context of a benchmark, where you are measuring the running time of a function and the minimum ends up being an outlier in the data?

      1. A simple example would be to benchmark random access in an unordered map. If you are really lucky, the random accesses will look up consecutive elements (or even the same element).
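
        A sketch of that scenario, with a plain Python dict standing in for the unordered map (sizes and counts are arbitrary): one lucky run whose random keys collapse onto a tiny, cache-resident set drags the minimum down, while the median is unaffected.

        ```python
        # One "lucky" run of mostly-hot keys makes the minimum an outlier.
        import random
        import statistics
        import time

        table = {i: 2 * i for i in range(1_000_000)}

        def one_run(keys):
            start = time.perf_counter()
            s = 0
            for k in keys:
                s += table[k]
            return time.perf_counter() - start

        runs = []
        for r in range(30):
            if r == 7:
                # The lucky run: keys drawn from a tiny range, so everything stays in cache.
                keys = [random.randrange(100) for _ in range(100_000)]
            else:
                keys = [random.randrange(1_000_000) for _ in range(100_000)]
            runs.append(one_run(keys))

        print("min   :", min(runs))                 # dominated by the lucky run
        print("median:", statistics.median(runs))   # robust to it
        ```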
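
    And, as referenced in the comment above, a sketch of the box-plot/violin-plot summary, using synthetic stand-in timings for two hypothetical variants:

    ```python
    # Show whole timing distributions per variant instead of one summary number.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    results = {  # synthetic stand-in timings, in ms
        "variant A": rng.lognormal(np.log(10.0), 0.08, size=500),
        "variant B": rng.lognormal(np.log(10.5), 0.15, size=500),
    }

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.boxplot(list(results.values()))
    ax1.set_xticklabels(results.keys())
    ax1.set_title("box plot")
    ax1.set_ylabel("time (ms)")
    ax2.violinplot(list(results.values()), showmedians=True)
    ax2.set_title("violin plot")
    plt.tight_layout()
    plt.show()
    ```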