Developers often believe that software performance follows a Pareto distribution: 80% of the running time is spent in 20% of the code. Using this model, you can write most of your code without any care for performance and focus on the narrow pieces of code that are performance sensitive.
nonnull says:
This also ignores nonlocal effects. It is entirely possible for all of the following to be true simultaneously:
80% of the time is spent in 20% of the code.
There is not much more optimization to be squeezed out of said 20% of the code.
The code can be optimized significantly.
If I write a piece of code that, for instance, temporarily grabs a huge amount of memory and causes the OS to drop most of its page cache as a result, said piece of code can be “fast” and thus not really show up on profiles… but its overall effect on the program can be huge.
This sort of aggressor/victim issue shows up all over the place: CPU caches, branch prediction, TLBs, the OS page cache (and paging in general), texture caches, disk accesses, atomic contention, hyperthreading, JIT deoptimizations, and so on.
I’ve seen too much code that is far slower than it could be because the authors have effectively given up on optimizing it further: they’ve heavily optimized the “hot” section and ignored the “cold” sections, missing that the reason the “hot” section is slow is that it is a victim of the “cold” sections.
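A toy sketch of the aggressor/victim pattern nonnull describes; the sizes, names, and workload below are invented purely for illustration:

    import array

    def cold_setup():
        # Looks "fast" in a profile, but touches a large buffer and thereby
        # pushes other data out of the CPU caches (and, at larger scales,
        # pressures the OS page cache).
        scratch = array.array("d", range(50_000_000))  # roughly 400 MB of doubles
        return sum(scratch[::4096])

    def hot_loop(data):
        # The profiler blames this loop for most of the time, yet part of its
        # cost is re-warming the caches that the "cold" code just flushed.
        total = 0.0
        for x in data:
            total += x * x
        return total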
Ruben Vorderman says:
I once made a custom hash table implementation as a Python function. I used Python’s C API to request a function pointer to Python’s builtin hash function (SipHash). This performed well enough, but I found that using XXHash instead made the function even faster.
Then I started to benchmark the entire program rather than just the function, and the XXHash implementation made it slower. I think this is because Python uses its own hash function quite a lot (lots of things are Python dictionaries under the hood), which kept it in the L1 instruction cache. By the time my custom hash table function came around, fetching the instruction sequence for XXHash was slower than using Python’s builtin hash.
So this is exactly as nonnull describes.
Another example is where I benchmarked a Python script that did not take cache locality into account, operating on whole large lists. The slowness did not show up in the profiler: the script took 26 seconds, but the cumulative time of all the functions profiled was less than 8 seconds. So I rewrote it to perform all the transformations on one item at a time using iterators. That made it 10 times faster.
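A minimal sketch of the kind of restructuring Ruben describes; the function names and transformations here are hypothetical, not taken from his script:

    def parse(line):
        # Assumed per-record transformations, for illustration only.
        return line.strip().split("\t")

    def normalize(fields):
        return [f.lower() for f in fields]

    def to_record(fields):
        return {"key": fields[0], "rest": fields[1:]}

    def process_list_at_a_time(lines):
        # Each pass materializes a full list, so every intermediate result is
        # written out and falls out of cache before the next pass reads it.
        parsed = [parse(l) for l in lines]
        normalized = [normalize(p) for p in parsed]
        return [to_record(n) for n in normalized]

    def process_item_at_a_time(lines):
        # Lazy iterator pipeline: each item goes through all transformations
        # while it is still hot in cache; nothing large is materialized.
        return (to_record(normalize(parse(l))) for l in lines)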
Egon says:
I like to think of hotspot optimization as a way of getting closer to a local optimum, though not necessarily the global optimum.
One important thing about hotspot optimization is that the initial 2x wins are often trivial changes — and the sad part is that people don’t do those initial optimizations. The “trivial thing” is usually something like, “Oh, you forgot to configure your database connection pooling to allow multiple connections”. It’s also often not because they don’t know how to do the trivial optimization, but because they haven’t realized that there is a problem in the first place.
But as you’ve said, there’s only so far you can go with them. If I don’t see an obvious place in the profile to make the small fix, then I’ll stop.
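As a concrete (and entirely hypothetical) version of the pooling fix Egon mentions, this is roughly what the one-line change looks like with SQLAlchemy; the connection URL and the numbers are placeholders:

    from sqlalchemy import create_engine

    # Making the pool size explicit, instead of relying on whatever the default
    # happens to be, is the kind of trivial change that yields the easy early wins.
    engine = create_engine(
        "postgresql://user:password@localhost/appdb",  # placeholder URL
        pool_size=20,      # keep up to 20 persistent connections
        max_overflow=10,   # allow 10 extra connections under burst load
    )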
Lee says:
“Premature optimization is the root of all evil” is sometimes attributed to Hoare, but it was Knuth.
Hi Daniel,

A fellow academic and a long-time reader/fan of your blog here.
The code that handles errors could be quite slow and it would not impact most of your users.
It is a common belief/recommendation that “optimizing the error path is unnecessary because it rarely happens”. Our research shows that slow error paths can lead to what we call “metastable failures”. Please see this paper for a high-level description of these failures, specifically section 2.3, and this paper for in-the-wild examples. Here’s a public example (also cited in the 2nd paper) of a case where a slow error path caused a metastable failure: https://engineering.atspotify.com/2013/06/incident-management-at-spotify/

Cheers

Yes, slow paths can be a vulnerability.
A good idea is to decouple the 80/20 rule from the conclusion that changes must be applied to that hot 20% of the code.
Another is to broaden the 80/20 rule to data, rather than just code. Profilers can help by highlighting hot data.
jerch says:
I have been confronted with the “premature optimization…” quote more than once as a sloppy excuse for weak code. The worst case was (me) “you have a quadratic runtime there, which can be done in linear time” – “yeah, already figured, but it was more straightforward to code it this way, and btw premature optimization…” – Haha, well, the problem lay sleeping for 2 months before it shot back, just when the first real data got processed…
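The thread does not show the code behind that anecdote, so here is only a generic sketch of the quadratic-versus-linear difference in question:

    def dedupe_quadratic(items):
        seen = []
        out = []
        for x in items:
            if x not in seen:   # list membership is O(n), so the loop is O(n^2)
                seen.append(x)
                out.append(x)
        return out

    def dedupe_linear(items):
        seen = set()
        out = []
        for x in items:
            if x not in seen:   # set membership is O(1) on average, so the loop is O(n)
                seen.add(x)
                out.append(x)
        return out

Both functions behave identically on small inputs; the difference only shows up once the first real data arrives.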
I always understood Knuth’s quote more as a warning not to go down the rabbit hole chasing the last <5% of optimization, which gets harder and harder once the low-hanging fruit has already been picked. But using it as an excuse not to think about a better approach in the first place is imho just wrong.
Sometimes I wonder if CS courses should emphasize bare-metal mechanics (computer architecture) & complexity more again, as the knowledge of what a machine really has to do with the data seems to get thinner. Of course this is only anecdotal evidence on my side, but it became more of an issue at our local university when they moved from C to Java for their introductory programming course. “The GC will fix that for me” is already pretty close to “don’t make me think”. Oh wait, there is ChatGPT to the rescue – isn’t it? 😉
Sean Jensen-Grey says:
The quote

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

is from Knuth’s “Structured Programming with go to Statements”, page 8 of 41. The next paragraph contains the important part that everyone skips.

The quote with more context:
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
After working with such tools for seven years, I’ve become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.
The Knuth performance quote gets abused almost as often as programmers optimize the wrong things. It doesn’t look like Knuth is arguing against architectural efficiency or thinking about performance. The context is optimizing existing code needlessly w/o measuring. And that measuring should be a critical part of developing a system, so much so that it should be always-on.
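A rough sketch of what always-on cost feedback can look like at the application level; the decorator, the NO_TIMING_REPORT switch, and the report format are invented for illustration (Knuth was, of course, talking about compiler support):

    import atexit
    import functools
    import os
    import time
    from collections import defaultdict

    _totals = defaultdict(float)

    def instrumented(fn):
        # Accumulate wall-clock time per function for the lifetime of the process.
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                _totals[fn.__qualname__] += time.perf_counter() - start
        return wrapper

    @atexit.register
    def _report():
        # Print the most expensive functions at exit, unless explicitly turned off.
        if os.environ.get("NO_TIMING_REPORT"):
            return
        for name, seconds in sorted(_totals.items(), key=lambda kv: -kv[1]):
            print(f"{name}: {seconds:.3f}s")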
nonnull says:
The context is optimizing existing code needlessly w/o measuring.
The fatal flaw in this approach is that you can have regions of inexpensive code that nevertheless significantly slow down the program overall. (For instance, forcing the OS to drop its page cache.) I talk about this more in my prior comment.
If you require prior measurement of “need” to optimize a piece of code, you’ll miss things like this and end up in a penny-wise-pound-foolish local optimum, where everything that takes a lot of time has been overoptimized, and yet you are still slow.
jerch says:
The fatal flaw in this approach is that you can have regions of inexpensive code that nevertheless significantly slow down the program overall.
I would extend this issue to the whole (operating) system – in the end everything is fighting over register usage, cache lines, memory pages and so on, and the kernel tries to balance things out with its schedulers. But with wicked code you can raise the kernel’s runtime significantly (e.g. by sheer syscall pressure or lousy memory handling) – and the whole system may suffer.
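A toy example of the syscall-pressure point: the two functions below write the same bytes, but the first enters the kernel once per byte while the second lets buffered I/O issue a few large writes. The path and payload are placeholders.

    import os

    def write_unbuffered(path, payload: bytes):
        # One os.write() per byte: the kernel is entered len(payload) times.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            for i in range(len(payload)):
                os.write(fd, payload[i:i + 1])
        finally:
            os.close(fd)

    def write_buffered(path, payload: bytes):
        # Buffered file object: Python batches the data into a few large writes.
        with open(path, "wb") as f:
            f.write(payload)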
The interesting bit here is that nowadays performance concerns are much more IO-related, while in the past the processing speed of the CPU was a big limiting factor, too. We kinda put CPUs on steroids, but somehow the peripherals did not keep up, which puts a major burden on clever resource management on the kernel side.
The interesting bit here is that nowadays performance concerns are much more IO-related, while in the past the processing speed of the CPU was a big limiting factor
My impression is that CPU performance has long stagnated due to the failure of Dennard scaling, and the repeated failures of the leader (Intel). However, we kept getting more bandwidth.
In the last decade, our disk IO was multiplied by at least 10. My PlayStation 5 has a 5 GB/s disk. The network speeds also improved by a similar factor.
Our single-processor performance surely did not improve in a similar manner after controlling for prices. That is, you can get a 32-core Intel PC today that will be worth about 10 CPUs from ten years ago, but it is going to be expensive. Of course, GPUs came to the rescue, but GPUs are specialized devices.
jerch says:
My impression is that CPU performance has long stagnated due to the failure of Dennard scaling, and the repeated failures of the leader (Intel). However, we kept getting more bandwidth.
Yes, true – just looking at the last 10 years, we had bigger shifts on the IO side. IO lost most of its ground in the 90s and early 2000s and still has a way to go.
To me it seems that most of the M1 success story is IO-related, which may give a glimpse of how bad things are on x86, where we instead get more cache to play with (and the 1001st SIMD instruction). Of course the M1 has very different production needs and costs with its big-die approach – no clue whether that would even scale for broader industry adoption. However, it seems the M1 was needed to wake the x86 vendors from their slumber.
On a side note: I am really curious how the RISC-V vector extension with its variable-length registers will turn out; it reads a bit like a re-invention of the old Cray machines.

To me it seems that most of the M1 success story is IO-related,

The M1 can retire 8 instructions per cycle, something that no x64 processor can do.

The ARM-based Graviton processors on AWS can beat the best of the Intel server processors on a computational task (parsing URLs): https://lemire.me/blog/2023/03/01/arm-vs-intel-on-amazons-cloud/
I am still very excited about Intel processors, especially AVX-512…
Pierre B. says:
While it is a fact that the M1 can retire 8 instructions per cycle and x86 processors do not, you surely know this is a very misleading representation.
The historical take of x86 on this subject has always been “well, our instructions do more per instruction”, which has always been only half-true at best. OTOH, modern x86 processors convert the x86 instructions to an internal, hidden instruction architecture, which is what really gets executed. While I believe there is no data sheet that tells us how many of those internal instructions are retired per cycle, simple logic guarantees that the number is higher than the known x86-level instructions per cycle.
So, it is almost certain that Intel and AMD are retiring just as many instructions per cycle as the M1. It’s just that the instructions retired are the internal ones. As proof, one only has to look at the block diagram of such a processor: the number of different execution units clearly indicates that they are handling more internal instructions than the IPC numbers suggest.
it is almost certain that Intel and AMD are retiring just as many instructions per cycle as the M1.
I think you meant to refer to micro-operations. See https://lemire.me/blog/2023/05/12/arm-instructions-do-less-work/
If you keep buying Intel skill issue.