You speak of performance ‘in the small’, e.g. “how do I make this map lookup faster” instead of “can I use precise dataflow analysis to eliminate the intermediate storage for the tuple space?” I think this isn’t unusual. Computer scientists tend to focus on the problem in front of them, not the larger context.
Unfortunately, this narrow focus creates a great deal of impedance and inefficiency at larger scales. Often, it is necessary to relinquish ‘local’ efficiency to improve ‘global’ efficiency – e.g. switching from high-performance batch-processing algorithms to lower-performance variations that can be pipelined compositionally.
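To make the trade-off concrete, here is a minimal C++ sketch (a toy of my own; the function names are invented for illustration). The batch version materializes an intermediate vector between stages; the fused, pipelined version streams each element through both stages and needs no intermediate storage:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Batch style: each stage is individually fast, but the intermediate
// vector costs allocation and memory bandwidth between stages.
std::vector<double> scale_batch(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * 2.0;
    return out;
}

double sum_batch(const std::vector<double>& in) {
    double s = 0.0;
    for (double x : in) s += x;
    return s;
}

// Pipelined style: each element flows through both stages in one pass,
// so the intermediate storage disappears and stages still compose.
double scale_then_sum(const std::vector<double>& in) {
    double s = 0.0;
    for (double x : in) s += x * 2.0;
    return s;
}

int main() {
    std::vector<double> v(1000000, 1.0);
    std::printf("%f\n", sum_batch(scale_batch(v))); // two passes + a copy
    std::printf("%f\n", scale_then_sum(v));         // one pass, no copy
}
```

Each batch stage looks optimal in isolation; the global win comes from fusing them.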
I think we still have plenty to learn about performance in-the-large. And there is much to gain, especially with regards to removing walls between applications.
I teach a fourth-year course, Programming for Performance, which spends time on the constant factors: http://patricklam.ca/p4p
@plam
It is obviously a great course. BTW my point is not that there aren’t courses in computer science programs about performance.
What I didn’t know about performance when I moved from academia to industry was variance. When I was at SpeechWorks, I had to build a parser that ran in under 60ms on average and never peaked above 200ms. The garbage collector in JavaScript killed us. Everything ran in 20ms on average, but garbage collection kept spiking individual calls to over a minute. So the solution was to tear down and rebuild the virtual machine on each call. That took about 20ms, so average run time was now 40ms, and no more bumps from garbage collection.
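In C++ terms, that tactic looks roughly like the per-call arena sketch below: pay a small, fixed teardown cost on every request so no unbounded pause can accumulate. (This is an illustration of the idea, not the actual SpeechWorks code; the `Arena` class here is hypothetical.)

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal bump-pointer arena: allocation is a pointer increment, and
// "collection" is a single O(1) reset between requests.
class Arena {
    std::vector<std::uint8_t> buf_;
    std::size_t used_ = 0;
public:
    explicit Arena(std::size_t bytes) : buf_(bytes) {}
    void* alloc(std::size_t n) {
        std::size_t aligned = (used_ + 15) & ~std::size_t{15};
        if (aligned + n > buf_.size()) return nullptr;  // out of space
        used_ = aligned + n;
        return buf_.data() + aligned;
    }
    void reset() { used_ = 0; }  // fixed-cost "teardown" after each call
};

void handle_request(Arena& a) {
    // All per-request allocations come from the arena...
    double* scratch = static_cast<double*>(a.alloc(1024 * sizeof(double)));
    (void)scratch;  // ...parse, compute, respond...
}

int main() {
    Arena arena(1 << 20);  // 1 MiB of per-request scratch space
    for (int call = 0; call < 1000; ++call) {
        handle_request(arena);
        arena.reset();  // predictable cost; no pause can accumulate
    }
}
```

The average gets a little worse, the worst case gets enormously better — which is the whole point when you have a hard latency budget.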
Similarly, there’s a notion of latency vs. bandwidth in accessing disk, solid-state drive, memory, cache, or a register that’s very important in real applications, but that you tend not to think about much in analysis-of-algorithms classes, where memory is usually abstracted as having both constant-time access and unlimited size.
@David Barbour: Even the basic analysis-of-algorithms books like Cormen, Leiserson and Rivest touch on these issues when discussing the motivations for algorithms like B-trees. It blew my mind when I first realized this also applied at the memory level. At first I thought my test harness was misbehaving or something was getting compiled away, because when I added a bunch more floating-point arithmetic per loop the timing didn’t change. It finally dawned on me that my algorithm was memory bound. I don’t think many people realize just how expensive cache misses or failed branch predictions are. Large natural language processing models are problematic because, on average, every other lookup or so is a cache miss (there’s a very long tail, and it’s important for prediction).
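Here is roughly the kind of toy benchmark that teaches that lesson (reconstructed for illustration, so treat the details as a sketch): chase a random permutation through an array much larger than cache, then add extra floating-point work per element and watch the total time barely move.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Chase a random permutation through an array far larger than cache,
// so nearly every access is a miss; then add extra floating-point work
// per element and observe that the total time barely changes.
int main() {
    const std::size_t n = std::size_t{1} << 22;  // ~32 MB of indices
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});
    std::vector<double> data(n, 1.0);

    auto run = [&](int flops_per_element) {
        auto t0 = std::chrono::steady_clock::now();
        double s = 0.0;
        std::size_t i = 0;
        for (std::size_t k = 0; k < n; ++k) {
            double x = data[i];
            for (int f = 0; f < flops_per_element; ++f)
                x = x * 1.0000001 + 0.5;  // cheap next to a cache miss
            s += x;
            i = next[i];  // dependent random access: a likely miss
        }
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("flops=%d sum=%.1f time=%lld ms\n",
                    flops_per_element, s, (long long)ms);
    };

    run(1);  // memory bound already...
    run(8);  // ...so 8x the arithmetic costs roughly the same
}
```

On typical hardware a main-memory miss costs on the order of a hundred nanoseconds, while each of those multiply-adds costs around a nanosecond, so the miss dominates either way.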
Bob Carpenter, the solution was simply “do not use java”.
It was JavaScript (aka ECMAScript) wrapped in C, wrapped in C++, and it wasn’t me, it was the VoiceXML standard dictating JavaScript.
Java has the same issues as C++ in a lot of ways, but the garbage collector can be a killer.
I’m doing numerical analysis these days, and it’s all C++ template metaprogramming to move as much computation down to the static (compile time) level as possible.
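For a flavor of what moving computation down to the static level means, here is a toy sketch (modern constexpr style; the classic approach uses recursive templates to the same effect — this is not code from my actual project):

```cpp
#include <array>
#include <cstdio>

// A fixed-size dot product that the compiler can fully evaluate when
// the inputs are compile-time constants: no loop survives to run time.
template <std::size_t N>
constexpr double dot(const std::array<double, N>& a,
                     const std::array<double, N>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main() {
    constexpr std::array<double, 3> a{1.0, 2.0, 3.0};
    constexpr std::array<double, 3> b{4.0, 5.0, 6.0};
    constexpr double d = dot(a, b);  // evaluated by the compiler
    static_assert(d == 32.0, "computed entirely at compile time");
    std::printf("%f\n", d);
}
```

With runtime inputs the same template still unrolls to straight-line code for small fixed N, which is often the real payoff in numerical kernels.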
“…pipeline width, number of units, throughput and latency of the various instructions, memory latency and bandwidth, CPU caching strategies, CPU branching predictions, instruction reordering, superscalar execution, compiler heuristics and vectorization… and so on.”
Maybe I’m biased because I went to schools where there is a sharp division between CS and ECE (Waterloo and Toronto), but those factors fall squarely in Computer Engineering, not Science. Not saying they should never intersect, but generally they don’t.
@LaForest
How a CPU works is certainly part of a computer science education… no?
@Lemire
Not in any micro-architectural detail I think. CS (at UW and UofT, at least) is very much a branch of the Math Dept., not Engineering. So there is mostly emphasis on proofs, not the messy details of actual hardware.
(Again, not saying this is bad, and I’m glad *they* can go prove things, and then explain them to me so I can build them. I can’t prove my way out of a wet paper bag. 🙂 )
And yes, the knowledge exists, but it’s not widespread yet. (re: cache-oblivious algorithms)
Case in point: https://queue.acm.org/detail.cfm?id=1814327
@Eric
Computer scientists have been aware of caching issues for decades. It is a core element of computer science… part of any sane algorithm textbook. So yes, you have cache-aware and cache-oblivious algorithms. But this is essentially mathematical modelling… sometimes with little to no validation in the real world.
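For concreteness, the cache-oblivious recipe is plain divide-and-conquer: recurse until the subproblem fits in whatever cache level happens to exist, with no cache parameters anywhere in the code. A toy C++ transpose along those lines (a sketch of the standard idea, with an arbitrary base-case threshold):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Cache-oblivious transpose: recursively split the larger dimension.
// Once a tile fits in some cache level, the base-case loops run out of
// that cache -- without the code knowing any cache sizes.
void transpose(const double* src, double* dst,
               std::size_t rows, std::size_t cols,
               std::size_t src_stride, std::size_t dst_stride) {
    if (rows <= 16 && cols <= 16) {  // base tile; threshold is arbitrary
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                dst[c * dst_stride + r] = src[r * src_stride + c];
    } else if (rows >= cols) {
        std::size_t h = rows / 2;
        transpose(src, dst, h, cols, src_stride, dst_stride);
        transpose(src + h * src_stride, dst + h, rows - h, cols,
                  src_stride, dst_stride);
    } else {
        std::size_t w = cols / 2;
        transpose(src, dst, rows, w, src_stride, dst_stride);
        transpose(src + w, dst + w * dst_stride, rows, cols - w,
                  src_stride, dst_stride);
    }
}

int main() {
    const std::size_t n = 1024;
    std::vector<double> a(n * n), b(n * n);
    for (std::size_t i = 0; i < n * n; ++i) a[i] = double(i);
    transpose(a.data(), b.data(), n, n, n, n);
    std::printf("%f %f\n", b[1], a[n]);  // both 1024.0: transposed
}
```

Whether the elegant recursion actually beats a hand-tuned, cache-aware tiled loop on a given machine is exactly the kind of question the models don’t settle — you have to measure.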
Poul-Henning Kamp is basically saying what I am saying: that computer scientists don’t know nearly as much as they think they do about performance.
Regarding mathematics… there are infinitely many results that you can prove, and an infinite subset of them will appear interesting to some human beings. That does not mean that coming up with one more mathematical result is a valuable contribution… if you take “valuable” to mean “making people’s lives better”.