Travis Downs says:
It seems fine.
The reported times are a blink of an eye: less than 1% of one millisecond in the case of the Intel server. That’s a small fraction of almost any total process runtime (i.e., much faster than a “hello world” process), so when it comes to amortization, I think you’ll either have enough work, or it didn’t matter anyway.
What would be interesting would be an evaluation of the fastest amortized methods of using threads. I still don’t think functions of ~100 cycles are going to see a benefit, just due to inter-core communication costs, but you can probably squeeze out a benefit at 1,000 cycles.
All this definitely means that extracting performance from threads isn’t easy – it’s a hard engineering problem.
Sandeep says:
We use thread pools at work quite a lot, and it is really difficult to squeeze out full performance via them.
According to Martin Thompson (https://mechanical-sympathy.blogspot.com/2011/08/inter-thread-latency.html), his inter-core latency is 45 ns. However, for real physical cores, I have never been able to get below ~75 ns on Skylake (same NUMA node), which implies a round-trip time of ~150 ns (~500-600 cycles). All these are sort of theoretical numbers, so I agree with your assertion that if the task takes less than ~1,000 cycles, you can’t squeeze out performance via threading.
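For context, inter-core round-trip numbers like these typically come from a ping-pong microbenchmark, roughly like the sketch below (the iteration count is arbitrary, thread pinning and warm-up are omitted, and this is not necessarily the methodology behind the figures quoted above):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two threads bounce a flag back and forth; each iteration is one round
// trip, i.e., two cache-line transfers between cores.
int main() {
  constexpr int rounds = 1000000;
  std::atomic<int> flag{0};
  std::thread responder([&] {
    for (int i = 0; i < rounds; i++) {
      while (flag.load(std::memory_order_acquire) != 1) { }  // wait for ping
      flag.store(0, std::memory_order_release);              // pong
    }
  });
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < rounds; i++) {
    flag.store(1, std::memory_order_release);                // ping
    while (flag.load(std::memory_order_acquire) != 0) { }    // wait for pong
  }
  auto stop = std::chrono::steady_clock::now();
  responder.join();
  double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  std::printf("~%.0f ns per round trip\n", ns / rounds);
}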
Travis Downs says:
Yeah, I think 1,000 is where it starts to become interesting, since now your inter-core round-trip times are, say, a single-digit percentage of the total time, so it starts to seem plausible. But it is a hard problem to exploit: you have virtually no slack.
Forget making even one kernel call per thread; it has to be zero on average.
Stephen P. Schaefer says:
Why do kernel calls diminish thread benefits? If other threads can continue to do useful work, what burden does the kernel call in one thread impose on the others?
John Dubchak says:
I like your scale here. It puts it into terms that engineers can start to reason about as they design their applications and think about introducing concurrency primitives.
Wouldn’t one plug this cost into the formula at https://en.wikipedia.org/wiki/Amdahl%27s_law to determine when it’s worth attempting to use multiple threads?
I think that’s different. Here I am measuring overhead…
If you have a small, embarrassingly parallel problem, Amdahl’s law will allow for good results… but if the problem is too small, you won’t get good results.
Suppose you have four independent jobs to do, but it takes you one hour on the phone to hire an intern to do them. You might be better off doing the four tasks yourself.
Travis Downs says:
I think you can use a variant of Amdahl’s law and this is kind of how I think of it.
Essentially, the overhead is like the serial portion in Amdahl’s law. So if the overhead is 1,000 ns and the single-threaded work is 3,000 ns, you use a serial fraction of 0.25. Really, the overhead is not serial, but rather parallel and proportional to the number of threads, but that works out to almost the same thing.
One way of thinking about it is that the serial part is the part where the original thread is doing work, and the other threads are suffering their overhead period, after which there is a parallel phase.
This derivation is inexact in the sense that some of the overhead also occurs on the original thread, so the overhead is not really all parallel. In fact, I think this makes it kind of very not-Amdahl if most of the overhead ends up on the original thread, but I’m going to submit this comment anyway.
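To make that arithmetic concrete, here is a minimal sketch using the illustrative numbers from the comment above (1,000 ns of per-thread overhead, 3,000 ns of single-threaded work; neither is a measurement):

#include <cstdio>

// Model from the discussion: each of n workers pays a fixed overhead,
// then does 1/n of the work. Compare against doing the work on a single
// thread with no overhead at all.
int main() {
  const double overhead_ns = 1000.0;  // assumed per-thread overhead
  const double work_ns = 3000.0;      // assumed total single-threaded work
  for (int n = 1; n <= 8; n *= 2) {
    double time_ns = overhead_ns + work_ns / n;
    std::printf("%d thread(s): %4.0f ns, speedup vs. plain single-threaded: %.2fx\n",
                n, time_ns, work_ns / time_ns);
  }
}

As n grows, the time approaches the 1,000 ns of overhead, i.e., a 4x cap relative to the 4,000 ns total (Amdahl with a 0.25 serial fraction), but only a 3x cap relative to the 3,000 ns of actual work.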
JEAN-BAPTISTE NIVOIT says:
I like the way you put it.
To me, the “p” parameter in Amdahl’s law is the “total work”, and that work is composed of 1/ overhead and 2/ useful work that can be carried out in parallel.
Furthermore, the overhead has 2 components: one that is spent on the main thread while spawning another thread (or communicating with other threads) and another that is spent on the worker thread before doing useful work and that is just the cost of acquiring the task (for instance it’s the cost of dequeueing the task).
I agree with Travis that the cost of spawning threads or communicating with a pool of threads is proportional to the number of workers (i.e. it gets worse when there are more workers).
Another note, just doing “counter++” causes inter-core traffic and false sharing, so this example might even be susceptible to overhead of the CPU architecture (not even talking about OS or C runtime overhead).
Another note, just doing “counter++” causes inter-core traffic and false sharing
I am not sure that there is false sharing. My own definition of false sharing is when two active threads modify the same cache line, but at different locations in the cache line.
JEAN-BAPTISTE NIVOIT says:
That may be the strict definition of false sharing, but the effects are the same: the cache lines must be synchronized between cores and that makes other cores stall when the variable is modified on one core (marked “E” in the MESI protocol). This forces local caches to wait and refresh their copy, only to attempt making the line exclusive locally.
I understand and agree with what you write, but coming back to the terminology, why would it be “false” sharing? My understanding of “false sharing” is that you may have two threads that appear to work on different data… and they are… except that if the cache lines are the same, then it is as if they shared the same data. Hence the qualifier “false” (they are not really sharing).
BTW I have updated my code with an empty thread benchmark. The empty thread is slightly faster as you would expect.
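For readers who want to reproduce it, an empty-thread benchmark can be as simple as the following sketch (the iteration count is arbitrary, and this is not necessarily the exact code in the repository):

#include <chrono>
#include <cstdio>
#include <thread>

// Create and join a thread that does nothing, and report the average
// cost per thread.
int main() {
  constexpr int iterations = 1000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; i++) {
    std::thread t([] {});
    t.join();
  }
  auto stop = std::chrono::steady_clock::now();
  double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  std::printf("~%.0f ns per empty thread\n", ns / iterations);
}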
JEAN-BAPTISTE says:
You are right, I was abusing the meaning of “false sharing”. Indeed in the case of this code, the sharing is very much true 🙂
My sentence was ‘just doing “counter++” causes inter-core traffic and false sharing’, so ‘inter-core traffic’ (a.k.a. true sharing) is the concern here, although there could also be false sharing if some other variable colocated with “counter” happened to be read/written from other threads (not the case in your code).
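To illustrate the distinction being drawn here, a small sketch (not the benchmark from the post; the 64-byte cache-line size and the iteration counts are assumptions): with Nearby each thread updates its own counter but the two counters likely share a cache line (false sharing), while Padded forces them onto separate lines.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Nearby {
  std::atomic<long> a{0};
  std::atomic<long> b{0};  // probably on the same cache line as 'a'
};
struct Padded {
  alignas(64) std::atomic<long> a{0};
  alignas(64) std::atomic<long> b{0};  // forced onto a different cache line
};

// Two threads hammer their own counters; the only difference between the
// two runs is the memory layout of those counters.
template <typename Counters>
double hammer(Counters& c) {
  auto start = std::chrono::steady_clock::now();
  std::thread t1([&] { for (int i = 0; i < 10000000; i++) c.a.fetch_add(1, std::memory_order_relaxed); });
  std::thread t2([&] { for (int i = 0; i < 10000000; i++) c.b.fetch_add(1, std::memory_order_relaxed); });
  t1.join();
  t2.join();
  return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
}

int main() {
  Nearby nearby;
  Padded padded;
  std::printf("same cache line (false sharing): %.1f ms\n", hammer(nearby));
  std::printf("separate cache lines:            %.1f ms\n", hammer(padded));
}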
Travis Downs says:
I agree, but I guess the effect is probably very small here: Daniel is only creating one thread at a time, this type of inter-core communication is probably on the order of 100 ns, and the test results are roughly 10,000 ns per thread in the best case.
This kind of stuff becomes much more important once you’ve eliminated the overhead of creating and tearing down a thread for each unit of work, and especially once you’ve eliminated any kernel calls (at which point the 100 ns matters and may be the dominant factor limiting how much you can do in parallel).
Marius Mikučionis says:
I am surprised by such great variability among CPUs with the same kernel, thanks for that. Other OSes would probably be a terrible publicity stunt…
Could you try the following cheat?
auto f = std::async(std::launch::deferred, []{ counter++; });
f.get();
Also, C++ does not have standard thread pools, but the C++17 parallel algorithms have to manage threads automagically somehow. It would be interesting to inspect their overhead. There are several competing implementations, though.
I added async but with std::launch::async. I think that std::launch::deferred is guaranteed not to start a new thread, so it is not relevant.
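For reference, the std::launch::async variant is a one-line change to the snippet above (as there, counter is assumed to be a variable with static storage so the capture-less lambda can refer to it):

auto f = std::async(std::launch::async, []{ counter++; });  // runs the task as if on a new thread
f.get();  // wait for completion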
I have a recollection from my undergrad years, when I was reading some of the Solaris man pages, that they advised you shouldn’t create a new thread for a task unless the task required a minimum of 1,000 instructions; otherwise the overhead wasn’t worth it.
It’s a bit curious that 20+ years later, despite advances in hardware, your own analysis comes to about the same conclusion. Maybe I shouldn’t be surprised… even if the instructions are 10x faster now (say), I suppose the ratios would all stay the same.
Karen T says:
Does Rust have more or less overhead than C++ when creating threads?
Olzvoi says:
It would be interesting to compare the energy costs of the three CPUs. ARM is said to be energy efficient, but it took more time for the task.
So, how much energy per task do these CPUs consume?
Structuring your code to use thread pools could be helpful in some situations. It serves the same purpose as a database connection pool: we know that establishing a connection is slow, so you reuse connections instead of creating new ones.
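A minimal fixed-size thread pool, as a sketch of that reuse idea (not the code from the post; no futures, no exception handling):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
  explicit ThreadPool(std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
      workers.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(queue_mutex);
            queue_cv.wait(lock, [this] { return stopping || !tasks.empty(); });
            if (stopping && tasks.empty()) return;  // drained and shutting down
            task = std::move(tasks.front());
            tasks.pop();
          }
          task();  // run the task outside the lock
        }
      });
    }
  }
  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(queue_mutex);
      tasks.push(std::move(task));
    }
    queue_cv.notify_one();
  }
  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> lock(queue_mutex);
      stopping = true;
    }
    queue_cv.notify_all();
    for (auto& w : workers) w.join();
  }
private:
  std::vector<std::thread> workers;
  std::queue<std::function<void()>> tasks;
  std::mutex queue_mutex;
  std::condition_variable queue_cv;
  bool stopping = false;
};

int main() {
  ThreadPool pool(4);
  for (int i = 0; i < 1000; i++) {
    pool.submit([] { /* the unit of work goes here */ });
  }
}  // destructor drains the queue and joins the workers

Each unit of work then costs a queue push and a condition-variable wake-up rather than a full thread creation and teardown.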
Tarun Elankath says:
If you are measuring thread creation+termination overhead, then why are you using a shared counter and incrementing it? That will likely consume a large component of your time in a loop.
I am curious if you are following your practice of significant digits when reporting these numbers? 🙂
https://lemire.me/blog/2012/04/20/computer-scientists-need-to-learn-about-significant-digits/
I actually do. I use one significant digit.
But you don’t use scientific notation, so nobody knows.
You do not need to use scientific notation to express your results using a given number of significant digits. Of course, scientific notation is more precise in this respect, but it is also less readable.
Incrementing a counter is a negligible cost.