Adam Scott says:
Very informative and the code looks great. In my experience profiling on a Sandy Bridge CPU, I saw what looked like a memory performance bottleneck. Now I have a good way to compare. It looks like C++11 is a requirement, so I will use it on newer machines.
Ryan says:
How does memory parallelism relate to SIMD parallelism? That is, would a SIMD instruction need only one read to access a chunk of data?
On the CPU, are multiple nearby memory requests coalesced into a single read?
When reading into a SIMD register, you can issue a single instruction and load more data than is possible with a general-purpose register.
But it is not clear to me how this relates to the numbers I provide here: you access whole cache lines (typically 64 bytes) even if you only need 1 byte.
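To make the cache-line point concrete, here is a minimal sketch that counts how many cache lines a single load covers. It assumes 64-byte lines (the typical value mentioned above); the name `cache_lines_touched` is hypothetical and is not part of the testingmlp tool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, typical on current x86 processors.
constexpr std::size_t kCacheLineBytes = 64;

// Number of distinct cache lines covered by `size` bytes starting at
// byte address `addr`.
inline std::size_t cache_lines_touched(std::uint64_t addr, std::size_t size,
                                       std::size_t line = kCacheLineBytes) {
  assert(line > 0);
  if (size == 0) return 0;
  const std::uint64_t first = addr / line;              // line holding the first byte
  const std::uint64_t last = (addr + size - 1) / line;  // line holding the last byte
  return static_cast<std::size_t>(last - first + 1);
}
```

Under these assumptions, a 1-byte load and an aligned 32-byte SIMD load both touch a single line, so wider registers do not by themselves change the number of cache-line requests. A 32-byte load starting at offset 48 straddles two lines.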
Travis Downs says:
That’s a good result from Zen2. I wonder whether there was an improvement here over Zen. I thought Zen topped out at around 12-16, but I could be misremembering.
In fact we can’t even tell from that chart where Zen2 (or CNL, probably) even tops out.
Do I read the chart correctly that all 3 systems have nearly identical single-lane throughput, or was it normalized somehow?
@Travis I posted the raw results. See the end of the post. No, there was no normalization; this is the computed bandwidth.
I think that there are differences in single-lane bandwidth, though they are not large. You can see it just from the plot (the lines don’t quite overlap).
Travis Downs says:
Ah, I see the raw results. There is a flaw with using clock() on some systems and it is evident here: the resolution is very low. I had it on my TODO list to fix it, but never got around to it, I guess.
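For reference, one common way to sidestep a coarse clock() in C++11 is to time with std::chrono::steady_clock. The sketch below is only an illustration, not the fix that ended up in the tool; `elapsed_seconds` and `clock_granularity_seconds` are hypothetical helper names.

```cpp
#include <chrono>
#include <ctime>
#include <utility>

// Wall-clock time taken by `work`, in seconds, measured with a monotonic clock.
template <typename F>
double elapsed_seconds(F&& work) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<F>(work)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Smallest step that clock() is observed to take on this system, in seconds.
// Returns -1.0 if clock() is unavailable.
inline double clock_granularity_seconds() {
  const std::clock_t t0 = std::clock();
  if (t0 == static_cast<std::clock_t>(-1)) return -1.0;
  std::clock_t t1 = t0;
  while (t1 == t0) t1 = std::clock();  // spin until the reported value changes
  return static_cast<double>(t1 - t0) / CLOCKS_PER_SEC;
}
```

If `clock_granularity_seconds()` reports a step on the order of milliseconds, short runs timed with clock() will land on a handful of discrete values, which would explain identical bandwidth figures in neighbouring rows.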
Travis Downs says:
Unless I am misreading something, the data does not correspond to the chart? E.g., the last three BW values for Rome are all identical (9752) for 23, 24, 25 but the purple line in the chart is clearly different (and seems to be > 9752) as it does not flatline for those points.
You are correct. I just forgot to update the numbers, though I did update the figure. So the mistake I made initially was in not using huge pages.
I’ll push an update.
Travis Downs says:
I’m working on some updates to the tool that, among other things, avoid the poor resolution of clock() on CentOS, so the chart won’t be so quantized.
Travis Downs says:
MLP analysis has cracked the mainstream tech press, see for example this Zen review by Andrei and Gavin at AnandTech.
I like the way the chart is made, across various sizes and normalized to the 1-mlp speed (so the y-axis is “speedup relative to 1-mlp”).
Here are some more charts in this vein that I just made. Andrei claims that SKX can reach an MLP of much more than 10, e.g., based on this chart, which shows it hitting speedups of more than 25x, but I have to think this is measurement error. I couldn’t replicate it (admittedly on different-but-still-SKX hardware).
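The normalization described above (speedup relative to the 1-mlp speed) amounts to dividing each measurement by the single-lane one. A minimal sketch, assuming the measurements are stored so that index i holds the result for i + 1 lanes; `speedup_vs_one_lane` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// bandwidth[i] is the measured throughput with i + 1 lanes, in any consistent unit.
// Returns each value divided by the single-lane value, so the first entry is 1.
inline std::vector<double> speedup_vs_one_lane(const std::vector<double>& bandwidth) {
  assert(!bandwidth.empty() && bandwidth[0] > 0.0);
  std::vector<double> speedup;
  speedup.reserve(bandwidth.size());
  for (double bw : bandwidth) speedup.push_back(bw / bandwidth[0]);
  return speedup;
}
```

Plotted this way, the curve flattens at roughly the number of memory requests the core can keep in flight. That is why a 25x reading on hardware believed to top out near 10 looks suspicious.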
Very interesting. A lot has changed since I studied algorithmics. It is not easy to keep track of technological advances and their impacts on how to write fast code, even more so when programming is not a central part of one’s job.
Is the graph from a paper, or is it based on an ad hoc test you made specifically for this blog post?
The graph was built just for the blog post. I make the raw results available. I use the testingmlp software package.
Is it possible to see the source code? I am intrigued by the memory lines correlation.
*I meant lanes, of course
I provide a link to the source code, see at the bottom of the post.
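For readers who want the idea before opening the repository, here is a rough sketch of what "lanes" means. It is not the testingmlp code: it assumes the usual pointer-chasing setup, where the array holds one random cycle and each lane follows that cycle from its own starting point. `make_cycle` and `chase` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Build a random permutation made of a single cycle over 0..n-1
// (Sattolo's algorithm), so that following next[i] visits every slot.
inline std::vector<std::uint32_t> make_cycle(std::size_t n, std::uint64_t seed) {
  assert(n >= 2);
  std::vector<std::uint32_t> next(n);
  for (std::size_t i = 0; i < n; ++i) next[i] = static_cast<std::uint32_t>(i);
  std::mt19937_64 rng(seed);
  for (std::size_t i = n - 1; i > 0; --i) {
    std::uniform_int_distribution<std::size_t> pick(0, i - 1);
    std::swap(next[i], next[pick(rng)]);
  }
  return next;
}

// Follow the cycle for `steps` hops on `lanes` independent chains.
// Within a lane each load depends on the previous one; across lanes the
// loads are independent, so the processor may overlap their cache misses.
inline std::uint64_t chase(const std::vector<std::uint32_t>& next,
                           std::size_t lanes, std::size_t steps) {
  assert(lanes >= 1 && lanes <= next.size());
  std::vector<std::uint32_t> pos(lanes);
  for (std::size_t l = 0; l < lanes; ++l)
    pos[l] = static_cast<std::uint32_t>(l * (next.size() / lanes));
  for (std::size_t s = 0; s < steps; ++s)
    for (std::size_t l = 0; l < lanes; ++l)
      pos[l] = next[pos[l]];
  std::uint64_t sum = 0;  // returned so the compiler cannot discard the loads
  for (std::uint32_t p : pos) sum += p;
  return sum;
}
```

To measure anything meaningful, the array would have to be much larger than the last-level cache, and a serious benchmark would likely fix the lane count at compile time so the positions stay in registers. Timing `chase` for 1, 2, 3, ... lanes and dividing the bytes touched by the elapsed time gives a bandwidth-versus-lanes curve of the kind plotted in the post.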
I somewhat doubt you were testing Cannon Lake…? That’s a broken CPU shipped in very low quantities and by now EOL. Did you mean Ice Lake?
Yes, I mean Cannon Lake. I am still hoping to get an Ice Lake server but I could only find laptops so far.
Cannon Lake is a sane CPU, but yes it is uncommon.