Yes, I entered the numbers in reverse; this has been fixed.
Nigel Horspool says:
What is really making Intel nervous is that at each price point, the AMD processor has many more cores than the corresponding Intel processor AND consumes less power (an important issue for server farms).
Nigel: I am quite excited about AMD being back in the race…
Jens Nurmann says:
The second one seems to be a frustrating example of “implementational divergence due to instruction set bloat” on the AMD side IMHO. Looking up instruction throughput on Zen 2 one finds
bsf / bsr – 3 / 4 cycles on r64
tzcnt / lzcnt – 0.5 / 1 cycles on r64
I’d assume that the compiler generates bsf in your benchmark – if it is the one you presented some time ago. So I am surprised that this is “only” 1/2 of Intel IPC for AMD in this case. Replacing bsf with tzcnt might reverse the situation.
Benjamin says:
Couple of open questions:
– were the Spectre and following mitigations applied on both rigs? That can go a long way toward explaining differences in the ~10–15% range, but not a 2x factor, of course
– if the build is CPU specific, counting instructions seems like a weird way to measure performance, since, as you mentioned, some instructions are a lot wider than others. By this metric, an AVX-512 build of a given benchmark could give pretty bad results when compared to an SSE build (which is not true with any metric that actually counts in that case, like throughput or perf/watt)
– if the build is not CPU specific, counting instruction throughput is only interesting if this is a close-enough-to-optimal build for both, IMO. One could imagine a CPU which is very good at extracting ILP from low-performance builds, which would be a nice skill but could be useless in an HPC context, for instance
– you keep mentioning an “old Intel CPU”, but Skylake is basically the only available architecture for anything but some thin laptops. So it’s both “old” and “current”, which contributes to making AMD competitive

This being said, I agree with your initial point that “better IPC” claims are not really qualified. I guess the implicit meaning is “getting more work done per clock cycle”.
Benjamin says:
another point is that you discard benchmarks which are memory bound, but that goes against some other tests that you did concerning memory-request parallelism, for instance. Extracting good IPC in memory-starved contexts is also meaningful, right?
In memory starved contexts, the number of instructions being retired is probably not the measure you care about. Instead, you might want to report the effective bandwidth or something that has to do with the actual bottleneck.
Benjamin says:
I disagree on that; it is my understanding that the IPC that manages to go through would be a good proxy for the job being done, despite the bottleneck. This whole “job being done” notion is, I think, the logic behind most of the “IPC” claims around.
We can reason about IPC for instruction-dense code. We know what 4.0 instructions per cycle means: it is great. For instruction-dense code, 1.0 is going to be mediocre. Basically, we have a measure of how superscalar (wide) the processor is. Achieving 6 instructions per cycle in real code would be fantastic.
For memory-bound problems, what would be a good IPC?… is 0.1 instructions per cycle good or bad? I can’t reason about it. I have some idea of what a bandwidth of 10 GB/s in random access means (it is very good).
In the simdjson benchmarks above, the builds are not CPU specific. All CPUs run almost entirely the same instructions. So yes, the number of instructions retired per cycle follows closely the performance per cycle. On a per-cycle basis, in this AVX2-intensive benchmark, AMD falls behind Intel in every way.
you keep mentioning an “old Intel CPU”, but Skylake is basically the only available architecture for anything but some thin laptops. So it’s both “old” and “current”, which contributes to making AMD competitive
That is true.
Benjamin says:
In the simdjson benchmarks above, the builds are not CPU specific. All CPUs run almost entirely the same instructions. So yes, the number of instructions retired per cycle follows closely the performance per cycle. Changing the builds (for instance -O3 vs vanilla) would change the instruction mix and throughput, all other things (task and hardware) being equal. So the correct quote is “for a given build, the number of instructions retired per cycle follows closely the performance per cycle.”, which may or may not be a good proxy for absolute performance (see AVX512 for instance)
True. There is just one x64 build here, same binary throughout.
RGRHON says:
I’m just a lowly programmer, but the speed of a single gcc compile seems irrelevant to me. Both processors perform small tasks in the blink of an eye, so whether that blink is 10ms or 12ms isn’t usually very significant to me. Small compiles are pretty equivalent in terms of time required for gcc, and trying to extrapolate those small compile results on a single core to a large number of cores is missing the point. What is relevant is that if you use a parallel make (make -j) with a large number of source files and cores, like 24, 32, or 64, the AMD processor will usually beat the pants off a processor with a lower number of cores. Same with rendering and many other long tasks. That’s significant to me in my wall-clock development time. Sure, sometimes the AMD may be a bit slower on small tasks, but small tasks don’t take long, so a single TR core is usually only fractionally slower and works fine for small tasks anyway. TR is fast enough to game on my PC with maximum settings on almost every title at 1440p and above. Not that I game much, but it’s fine. Same for analytical graphics; they just don’t take that long on today’s Nvidia 3000-series GPUs. I’ll admit that there are some tasks where bleeding-edge core speed is important, but I’ve noticed I don’t do those things as often as I compile, for example.
Travis Downs says:
Keep in mind that this project (simdjson) was extensively tuned on Intel machines and then just incidentally run on AMD as a comparison. Many choices made based on benchmark results might have gone a different way on the AMD machine, so Intel-specific quirks get built in this way.
I’m not saying it would reverse the conclusion in this case, but it’s something to remember when testing something that has been carefully tuned.
Keep in mind that this project (simdjson) was extensively tuned on Intel machines and then just incidentally run on AMD as a comparison.
That’s true, so it is a bias but I submit to you that the same bias exists on highly tuned software out there.
Furthermore, when people say that AMD Zen 2 has superior IPC, they rarely qualify this statement by saying that it requires tuning or recompiling the software. If that’s a requirement, it should be stated.
Travis Downs says:
Agreed, it is a bias that applies to other software, although I suspect SIMDJson is more highly tuned than the average, so I suggest it applies more in this case.
I don’t know about higher IPC, but when I say something like “Zen 2 has comparable IPC to Skylake” I don’t mean after recompiling. I just draw that conclusion from broad-based tests performed by others, on existing binaries without recompiling.
The IPC relationship between two different uarches isn’t constant across benchmarks so “comparable IPC on average across a range of benchmarks” doesn’t translate to “comparable IPC on every benchmark”. Quite the opposite, I’d expect any given benchmark to show an advantage for one platform or the other since they are not the same.
What you describe matches my expectation but I feel that there is some amount of hype in favor of AMD.
Travis Downs says:
I can’t speak for everyone, but for my part the hype isn’t that Zen 2 has higher IPC than Intel, or that AMD has released a better uarch than Intel, but that AMD has something at least roughly comparable, on average, and is making it available at prices and core counts that undercut Intel by 50% or more.
After years of releasing the Skylake chip under a new name and increasing the price each time, Intel has slashed prices on many of their new chips by half relative to the old lines, and core counts on all parts are suddenly shooting up.
That’s what’s deserving of hype, not big microarchitectural improvements. From a microarchitectural point of view, Zen and Zen 2 are in many ways Skylake (client) clones!
Tests seem quite vague, to be honest; nothing about the specific processors being used, clock speeds, cache sizes, etc.
Specific instruction sets being used: if using AVX2 workloads, Intel would come out on top every time, as it supports 256-bit or 512-bit AVX while Ryzen only supports 128-bit AVX, so in that case, yes, Intel would come out on top, and by quite a bit. Intel designed it, and although AMD can use it due to their licence agreement, incorporating it takes a long time and typically an architecture change; it is not something that can just be added. Clock speed boosts vary quite a lot with load: if the load is short, Intel will boost to maximum clock speed and never stabilise at the lower frequency you would normally see after 30 seconds, which can invalidate the results, along with motherboards that may enable MCE as always on.
AMD don’t do their own instruction sets anymore, as applications are typically geared towards the most common ones: MMX was preferred over AMD 3DNow! even though 3DNow! was more efficient, because Intel, then as now, had the bigger market, and it is not worth investing in a specific instruction set when the devices supporting it are limited.
Ryzen architecture boosts vary quite a bit depending on the background tasks running: it could be 4.6GHz or it could be 4GHz. Due to the nature of the architecture and the way it boosts, it can only maintain that frequency for a short time, less than a second, before the load is moved over to another core. This includes flushing the data from L1 and L2 cache, moving over to the other core, then boosting the new core at a higher frequency to continue the task. This can add latency, but it is quite small as it would typically be within the same CCX.
When people go on about Zen 2 having a higher IPC, they mean apples to apples, i.e. all CPUs run at the same clock speed to see what architectural differences distinguish them at a given clock frequency, without random boost clock speeds, and using more generalised instruction sets like SSE4 to avoid skewing the results. Something may boost randomly due to CPU temps, voltage, current ripple, VRM temps, even ambient temperature. If you can’t set a baseline and have so many variables in your results, then the end result is also useless.
Tests seem quite vague, to be honest; nothing about the specific processors being used, clock speeds, cache sizes, etc.
These are not memory-bound tests. The processor with the highest frequency in these tests is Skylake. Given that memory access is not a significant burden here, and that we report “per cycle” instructions, it is ok not to mention frequency. But if we do, then the Intel Skylake processor is maybe at a disadvantage.
Specific instruction sets being used: if using AVX2 workloads, Intel would come out on top every time, as it supports 256-bit or 512-bit AVX while Ryzen only supports 128-bit AVX, so in that case, yes, Intel would come out on top, and by quite a bit.
No, none of this code uses AVX-512.
If you can’t set a baseline and have so many variables in your results, then the end result is also useless.
I disagree. I can measure reliably how many cycles a computationally intensive task takes. Yes, if there are expensive cache misses, then we have an issue, but it is not the case here.
Archie says:
Those influencers online post stuff without specifying a lot of things. Short boost clocks are extremely important in your test, and that is what Intel CPUs are good at. If you think that only two tests are important, you are wrong again. If there were the huge gap you describe between these CPUs, everybody would notice, and Intel wouldn’t lower their prices to sell.
Short boost clocks are extremely important in your test, and that is what Intel CPUs are good at.
Short boost in the clock frequency would not be relevant. If anything, as Yoav pointed out in another comment, higher frequencies in Intel would mean lower IPC whenever memory latency is at issue.
If you think that only two tests are important, you are wrong again. If there were the huge gap you describe between these CPUs, everybody would notice, and Intel wouldn’t lower their prices to sell.
Please read my post again. I am explicit in stating that I believe AMD probably has better processors than Intel at this point. All I am saying is that we should qualify these statements.
Darien says:
I see the flag -march=native in the Makefile. When these containers are built, which system is used?
In my tests, it is the same binary across systems.
Nathan Kurz says:
I think Darien understands that it is the same binary, and is asking a different question. Since you used “-march=native”, you might get a different compilation depending on whether you generated the binary on AMD or Intel. In theory, this binary might always be faster on the machine it is compiled on than on the opposite machine. In practice, the assembly here is straightforward enough that this is unlikely to be the case. But it’s still a question worth asking, and worth answering.
Frequency is very important when measuring IPC. This is because memory latency doesn’t scale with frequency, so a higher frequency usually means a lower IPC.
Also the memory speed is very important.
@Yoav Memory latency is not the issue here. This being said, the Intel Skylake processor has a higher frequency so if there is any frequency-related bias, it would be favorable to AMD Rome.
Matthew Montgomery says:
These results are really odd. The Zen core has much wider decoding and far more pipelines that can complete instructions. Did you optimize for both systems or just the Skylake one?
I disagree. Neither of my tests is optimized for Skylake or against Zen.
Matthew Montgomery says:
Then you don’t understand microarchitecture. Nehalem has the same core layout and scheduler layout as Skylake and Cannon Lake. If it’s optimized for one, it’s optimized for all of them at the base level. Zen has a very different layout for both core and scheduler.
Skylake has 5 decoders and Zen has 4, but Zen can pack up to 8 instructions for those 4 and Skylake only 6. And that’s just the first example.
simdjson is an open project, and we could use help from someone who can optimize the code for Zen architectures. Please help out.
Even on AMD, it is the fastest JSON parser in existence as far as I know.
Matthew Montgomery says:
After I code my game engine, maybe I’ll pop in and restructure some things. Maybe.
Travis Downs says:
Zen and Skylake are way more similar than either of those are to Nehalem.
In general though compilers favor code that is faster on Intel than AMD.
blah says:
Who’s claiming Zen 2 has beaten Intel on IPC? All the benches I’ve seen put the top mainstream i9-9900 and/or 9700 on top, even against the mighty 3950X. The claim I’ve seen is that AMD has narrowed the IPC gap significantly while destroying Intel on multithreaded tasks by a very, very large margin. It is my understanding that the IPC gap between Zen 2 and 9th gen is small enough that it is better value to go with a Zen 2 for a more robust CPU if you have mixed usage: content creation and gaming, etc.
Zen 2 is not exactly cheap either. Platform costs (X570) and memory costs (higher-bandwidth memory) make it a bit more expensive than an i9. The good news is that you can pop a Zen 2 into an X370 mobo…
It’s an exciting time for the PC market. Competition drives innovation!
The link I offer in my post is one instance where folks claim that Zen 2 has better IPC.
I agree with your comment.
Roland Homoki says:
This article feels so unfinished.
Nothing is specified about the test setup, and answers are always avoided.
What specs (chipset, CPU, RAM, OS)? What settings for CPU and RAM?
Why so specific software and such small selection? Maybe inclusion of actual performance (not IPC) at the same clockspeed.
This blog post is specifically about IPC, not performance (please see the title).
The microarchitectures are specified; RAM and operating systems are not relevant.
Why so specific software and such small selection?
Because this is the software I care about. You will undoubtedly run different code and software. That’s fine.
Ksec says:
Came across this article on HN.
The problem is Guru3D: they are not a technical site, so they got IPC wrong. But IPC in everybody’s (or normal people’s) terms is exactly what they/you describe, work per clock, so in that sense it is right for their target audience. (Maybe it should be called PPC, performance per clock.)
Having said that, even the AMD-biased sites and fans don’t ever claim Zen 2 has better single-core/thread/IPC performance than Intel. Having better IPC/PPC is absolutely not the mainstream sentiment. As a matter of fact, this is the first time I have heard of it, having casually surveyed a dozen tech sites and social media.
Darren Rushworth-Moore says:
See, it’s not really the sites that are at fault, it’s the manufacturers that indicate what the “IPC” uplift is, and this is before clock speeds are taken into account, because they are at a stage in development where clock speeds have not been finalised yet. Thus, sites like Guru3D set a baseline to confirm whether a specific manufacturer is accurate in their assessment. Now, maybe PPC would be a better assessment, but neither AMD nor Intel has tried to diverge from it.
Intel will typically say 2% IPC gain before frequency, as does AMD. While Intel’s typical 2% is within margin of error and people are like meh, no one really cares at that point, some people do indeed test it and do typically see a 2% uplift. They do it like this as both Intel and AMD are competitors, yet at the same time they want people to be interested in their product while not giving away the full performance that frequency contributes to it.
People have been more interested in AMD since Zen first launched with a 52% uplift over Piledriver, then Zen+ 5%, Zen 2 13%, and the upcoming Zen 3 15%, with the percentages from Zen+ onward being compared to first-gen Zen; these are sizeable gains outside the margin of error. Then of course there is Intel and Ice Lake’s 18% IPC uplift. Sites test the devices to determine whether these claims are true; in the past they have been over-inflated, but thus far they have been accurate, at least for the Zen microarchitecture. This is how a majority of people now understand what IPC is.
Keef says:
I’m not necessarily a mega AMD fanboi, and I tend to suspect that the underlying point may still hold up. But this article feels very dubious, since it carefully selected two benchmarks that would specifically be mostly using AVX-512 instructions on Intel, which AMD doesn’t have implemented. I’d like to see the performance difference using only x86-64 instructions with no SSE/AVX.
Since tech sites compare Intel/AMD CPU performance, they are talking about the same binary file (and the same list of instructions to be executed) running on different x86-64-compatible CPUs.
In this specific but very common scenario, for any benchmark with a fixed amount of work and a fixed instruction count, “work per unit of time, normalized by CPU frequency” = “work per instruction” (a constant) × “instructions per cycle”. The two work the same way when comparing architectures.
So I suppose tech sites are not wrong. You and they just use different benchmarks and get different results.
If you present a plot where on the y-axis you claim to present the number of instructions per cycle and you give some other number, you are making a mistake.
For many benchmarks, the number of instructions is not the proper measure.
Furthermore, even if you have the same binary, there is no reason to think that the processors will execute the same instructions. Branch predictors and differences in ISA can trigger different code paths.
ChrisGX says:
From where I stand, work per unit of time normalized by CPU frequency is the only consequential measure of IPC. IPC numbers offered by manufacturers only tell us whether later generations of CPUs achieve higher IPC than earlier generations. That is a relative measure only. It is nice to know that IPC is improving over successive years by this or that percentage, but putting meat on the bones of IPC involves determining what those instructions are worth. You can only discover that by running appropriate benchmarks. And that is why I say that work per unit of time normalized by CPU frequency is the more substantial way of thinking about IPC, whereas manufacturers’ numbers hold less significance.
The last table shows twice the IPC for the Zen 2, which is in contradiction to your conclusion. Did you swap the two values by any chance?
I’m wondering about the same thing. 🙂
I have added my code to the blog post so that it is clearer: I specifically request tzcnt.
So, I don’t think that’s the issue.
Now I am surprised – most vexing. I’ll try to take a closer look at that.
I checked the assembly and tzcnt is generated.
Thanks Travis.
I don’t know why Zen 2 is inferior on this test but it is no conspiracy on my part. It is not doing well.
What model CPUs did you use for your comparison? Cache levels, clock speed, and many other factors play into CPU performance.
Clock speed is not very relevant because these numbers are per-cycle. Cache is not also very relevant since these are not memory bound benchmarks.
“From a microarchitectural point of view, Zen and Zen 2 are in many ways Skylake (client) clones!”
Great quote.
How can I reproduce your results for the second table?
I went into the 2019/05/03 folder, ran make, and ran the resulting ./bitmapdecoding binary, looking at the reported “instructions per cycle” value. I consistently get an IPC of 1.76 or 1.77 for Intel and 1.43 or 1.44 for AMD Zen 2 (on Skylake-X and Rome servers, respectively).
I tried on Skylake server (rather than Skylake-X) and got an IPC of 2.00, which is closer but still not 2.8.
It’s weird there is such a difference between SKL and SKX here.
Some time ago, I revised the post to 2.1 from 2.8. You have access to my Skylake-X box.
Did you collect your results on SKX or SKL?
I think that my dump above is from SKX.
Tests seem quite vague to be honest, nothing about the specific processors being used clock speeds cache sizes etc.
Specific instruction sets being used If using AVX2 workloads Intel would come out on top everytime as it supports AVX 256 or 512 while Ryzen only supports AVX 128 so in that case yes Intel would come out on top and by quite a bit. Intel designed it and although AMD can use it due to their licence agreement but incorporating it takes a long time and typically architecture change and is not something that can just be added. Clock speed boosts vary quite a lot on load, if the load is short Intel will boost to maximum clock speed and never stabilise at the lower frequency after 30 seconds which you would normally see which can invalidate the results along with motherboards that may enable MCE as always on.
AMD don’t do instruction sets anymore as applications typically geared towards the most common MMX was preferred over AMD 3D Now even if 3D now was more efficient but Intel like it is today has a bigger market thus not worth investing in a specific instruction set when the devices are limited.
Ryzen architecture boosts vary quite a bit depending on background tasks running it could be 4.6GHz or it could be 4GHz. Due to the nature of the architecture and the way it boosts it can only maintain that frequency for a short time, less than a second and the load is moved over to another core this includes flushing the data from L1 and L2 cache and move over to the to other core then boost the new core at a higher frequency to continue the task this can add latency but is quite small as it would typically be in the same CCX.
When people go on about Zen 2 having a higher IPC they have it as apples to apples, i.e All CPU’s run at the same clock speed and see what architecture differences there are that distinguish from each at a given clock frequency without random boost clock speeds and use more generalised instruction sets like SSE4 skewing the results. If you have something that boosts randomly due to CPU temps, voltage, current ripple, VRM temps even ambient temperatures. If you can’t set a base line and have so many variables into your results then the end result is also useless
The tests seem quite vague, to be honest: nothing about the specific processors being used, clock speeds, cache sizes, etc.
These are not memory-bound tests. The processor with the highest frequency in these tests is the Skylake. Given that memory access is not a significant burden here, and that we report instructions per cycle, it is fine not to mention frequency. But if we did, the Intel Skylake processor might be the one at a disadvantage.
Specific instruction sets being used If using AVX2 workloads Intel would come out on top everytime as it supports AVX 256 or 512 while Ryzen only supports AVX 128 so in that case yes Intel would come out on top and by quite a bit.
No, none of this code uses AVX-512.
If you can’t set a base line and have so many variables into your results then the end result is also useless
I disagree. I can reliably measure how many cycles a computationally intensive task takes. Yes, if there were expensive cache misses, we would have an issue, but that is not the case here.
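For readers wondering how such per-cycle measurements are done, a sketch using the Linux perf tool (assuming it is installed and counter access is permitted; the counter readings below are invented for illustration):

```shell
# Count instructions retired and cycles for a short compute task;
# perf also prints the ratio as "insn per cycle" (requires Linux perf).
if command -v perf >/dev/null 2>&1; then
  perf stat -e instructions,cycles awk 'BEGIN { for (i = 0; i < 1000000; i++) s += i }'
fi

# IPC is simply the ratio of the two counters; with made-up readings:
awk -v insn=2000000000 -v cyc=800000000 'BEGIN { printf "IPC = %.2f\n", insn/cyc }'
```

Note that frequency never enters the computation: both counters tick with the core clock, which is why the measurement is robust for compute-bound code.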
Those influencers online post stuff without specifying a lot of things. A short boost clock is extremely important in your test, and that is what Intel CPUs are good at. If you think that only two tests are important, you are wrong again. If the gap between these CPUs were as huge as you said, everybody would notice, and Intel wouldn’t be lowering their prices to keep selling.
A short boost clock is extremely important in your test, and that is what Intel CPUs are good at.
A short boost in clock frequency would not be relevant. If anything, as Yoav pointed out in another comment, higher frequencies on Intel would mean lower IPC whenever memory latency is at issue.
If you think that only two tests are important, you are wrong again. If the gap between these CPUs were as huge as you said, everybody would notice, and Intel wouldn’t be lowering their prices to keep selling.
Please read my post again. I am explicit in stating that I believe AMD probably has better processors than Intel at this point. All I am saying is that we should qualify these statements.
I see the flag
-march=native
in the Makefile. When these containers are built, which system is used?
In my tests, it is the same binary across systems.
I think Darien understands that it is the same binary, and is asking a different question. Since you used “-march=native”, you might get a different compilation depending on whether you generated the binary on AMD or Intel. In theory, this binary might always be faster on the machine it is compiled on than on the opposite machine. In practice, the assembly here is straightforward enough that this is unlikely the case. But it’s still a question worth asking, and worth answering.
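One way to check what `-march=native` actually enables on a given build machine is to dump the predefined macros (assuming gcc is the compiler in use; the extension names below are examples):

```shell
# List the ISA-extension macros gcc enables under -march=native on this
# machine (assumes gcc is installed). On a host with AVX2, the AVX2 macro
# will appear; on an older host it will not, so the generated binaries differ.
if command -v gcc >/dev/null 2>&1; then
  gcc -march=native -dM -E - </dev/null | grep -E '__(AVX2|BMI2|SSE4_2)__' || true
fi
```

Running this on both machines and diffing the output would settle whether the AMD and Intel builds could have diverged.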
I agree.
Frequency is very important when measuring IPC, because memory latency doesn’t scale with frequency. So a higher frequency usually means lower IPC.
Also the memory speed is very important.
@Yoav Memory latency is not the issue here. This being said, the Intel Skylake processor has a higher frequency so if there is any frequency-related bias, it would be favorable to AMD Rome.
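Yoav’s point can be made concrete with invented numbers: a memory stall has a fixed latency in nanoseconds, so it wastes more cycles at a higher clock, depressing IPC for memory-bound code.

```shell
# A fixed 80 ns memory stall (made-up latency) costs more cycles as the
# clock rises, so a memory-bound workload shows lower IPC at higher frequency.
for ghz in 3.0 4.0 5.0; do
  awk -v f="$ghz" 'BEGIN { printf "%.1f GHz: 80 ns stall = %.0f cycles\n", f, 80 * f }'
done
```

Since the benchmarks in the post are not memory bound, this effect is small here, but it shows why frequency can bias IPC comparisons in general.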
These results are really odd. The Zen core has much wider decoding and far more pipelines that can complete instructions. Did you optimize for both systems or just the Skylake one?
The simdjson library targets Westmere at the compiler level; the AVX code is written with intrinsics.
So yes, it is purely optimized for Skylake, and likely with optimizations that harm Zen.
I disagree. Neither of my tests is optimized for skylake or against zen.
Then you don’t understand microarchitecture. Nehalem has the same core and scheduler layout as Skylake and Cannon Lake: if code is optimized for one, it is optimized for all of them at the base level. Zen has a very different layout for both the core and the scheduler.
Skylake has 5 decoders and Zen has 4, but Zen can pack up to 8 instructions into those 4 while Skylake manages only 6. And that’s just the first example.
The simdjson library is an open project, and we could use help from someone who can optimize the code for Zen architectures. Please help out.
Even on AMD, it is the fastest JSON parser in existence as far as I know.
After I code my game engine, maybe I’ll pop in and restructure some things. Maybe.
Zen and Skylake are way more similar than either of those are to Nehalem.
In general though compilers favor code that is faster on Intel than AMD.
Who’s claiming Zen 2 has beaten Intel on IPC? All the benchmarks I’ve seen put the top mainstream i9-9900 and/or 9700 on top, even against the mighty 3950X. The claim I’ve seen is that AMD has narrowed the IPC gap significantly while destroying Intel on multithreaded tasks by a very, very large margin. It is my understanding that the IPC gap between Zen 2 and 9th gen is small enough that it’s better value to go with a Zen 2 for a more robust CPU if you have mixed usage: content creation and gaming, etc.
Zen 2 is not exactly cheap either. Platform costs (X570) and memory costs (higher-bandwidth memory) make it a bit more expensive than an i9. The good news is that you can pop a Zen 2 into an X370 mobo…
It’s an exciting time for the PC market. Competition drives innovation!
The link I offer in my post is one instance where folks claim that Zen 2 has better IPC.
I agree with your comment.
This article feels so unfinished.
Nothing is specified about the test setup, and answers are always avoided.
What specs (chipset, CPU, RAM, OS)? What settings for CPU and RAM?
Why such specific software and such a small selection? Maybe include actual performance (not IPC) at the same clock speed.
This blog post is specifically about IPC, not performance (please see the title).
The microarchitectures are specified; RAM and operating systems are not relevant.
Why such specific software and such a small selection?
Because this is the software I care about. You will undoubtedly run different code and software. That’s fine.
I came across this article on HN.
The problem is Guru3D: they are not a technical site, so they got IPC wrong. But IPC in everybody’s (or normal people’s) terms is exactly what you describe, work per clock, so in that sense it is right for their target audience. (Maybe it should be called PPC, performance per clock.)
Having said that, even the AMD-biased sites and fans don’t claim Zen 2 beats Intel’s single-core/single-thread/IPC performance. Better IPC/PPC is absolutely not the mainstream sentiment. As a matter of fact, this is the first time I have heard of it, having casually surveyed a dozen tech sites and social media.
See, it’s not really the sites that are at fault, it’s the manufacturers that indicate what the “IPC” uplift is, and this is before clock speeds are taken into account, because they are at the stage in development where clock speeds have not yet been finalised. Thus sites like Guru3D set a baseline to confirm whether a specific manufacturer is accurate in its assessment. Maybe PPC would be a better term, but neither AMD nor Intel has tried to diverge from “IPC”.
Intel will typically claim a 2% IPC gain before frequency, as does AMD. Intel’s typical 2% is within the margin of error, so people are like “meh”, no one really cares at that point, but some people do test it and typically do see a 2% uplift. They do it like this because Intel and AMD are competitors, yet at the same time they want people to be interested in their product without giving away the full performance that frequency contributes.
People have been more interested in AMD since Zen first launched with a 52% uplift over Piledriver, followed by Zen+ at 5%, Zen 2 at 13%, and the upcoming Zen 3 at 15% (the percentages from Zen+ onward being compared against first-gen Zen); these are sizeable gains, well outside the margin of error. Then of course there is Intel and Ice Lake’s 18% IPC uplift. Sites test the devices to determine whether such claims are true; in the past they have been inflated, but thus far they have been accurate, at least for the Zen microarchitecture. This is how a majority of people now understand IPC.
I’m not necessarily a mega AMD fanboi, and I tend to suspect that the underlying point may still hold up. But this article feels very dubious, since it carefully selected to benchmarks that would specifically be MOSTLY using AVX-512 instructions on Intel, which AMD hasn’t implemented. I’d like to see the performance difference using only x86-64 instructions with no SSE/AVX.
Two*
There is no AVX-512 anywhere. I assure you.
Since tech sites compare Intel/AMD CPU performance, they are talking about the same binary file (and the same list of instructions to be executed) running on different x86-64-compatible CPUs.
In this specific but very common scenario, for any benchmark with a fixed amount of work and a fixed instruction count, “work per unit of time normalized per CPU frequency” is proportional to “instructions per cycle”, with the instruction count of the work as the constant of proportionality. The two measures behave the same way when comparing architectures.
So I suppose tech sites are not wrong. You and they just use different benchmarks and get different results.
If you present a plot where on the y-axis you claim to present the number of instructions per cycle and you give some other number, you are making a mistake.
For many benchmarks, the number of instructions is not the proper metric.
Furthermore, even if you have the same binary, there is no reason to think that the processors will execute the same instructions. Branch predictors and differences in ISA can both trigger different code paths.
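To illustrate the distinction with invented numbers: two builds that finish the same task in the same number of cycles deliver identical performance per clock, yet report very different instructions-per-cycle figures when one of them uses wider instructions.

```shell
# Same task, same cycle count, different instruction counts (e.g. a wide-SIMD
# build retires fewer instructions than a scalar build). Performance per clock
# is identical; instructions per cycle is not. All numbers are invented.
awk 'BEGIN {
  cycles = 1000000000
  printf "SIMD build:   %.2f IPC\n", 500000000 / cycles
  printf "scalar build: %.2f IPC\n", 1500000000 / cycles
}'
```

So a plot labelled “instructions per cycle” must report exactly that ratio, not a frequency-normalized throughput.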
From where I stand, work per unit of time normalized per CPU frequency is the only consequential measure of IPC. The IPC numbers offered by manufacturers only tell us whether later generations of CPUs achieve higher IPC than earlier generations; that is a relative measure only. It is nice to know that IPC is improving over successive years by this or that percentage, but putting meat on the bones of IPC involves determining what those instructions are worth, and you can only discover that by running appropriate benchmarks. That is why I say that work per unit of time normalized per CPU frequency is the more substantial way of thinking about IPC, whereas the manufacturers’ numbers hold less significance.