Yes, I entered the numbers in reverse; this has been fixed.
Nigel Horspool says:
What is really making Intel nervous is that at each price point, the AMD processor has many more cores than the corresponding Intel processor AND consumes less power (an important issue for server farms).
Nigel: I am quite excited about AMD being back in the race…
Jens Nurmann says:
The second one seems to be a frustrating example of “implementational divergence due to instruction set bloat” on the AMD side IMHO. Looking up instruction throughput on Zen 2 one finds
bsf / bsr – 3 / 4 cycles on r64
tzcnt / lzcnt – 0.5 / 1 cycles on r64
I’d assume that the compiler generates bsf in your benchmark – if it is the one you presented some time ago. So I am surprised that this is “only” 1/2 of Intel IPC for AMD in this case. Replacing bsf with tzcnt might reverse the situation.
Benjamin says:
Couple of open questions:
– were the Spectre and following mitigations applied on both rigs? That can go a long way toward explaining differences in the ~10–15% range, but not a 2x factor, of course
– if the build is CPU specific, counting instructions seems like a weird way to measure performance, since, as you mentioned, some instructions are a lot wider than others. By this metric, an AVX-512 build of a given benchmark could give pretty bad results when compared to an SSE build (which is not true with any metric that actually counts in that case, like throughput or perf/watt)
– if the build is not CPU specific, counting instruction throughput is only interesting if this is a close-enough-to-optimal build for both, IMO. One could imagine a CPU which is very good at extracting ILP from low-performance builds, which would be a nice skill but could be useless in an HPC context, for instance
– you keep mentioning an “old Intel CPU”, but Skylake is basically the only available architecture for anything but some thin laptops. So it’s both “old” and “current”, which contributes to making AMD competitive

This being said, I agree with your initial point that “better IPC” claims are not really qualified. I guess the implicit meaning is “getting more work done per clock cycle”.
Benjamin says:
another point is that you discard benchmarks which are memory bound, but that goes against some other tests that you did concerning memory-request parallelism, for instance. Extracting good IPC in memory-starved contexts is also meaningful, right?
In memory starved contexts, the number of instructions being retired is probably not the measure you care about. Instead, you might want to report the effective bandwidth or something that has to do with the actual bottleneck.
Benjamin says:
I disagree on that; it is my understanding that the IPC that manages to go through would be a good proxy for the job being done, despite the bottleneck. This whole “job being done” notion is, I think, the logic behind most of the “IPC” claims around.
We can reason about IPC for instruction-dense code. We know what 4.0 instructions per cycle means: it is great. For instruction-dense code, 1.0 is going to be mediocre. Basically, we have a measure of how superscalar (wide) the processor is. Achieving 6 instructions per cycle in real code would be fantastic.
For memory-bound problems, what would be a good IPC?… is 0.1 instructions per cycle good or bad? I can’t reason about it. I have some idea of what a bandwidth of 10 GB/s in random access means (it is very good).
In the simdjson benchmarks above, the builds are not CPU specific. All CPUs run almost entirely the same instructions. So yes, the number of instructions retired per cycle follows closely the performance per cycle. On a per-cycle basis, in this AVX2-intensive benchmark, AMD falls behind Intel in every way.
you keep mentioning an “old Intel CPU”, but Skylake is basically the only available architecture for anything but some thin laptops. So it’s both “old” and “current”, which contributes to making AMD competitive
That is true.
Benjamin says:
In the simdjson benchmarks above, the builds are not CPU specific. All CPUs run almost entirely the same instructions. So yes, the number of instructions retired per cycle follows closely the performance per cycle. Changing the builds (for instance -O3 vs vanilla) would change the instruction mix and throughput, all other things (task and hardware) being equal. So the correct quote is “for a given build, the number of instructions retired per cycle follows closely the performance per cycle.”, which may or may not be a good proxy for absolute performance (see AVX512 for instance)
True. There is just one x64 build here, same binary throughout.
RGRHON says:
I’m just a lowly programmer, but the speed of a single gcc compile seems irrelevant to me. Both processors perform small tasks in the blink of an eye, so whether that blink is 10ms or 12ms isn’t usually very significant to me. Small compiles are pretty equivalent in terms of time required for gcc, and trying to extrapolate those small compile results on a single core to a large number of cores is missing the point. What is relevant is that if you use a parallel make (make -j) with a large number of source files and cores, like 24, 32, or 64, the AMD processor will usually beat the pants off a processor with a lower number of cores. Same with rendering and many other long tasks. That’s significant to me in my wall-clock development time. Sure, sometimes the AMD may be a bit slower on small tasks, but small tasks don’t take long, so a single TR core is usually only fractionally slower and works fine for small tasks anyway. TR is fast enough to game on my PC with maximum settings on almost every title at 1440p and above. Not that I game much, but it’s fine. Same for analytical graphics; they just don’t take that long on today’s Nvidia 3000-series GPUs. I’ll admit that there are some tasks where bleeding-edge core speed is important, but I’ve noticed I don’t do those things as often as I compile, for example.
Travis Downs says:
Keep in mind that this project (simdjson) was extensively tuned on Intel machines and then just incidentally run on AMD as a comparison. Many choices made based on benchmark results might have gone a different way on the AMD machine, so Intel-specific quirks get built in this way.
I’m not saying it would reverse the conclusion in this case, but it’s something to remember when testing something that has been carefully tuned.
Keep in mind that this project (simdjson) was extensively tuned on Intel machines and then just incidentally run on AMD as a comparison.
That’s true, so it is a bias but I submit to you that the same bias exists on highly tuned software out there.
Furthermore, when people say that AMD Zen 2 has superior IPC, they rarely qualify this statement by saying that it requires tuning or recompiling the software. If that’s a requirement, it should be stated.
Travis Downs says:
Agreed, it is a bias that applies to other software, although I suspect SIMDJson is more highly tuned than the average, so I suggest it applies more in this case.
I don’t know about higher IPC, but when I say something like “Zen 2 has comparable IPC to Skylake” I don’t mean after recompiling. I just draw that conclusion from broad-based tests performed by others, on existing binaries without recompiling.
The IPC relationship between two different uarches isn’t constant across benchmarks so “comparable IPC on average across a range of benchmarks” doesn’t translate to “comparable IPC on every benchmark”. Quite the opposite, I’d expect any given benchmark to show an advantage for one platform or the other since they are not the same.
What you describe matches my expectation but I feel that there is some amount of hype in favor of AMD.
Travis Downs says:
I can’t speak for everyone, but for my part the hype isn’t that Zen 2 has higher IPC than Intel, or that AMD has released a better uarch than Intel, but that AMD has something at least roughly comparable, on average, and is making it available at prices and core counts that undercut Intel by 50% or more.
After years of releasing the Skylake chip under a new name and increasing the price each time, Intel has slashed prices on many of their new chips by half relative to the old lines, and core counts on all parts are suddenly shooting up.
That’s what’s deserving of hype, not big microarchitectural improvements. From a microarchitectural point of view, Zen and Zen 2 are in many ways Skylake (client) clones!
Tests seem quite vague, to be honest; nothing about the specific processors being used, clock speeds, cache sizes, etc.
Specific instruction sets being used: if using AVX2 workloads, Intel would come out on top every time, as it supports 256-bit or 512-bit AVX while Ryzen only supports 128-bit AVX, so in that case, yes, Intel would come out on top, and by quite a bit. Intel designed it, and although AMD can use it due to their licence agreement, incorporating it takes a long time and typically an architecture change; it is not something that can just be added. Clock speed boosts vary quite a lot with load: if the load is short, Intel will boost to maximum clock speed and never stabilise at the lower frequency you would normally see after 30 seconds, which can invalidate the results, along with motherboards that may enable MCE as always on.
AMD don’t do their own instruction sets anymore, as applications are typically geared towards the most common ones: MMX was preferred over AMD 3DNow! even though 3DNow! was more efficient, because Intel, then as now, had the bigger market, and it is not worth investing in a specific instruction set when the devices supporting it are limited.
Ryzen architecture boosts vary quite a bit depending on the background tasks running: it could be 4.6GHz or it could be 4GHz. Due to the nature of the architecture and the way it boosts, it can only maintain that frequency for a short time, less than a second, before the load is moved over to another core. This includes flushing the data from L1 and L2 cache, moving over to the other core, then boosting the new core at a higher frequency to continue the task. This can add latency, but it is quite small as it would typically be within the same CCX.
When people go on about Zen 2 having a higher IPC, they mean apples to apples, i.e. all CPUs run at the same clock speed to see what architectural differences distinguish them at a given clock frequency, without random boost clock speeds, and using more generalised instruction sets like SSE4 to avoid skewing the results. Something may boost randomly due to CPU temps, voltage, current ripple, VRM temps, even ambient temperature. If you can’t set a baseline and have so many variables in your results, then the end result is also useless.
Tests seem quite vague, to be honest; nothing about the specific processors being used, clock speeds, cache sizes, etc.
These are not memory-bound tests. The processor with the highest frequency in these tests is Skylake. Given that memory access is not a significant burden here, and that we report “per cycle” instructions, it is ok not to mention frequency. But if we do, then the Intel Skylake processor is maybe at a disadvantage.
Specific instruction sets being used: if using AVX2 workloads, Intel would come out on top every time, as it supports 256-bit or 512-bit AVX while Ryzen only supports 128-bit AVX, so in that case, yes, Intel would come out on top, and by quite a bit.
No, none of this code uses AVX-512.
If you can’t set a baseline and have so many variables in your results, then the end result is also useless.
I disagree. I can measure reliably how many cycles a computationally intensive task takes. Yes, if there are expensive cache misses, then we have an issue, but it is not the case here.
Archie says:
Those influencers online post stuff without specifying a lot of things. Short boost clocks are extremely important in your test, and that is what Intel CPUs are good at. If you think that only two tests are important, you are wrong again. If there were the huge gap you describe between these CPUs, everybody would notice, and Intel wouldn’t lower their prices to sell.
Short boost clocks are extremely important in your test, and that is what Intel CPUs are good at.
Short boost in the clock frequency would not be relevant. If anything, as Yoav pointed out in another comment, higher frequencies in Intel would mean lower IPC whenever memory latency is at issue.
If you think that only two tests are important, you are wrong again. If there were the huge gap you describe between these CPUs, everybody would notice, and Intel wouldn’t lower their prices to sell.
Please read my post again. I am explicit in stating that I believe AMD probably has better processors than Intel at this point. All I am saying is that we should qualify these statements.
Darien says:
I see the flag -march=native in the Makefile. When these containers are built, which system is used?
In my tests, it is the same binary across systems.
Nathan Kurz says:
I think Darien understands that it is the same binary, and is asking a different question. Since you used “-march=native”, you might get a different compilation depending on whether you generated the binary on AMD or Intel. In theory, this binary might always be faster on the machine it is compiled on than on the opposite machine. In practice, the assembly here is straightforward enough that this is unlikely to be the case. But it’s still a question worth asking, and worth answering.
Frequency is very important when measuring IPC. This is because memory latency doesn’t scale with frequency, so a higher frequency usually means a lower IPC.
Also the memory speed is very important.
@Yoav Memory latency is not the issue here. This being said, the Intel Skylake processor has a higher frequency so if there is any frequency-related bias, it would be favorable to AMD Rome.
Matthew Montgomery says:
These results are really odd. The Zen core has much wider decoding and far more pipelines that can complete instructions. Did you optimize for both systems or just the Skylake one?
I disagree. Neither of my tests is optimized for Skylake or against Zen.
Matthew Montgomery says:
Then you don’t understand microarchitecture. Nehalem has the same core layout and scheduler layout as Skylake and Cannon Lake. If it’s optimized for one, it’s optimized for all of them at the base level. Zen has a very different layout for both core and scheduler.
Skylake has 5 decoders and Zen has 4, but Zen can pack up to 8 instructions for those 4 and Skylake only 6. And that’s just the first example.
simdjson is an open project, and we could use help from someone who can optimize the code for Zen architectures. Please help out.
Even on AMD, it is the fastest JSON parser in existence as far as I know.
Matthew Montgomery says:
After I code my game engine, maybe I’ll pop in and restructure some things. Maybe.
Travis Downs says:
Zen and Skylake are way more similar than either of those are to Nehalem.
In general though compilers favor code that is faster on Intel than AMD.
blah says:
Who’s claiming Zen 2 has beaten Intel on IPC? All the benches I’ve seen put the top mainstream i9-9900 and/or 9700 on top, even against the mighty 3950X. The claim I’ve seen is that AMD has narrowed the IPC gap significantly while destroying Intel on multithreaded tasks by a very, very large margin. It is my understanding that the IPC gap between Zen 2 and 9th gen is small enough that it is better value to go with a Zen 2 for a more robust CPU if you have mixed usage: content creation and gaming, etc.
Zen 2 is not exactly cheap either. Platform costs (X570) and memory costs (higher-bandwidth memory) make it a bit more expensive than an i9. The good news is that you can pop a Zen 2 into an X370 mobo…
It’s an exciting time for the PC market. Competition drives innovation!
The link I offer in my post is one instance where folks claim that Zen 2 has better IPC.
I agree with your comment.
Roland Homoki says:
This article feels so unfinished.
Nothing is specified about the test setup, and answers are always avoided.
What specs (chipset, CPU, RAM, OS)? What settings for CPU and RAM?
Why so specific software and such small selection? Maybe inclusion of actual performance (not IPC) at the same clockspeed.
This blog post is specifically about IPC, not performance (please see the title).
The microarchitectures are specified; RAM and operating systems are not relevant.
Why so specific software and such small selection?
Because this is the software I care about. You will undoubtedly run different code and software. That’s fine.
Ksec says:
Came across this article on HN.
The problem is Guru3D: they are not a technical site, so they got IPC wrong. But IPC in everybody’s (or normal people’s) terms is exactly what they/you describe, work per clock, so in that sense it is right for their target audience. (Maybe it should be called PPC, performance per clock.)
Having said that, even the AMD-biased sites and fans don’t ever claim Zen 2 has better single-core/thread/IPC performance than Intel. Having better IPC/PPC is absolutely not the mainstream sentiment. As a matter of fact, this is the first time I have heard of it, having casually surveyed a dozen tech sites and social media.
Darren Rushworth-Moore says:
See, it’s not really the sites that are at fault, it’s the manufacturers that indicate what the “IPC” uplift is, and this is before clock speeds are taken into account, because they are at a stage in development where clock speeds have not been finalised yet. Thus, sites like Guru3D set a baseline to confirm whether a specific manufacturer is accurate in their assessment. Now, maybe PPC would be a better assessment, but neither AMD nor Intel has tried to diverge from it.
Intel will typically say 2% IPC gain before frequency, as does AMD. While Intel’s typical 2% is within margin of error and people are like meh, no one really cares at that point, some people do indeed test it and do typically see a 2% uplift. They do it like this as both Intel and AMD are competitors, yet at the same time they want people to be interested in their product while not giving away the full performance that frequency contributes to it.
People have been more interested in AMD since Zen first launched with a 52% uplift over Piledriver, then Zen+ 5%, Zen 2 13%, and the upcoming Zen 3 15%, with the percentages from Zen+ onward being compared to first-gen Zen; these are sizeable gains outside the margin of error. Then of course there is Intel and Ice Lake’s 18% IPC uplift. Sites test the devices to determine whether these claims are true; in the past they have been over-inflated, but thus far they have been accurate, at least for the Zen microarchitecture. This is how a majority of people now understand what IPC is.
Keef says:
I’m not necessarily a mega AMD fanboi, and I tend to suspect that the underlying point may still hold up. But this article feels very dubious, since it carefully selected two benchmarks that would specifically be mostly using AVX-512 instructions on Intel, which AMD doesn’t have implemented. I’d like to see the performance difference using only x86-64 instructions with no SSE/AVX.
Since tech sites compare Intel/AMD CPU performance, they are talking about the same binary file (and the same list of instructions to be executed) running on different x86-64-compatible CPUs.
In this specific but very common scenario, for any benchmark with a fixed amount of work and a fixed instruction count, “work per unit of time, normalized by CPU frequency” = “work per instruction” (a constant) × “instructions per cycle”. The two work the same way when comparing architectures.
So I suppose tech sites are not wrong. You and they just use different benchmarks and get different results.
If you present a plot where on the y-axis you claim to present the number of instructions per cycle and you give some other number, you are making a mistake.
For many benchmarks, the number of instructions is not the proper measure.
Furthermore, even if you have the same binary, there is no reason to think that the processors will execute the same instructions. Branch predictors and differences in ISA can trigger different code paths.
ChrisGX says:
From where I stand, work per unit of time normalized by CPU frequency is the only consequential measure of IPC. IPC numbers offered by manufacturers only tell us whether later generations of CPUs achieve higher IPC than earlier generations. That is a relative measure only. It is nice to know that IPC is improving over successive years by this or that percentage, but putting meat on the bones of IPC involves determining what those instructions are worth. You can only discover that by running appropriate benchmarks. And that is why I say that work per unit of time normalized by CPU frequency is the more substantial way of thinking about IPC, whereas manufacturers’ numbers hold less significance.
The last table shows twice the IPC for the Zen 2, which is in contradiction to your conclusion. Did you swap the two values by any chance?
I’m wondering about the same thing. 🙂
I have added my code to the blog post so that it is clearer: I specifically request tzcnt.
So, I don’t think that’s the issue.
Now I am surprised – most vexing. I’ll try to take a closer look at that.
I checked the assembly and tzcnt is generated.
Thanks Travis.
I don’t know why Zen 2 is inferior on this test but it is no conspiracy on my part. It is not doing well.
What model CPUs did you use for your comparison? Cache levels, clock speed, and many other factors play into CPU performance.
Clock speed is not very relevant because these numbers are per-cycle. Cache is not also very relevant since these are not memory bound benchmarks.
“From a microarchitectural point of view, Zen and Zen 2 are in many ways Skylake (client) clones!”
Great quote.
How can I reproduce your results for the second table?
I went into the 2019/05/03 folder, ran make, and ran the resulting ./bitmapdecoding binary, looking at the reported “instructions per cycle” value. I consistently get an IPC of 1.76 or 1.77 for Intel and 1.43 or 1.44 for AMD Zen 2 (on Skylake-X and Rome servers, respectively).
I tried on Skylake server (rather than Skylake-X) and got an IPC of 2.00, which is closer but still not 2.8.
It’s weird there is such a difference between SKL and SKX here.
Some time ago, I revised the post to 2.1 from 2.8. You have access to my Skylake-X box.
Did you collect your results on SKX or SKL?
I think that my dump above is from SKX.
Tests seem quite vague to be honest, nothing about the specific processors being used clock speeds cache sizes etc.
Specific instruction sets being used If using AVX2 workloads Intel would come out on top everytime as it supports AVX 256 or 512 while Ryzen only supports AVX 128 so in that case yes Intel would come out on top and by quite a bit. Intel designed it and although AMD can use it due to their licence agreement but incorporating it takes a long time and typically architecture change and is not something that can just be added. Clock speed boosts vary quite a lot on load, if the load is short Intel will boost to maximum clock speed and never stabilise at the lower frequency after 30 seconds which you would normally see which can invalidate the results along with motherboards that may enable MCE as always on.
AMD don’t do instruction sets anymore as applications typically geared towards the most common MMX was preferred over AMD 3D Now even if 3D now was more efficient but Intel like it is today has a bigger market thus not worth investing in a specific instruction set when the devices are limited.
Ryzen architecture boosts vary quite a bit depending on background tasks running it could be 4.6GHz or it could be 4GHz. Due to the nature of the architecture and the way it boosts it can only maintain that frequency for a short time, less than a second and the load is moved over to another core this includes flushing the data from L1 and L2 cache and move over to the to other core then boost the new core at a higher frequency to continue the task this can add latency but is quite small as it would typically be in the same CCX.
When people go on about Zen 2 having a higher IPC they have it as apples to apples, i.e All CPU’s run at the same clock speed and see what architecture differences there are that distinguish from each at a given clock frequency without random boost clock speeds and use more generalised instruction sets like SSE4 skewing the results. If you have something that boosts randomly due to CPU temps, voltage, current ripple, VRM temps even ambient temperatures. If you can’t set a base line and have so many variables into your results then the end result is also useless
The tests seem quite vague, to be honest: nothing about the specific processors being used, clock speeds, cache sizes, etc.
These are not memory-bound tests. The processor with the highest frequency in these tests is the Skylake. Given that memory access is not a significant burden here, and that we report instructions per cycle, it is fine not to mention frequency. But if we did, the Intel Skylake processor might be the one at a disadvantage.
Specific instruction sets being used If using AVX2 workloads Intel would come out on top everytime as it supports AVX 256 or 512 while Ryzen only supports AVX 128 so in that case yes Intel would come out on top and by quite a bit.
No, none of this code uses AVX-512.
If you can’t set a base line and have so many variables into your results then the end result is also useless
I disagree. I can reliably measure how many cycles a computationally intensive task takes. Yes, if there were expensive cache misses, we would have an issue, but that is not the case here.
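For readers wondering how such per-cycle measurements are done, a sketch using the Linux perf tool (assuming it is installed and counter access is permitted; the counter readings below are invented for illustration):

```shell
# Count instructions retired and cycles for a short compute task;
# perf also prints the ratio as "insn per cycle" (requires Linux perf).
if command -v perf >/dev/null 2>&1; then
  perf stat -e instructions,cycles awk 'BEGIN { for (i = 0; i < 1000000; i++) s += i }'
fi

# IPC is simply the ratio of the two counters; with made-up readings:
awk -v insn=2000000000 -v cyc=800000000 'BEGIN { printf "IPC = %.2f\n", insn/cyc }'
```

Note that frequency never enters the computation: both counters tick with the core clock, which is why the measurement is robust for compute-bound code.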
Those influencers online post stuff without specifying a lot of things. A short boost clock is extremely important in your test, and that is what Intel CPUs are good at. If you think that only two tests are important, you are wrong again. If the gap between these CPUs were as huge as you said, everybody would notice, and Intel wouldn’t be lowering their prices to keep selling.
A short boost clock is extremely important in your test, and that is what Intel CPUs are good at.
A short boost in clock frequency would not be relevant. If anything, as Yoav pointed out in another comment, higher frequencies on Intel would mean lower IPC whenever memory latency is at issue.
If you think that only two tests are important, you are wrong again. If the gap between these CPUs were as huge as you said, everybody would notice, and Intel wouldn’t be lowering their prices to keep selling.
Please read my post again. I am explicit in stating that I believe AMD probably has better processors than Intel at this point. All I am saying is that we should qualify these statements.
I see the flag
-march=native
in the Makefile. When these containers are built, which system is used?
In my tests, it is the same binary across systems.
I think Darien understands that it is the same binary, and is asking a different question. Since you used “-march=native”, you might get a different compilation depending on whether you generated the binary on AMD or Intel. In theory, this binary might always be faster on the machine it is compiled on than on the opposite machine. In practice, the assembly here is straightforward enough that this is unlikely the case. But it’s still a question worth asking, and worth answering.
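One way to check what `-march=native` actually enables on a given build machine is to dump the predefined macros (assuming gcc is the compiler in use; the extension names below are examples):

```shell
# List the ISA-extension macros gcc enables under -march=native on this
# machine (assumes gcc is installed). On a host with AVX2, the AVX2 macro
# will appear; on an older host it will not, so the generated binaries differ.
if command -v gcc >/dev/null 2>&1; then
  gcc -march=native -dM -E - </dev/null | grep -E '__(AVX2|BMI2|SSE4_2)__' || true
fi
```

Running this on both machines and diffing the output would settle whether the AMD and Intel builds could have diverged.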
I agree.
Frequency is very important when measuring IPC, because memory latency doesn’t scale with frequency. So a higher frequency usually means lower IPC.
Also the memory speed is very important.
@Yoav Memory latency is not the issue here. This being said, the Intel Skylake processor has a higher frequency so if there is any frequency-related bias, it would be favorable to AMD Rome.
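Yoav’s point can be made concrete with invented numbers: a memory stall has a fixed latency in nanoseconds, so it wastes more cycles at a higher clock, depressing IPC for memory-bound code.

```shell
# A fixed 80 ns memory stall (made-up latency) costs more cycles as the
# clock rises, so a memory-bound workload shows lower IPC at higher frequency.
for ghz in 3.0 4.0 5.0; do
  awk -v f="$ghz" 'BEGIN { printf "%.1f GHz: 80 ns stall = %.0f cycles\n", f, 80 * f }'
done
```

Since the benchmarks in the post are not memory bound, this effect is small here, but it shows why frequency can bias IPC comparisons in general.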
These results are really odd. The Zen core has much wider decoding and far more pipelines that can complete instructions. Did you optimize for both systems or just the Skylake one?
The simdjson library targets Westmere at the compiler level; the AVX code is written with intrinsics.
So yes, it is purely optimized for Skylake, and likely with optimizations that harm Zen.
I disagree. Neither of my tests is optimized for skylake or against zen.
Then you don’t understand microarchitecture. Nehalem has the same core and scheduler layout as Skylake and Cannon Lake: if code is optimized for one, it is optimized for all of them at the base level. Zen has a very different layout for both the core and the scheduler.
Skylake has 5 decoders and Zen has 4, but Zen can pack up to 8 instructions into those 4 while Skylake manages only 6. And that’s just the first example.
The simdjson library is an open project, and we could use help from someone who can optimize the code for Zen architectures. Please help out.
Even on AMD, it is the fastest JSON parser in existence as far as I know.
After I code my game engine, maybe I’ll pop in and restructure some things. Maybe.
Zen and Skylake are way more similar than either of those are to Nehalem.
In general though compilers favor code that is faster on Intel than AMD.
Who’s claiming Zen 2 has beaten Intel on IPC? All the benchmarks I’ve seen put the top mainstream i9-9900 and/or 9700 on top, even against the mighty 3950X. The claim I’ve seen is that AMD has narrowed the IPC gap significantly while destroying Intel on multithreaded tasks by a very, very large margin. It is my understanding that the IPC gap between Zen 2 and 9th gen is small enough that it’s better value to go with a Zen 2 for a more robust CPU if you have mixed usage: content creation and gaming, etc.
Zen 2 is not exactly cheap either. Platform costs (X570) and memory costs (higher-bandwidth memory) make it a bit more expensive than an i9. The good news is that you can pop a Zen 2 into an X370 mobo…
It’s an exciting time for the PC market. Competition drives innovation!
The link I offer in my post is one instance where folks claim that Zen 2 has better IPC.
I agree with your comment.
This article feels so unfinished.
Nothing is specified about the test setup, and answers are always avoided.
What specs (chipset, CPU, RAM, OS)? What settings for CPU and RAM?
Why such specific software and such a small selection? Maybe include actual performance (not IPC) at the same clock speed.
This blog post is specifically about IPC, not performance (please see the title).
The microarchitectures are specified; RAM and operating systems are not relevant.
Why such specific software and such a small selection?
Because this is the software I care about. You will undoubtedly run different code and software. That’s fine.
I came across this article on HN.
The problem is Guru3D: they are not a technical site, so they got IPC wrong. But IPC in everybody’s (or normal people’s) terms is exactly what you describe, work per clock, so in that sense it is right for their target audience. (Maybe it should be called PPC, performance per clock.)
Having said that, even the AMD-biased sites and fans don’t claim Zen 2 beats Intel’s single-core/single-thread/IPC performance. Better IPC/PPC is absolutely not the mainstream sentiment. As a matter of fact, this is the first time I have heard of it, having casually surveyed a dozen tech sites and social media.
See, it’s not really the sites that are at fault, it’s the manufacturers that indicate what the “IPC” uplift is, and this is before clock speeds are taken into account, because they are at the stage in development where clock speeds have not yet been finalised. Thus sites like Guru3D set a baseline to confirm whether a specific manufacturer is accurate in its assessment. Maybe PPC would be a better term, but neither AMD nor Intel has tried to diverge from “IPC”.
Intel will typically claim a 2% IPC gain before frequency, as does AMD. Intel’s typical 2% is within the margin of error, so people are like “meh”, no one really cares at that point, but some people do test it and typically do see a 2% uplift. They do it like this because Intel and AMD are competitors, yet at the same time they want people to be interested in their product without giving away the full performance that frequency contributes.
People have been more interested in AMD since Zen first launched with a 52% uplift over Piledriver, followed by Zen+ at 5%, Zen 2 at 13%, and the upcoming Zen 3 at 15% (the percentages from Zen+ onward being compared against first-gen Zen); these are sizeable gains, well outside the margin of error. Then of course there is Intel and Ice Lake’s 18% IPC uplift. Sites test the devices to determine whether such claims are true; in the past they have been inflated, but thus far they have been accurate, at least for the Zen microarchitecture. This is how a majority of people now understand IPC.
I’m not necessarily a mega AMD fanboi, and I tend to suspect that the underlying point may still hold up. But this article feels very dubious, since it carefully selected to benchmarks that would specifically be MOSTLY using AVX-512 instructions on Intel, which AMD hasn’t implemented. I’d like to see the performance difference using only x86-64 instructions with no SSE/AVX.
Two*
There is no AVX-512 anywhere. I assure you.
Since tech sites compare Intel/AMD CPU performance, they are talking about the same binary file (and the same list of instructions to be executed) running on different x86-64-compatible CPUs.
In this specific but very common scenario, for any benchmark with a fixed amount of work and a fixed instruction count, “work per unit of time normalized per CPU frequency” is proportional to “instructions per cycle”, with the instruction count of the work as the constant of proportionality. The two measures behave the same way when comparing architectures.
So I suppose tech sites are not wrong. You and they just use different benchmarks and get different results.
If you present a plot where on the y-axis you claim to present the number of instructions per cycle and you give some other number, you are making a mistake.
For many benchmarks, the number of instructions is not the proper metric.
Furthermore, even if you have the same binary, there is no reason to think that the processors will execute the same instructions. Branch predictors and differences in ISA can both trigger different code paths.
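To illustrate the distinction with invented numbers: two builds that finish the same task in the same number of cycles deliver identical performance per clock, yet report very different instructions-per-cycle figures when one of them uses wider instructions.

```shell
# Same task, same cycle count, different instruction counts (e.g. a wide-SIMD
# build retires fewer instructions than a scalar build). Performance per clock
# is identical; instructions per cycle is not. All numbers are invented.
awk 'BEGIN {
  cycles = 1000000000
  printf "SIMD build:   %.2f IPC\n", 500000000 / cycles
  printf "scalar build: %.2f IPC\n", 1500000000 / cycles
}'
```

So a plot labelled “instructions per cycle” must report exactly that ratio, not a frequency-normalized throughput.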
From where I stand, work per unit of time normalized per CPU frequency is the only consequential measure of IPC. The IPC numbers offered by manufacturers only tell us whether later generations of CPUs achieve higher IPC than earlier generations; that is a relative measure only. It is nice to know that IPC is improving over successive years by this or that percentage, but putting meat on the bones of IPC involves determining what those instructions are worth, and you can only discover that by running appropriate benchmarks. That is why I say that work per unit of time normalized per CPU frequency is the more substantial way of thinking about IPC, whereas the manufacturers’ numbers hold less significance.