11th December 2020, 30 min read

ARM MacBook vs Intel MacBook

36 thoughts on “ARM MacBook vs Intel MacBook”

Leif says:

December 12, 2020 at 2:03 am

I would try to use debug tools to generate flame graphs, or river diagrams, of where each algorithm is spending its time. Something like this example.

That might provide some insight into commonalities and differences in the underlying libraries and functions.
Michael says:

December 12, 2020 at 2:35 am

You write that “[t]he Intel processor has nifty 256-bit SIMD instructions. The Apple chip has nothing of the sort as part of its main CPU.”

The M1, like most modern ARM v8 CPUs, uses the NEON SIMD extension. The M1 has four 128-bit NEON pipelines, see the AnandTech overview.

So the SIMD unit in the M1 is only half as wide as on current x86-64 CPUs, but “nothing of the sort” sounds a bit extreme…
1. Daniel Lemire says:
  
  December 12, 2020 at 2:47 pm
  
  I am aware of NEON, but it is no match for AVX2 in general. Doubling the register width makes a big difference, at least in some cases.
  1. Royi says:
    
    December 12, 2020 at 7:44 pm
    
    I think in that regard they are on par.
    Per core, Intel usually has 2 ports for 256-bit operations, so in total it works on 512 bits of data (I am not talking about the CPUs with AVX-512; I’m talking about the Skylake-derived CPUs).
    
    The M1 has 4 units of 128 bits each. In total it is also 512.
    Since it has a much wider decode front end, it won’t be hurt by not having a 256-bit operation in a single op.
    1. Daniel Lemire says:
      
      December 12, 2020 at 8:51 pm
      
      Do you have benchmark numbers of a comparison between AVX2 on a recent x64 processor (Intel/AMD) and the equivalent on ARM NEON?
      1. Royi says:
        
        December 13, 2020 at 8:04 am
        
        What about the SpecFP in the Anandtech review?
        I’d guess Clang will generate in many cases vectorized code so you’ll be able to see.
        
        But since you have the hardware, why not give it a try?
        
        Andrei F says:
        
        December 13, 2020 at 1:51 pm
        
        Daniel’s background stance on this type of benchmarking surrounds software with heavy usage of intrinsics and optimised routines.
        
        While the compiler will spit out some SIMD here and there where it can, SPECfp uses general use-case code without such hand-crafted vectorisation, and as such the performance uplift and impact is very minor.
      2. Andrei F says:
        
        December 13, 2020 at 1:48 pm
        
        How can you claim NEON is no match for AVX2 and then ask for performance numbers? That’s a pretty irresponsible stance.
        
        Vector size is irrelevant to the performance discussion because each µarch will be optimised around their particular setup. The total execution throughput of the M1 isn’t any less than that of your Kaby Lake chip – which is what matters.
        
        As others have noted, there’s plenty of NEON optimised software out there and it runs perfectly fine.
        
        You can even try something as simple as a portability layer to run your own benchmarks of your own AVX2 packages:
        
        https://simd-everywhere.github.io/blog/2020/06/22/transitioning-to-arm-with-simde.html
        
        For the vast majority of cases NEON should be functionally equivalent to AVX.
        
        Daniel Lemire says:
        
        December 13, 2020 at 4:54 pm
        
        “How can you claim NEON is no match for AVX2 and then ask for performance numbers? That’s a pretty irresponsible stance.”
        
        I don’t think it is irresponsible to ask for performance numbers. I do not like to argue in the abstract.
        
        Daniel Lemire says:
        
        December 13, 2020 at 6:31 pm
        
        See my post ARM MacBook vs Intel MacBook: a SIMD benchmark
        
        Daniel Lemire says:
        
        December 13, 2020 at 11:01 pm
        
        BTW I was wrong. Not wrong to ask for benchmarks, but wrong in the belief that the M1 would not match AVX2.
    2. -.- says:
      
      December 12, 2020 at 10:56 pm
      
      Intel CPUs have 3x 256-bit ports, not 2x. Take note that wider SIMD doesn’t only affect the EUs, it’ll help with increasing effective PRF size, load/store etc.
      
      Of course, not all EUs support all operations, but I have no clue what the distribution is like on M1.
      1. Royi says:
        
        December 13, 2020 at 8:06 am
        
        Intel Skylake, as far as I can see and tell from the WikiChip page for Skylake, has port for Floating Point operations with 256 Bit Width.
        
        The server variation of Skylake has 2 x 512 Bit.
        Later architectures have some other configurations.
        
        Royi says:
        
        December 13, 2020 at 8:09 am
        
        A typo, I meant has 2 ports for Floating Point operations. Each port is capable of 256 Bit operations (AVX2).
        
        -.- says:
        
        December 13, 2020 at 11:04 pm
        
        There are 3x 256-bit ports (0, 1, 5) on Skylake. For example, Skylake can perform 3x 256b VPADDB per clock.
        If you silo yourself to FP operations only, then only ports 0 and 1 can execute them (though stuff like bitwise logic, e.g. VXORPS, can run on port 5).
        
        Note that 256b FP operations were added in AVX. AVX2 adds 256b integer operations.
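The port counts traded back and forth in this sub-thread can be tallied in a few lines. This is a back-of-the-envelope sketch in Python that uses only the unit counts and widths quoted in the comments (3 × 256-bit integer ports and 2 × 256-bit floating-point ports on Skylake, 4 × 128-bit NEON units on the M1). The dictionary keys and the function name are made up for illustration, and peak width says nothing about measured throughput.

```python
# Unit counts and widths as quoted in the comments above, not measurements.
SIMD_UNITS = {
    "skylake_int": (3, 256),  # ports 0, 1, 5 (e.g. VPADDB)
    "skylake_fp": (2, 256),   # ports 0 and 1 only
    "m1_neon": (4, 128),      # four 128-bit NEON pipelines
}


def peak_bits_per_cycle(units, width_bits):
    """Upper bound on SIMD bits processed per cycle: units times register width."""
    return units * width_bits


peaks = {name: peak_bits_per_cycle(*cfg) for name, cfg in SIMD_UNITS.items()}
for name, bits in peaks.items():
    print(f"{name}: {bits} bits/cycle")
```

Under these figures the floating-point peak on Skylake and the NEON peak on the M1 come out equal, while the Skylake integer peak is higher, which is consistent with both sides of the exchange.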
        
        Royi says:
        
        December 14, 2020 at 10:43 pm
        
        Have you looked at the WikiChip architecture page?
        
        For Floating Point operations there are only 2 ports.
        
        -.- says:
        
        December 15, 2020 at 12:20 am
        
        Yes, I’ve read that page, several times in fact.
        
        Have you read and understood my previous comment? I’m guessing no, as you seem to be completely ignoring it.
        
        Daniel Lemire says:
        
        December 15, 2020 at 12:23 am
        
        You guys are saying the same thing.

Maynard Handley says:

December 13, 2020 at 1:05 am

You (and other commenters) are aware of NEON, but apparently not of AMX.
AMX may not work for the sorts of JSON parsing weirdness for which you use AVX256 (that’ll have to wait for SVE/2, probably next year) but it does solve the problem of “I want to execute dense linear algebra fast”.

You might want to run some comparisons of that for your M1 vs Intel MacBooks… The APIs to look at are in Accelerate:
https://developer.apple.com/documentation/accelerate

Daniel Lemire says:

December 13, 2020 at 6:37 pm

I am aware of the Neural Engine but I considered it to be outside of the scope of this blog post.
1. Maynard Handley says:
  
  December 13, 2020 at 8:34 pm
  
  Apple AMX (not Intel AMX) is not the Neural Engine; it is on-CPU, no different conceptually from NEON.
  1. Daniel Lemire says:
    
    December 14, 2020 at 11:43 pm
    
    I stand corrected but it would still be outside the scope of the blog post. No matrix multiplication in sight.

Thomas Mansencal says:

December 12, 2020 at 2:51 am

In my basic tests, I general random

“generate”

Daniel Lemire says:

December 12, 2020 at 2:47 pm

Thank you.

me says:

December 12, 2020 at 10:01 am

Can you do an IO-bound benchmark as a reference?
How long does it take to count the number of 1’s in the input files?

Don’t you have concerns about Apple taxing all software on OSX via the play store with 30%?

Daniel Lemire says:

December 12, 2020 at 2:37 pm

IO benchmarks are methodologically much more difficult.

Cyril says:

December 12, 2020 at 12:07 pm

It would be interesting to compare SIMD performance too. M1 has 128bit NEON registers, but 4 SIMD execution units, all with mul support, comparing to 2+1 in Kaby Lake.

Daniel Lemire says:

December 13, 2020 at 7:41 pm

I have added a SIMD benchmark.
1. Cyril says:
  
  December 13, 2020 at 7:46 pm
  
  Cool, thanks, looks very interesting. Another curious test is Lemire random number generator. M1 has 2 mul execution units for the integer pipeline, so it can do 2 of 3 required multiplications in parallel. Probably it’s time for me to order a device with M1…
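The comment does not say which generator it has in mind. Assuming it is a 128-bit Lehmer generator (state multiplied by a 64-bit constant modulo 2^128, upper 64 bits returned), the sketch below shows where "3 required multiplications" could come from on a 64-bit machine. The multiplier value and all function names are illustrative assumptions, and Python integers stand in for machine registers.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1
MULT = 0xDA942042E4DD58B5  # illustrative odd 64-bit multiplier (assumption)


def lehmer64_reference(state):
    """One step with big integers: state *= MULT (mod 2**128), output the top 64 bits."""
    state = (state * MULT) & MASK128
    return state, state >> 64


def lehmer64_three_muls(state):
    """Same step written as three 64-bit multiplications on the two state halves."""
    lo, hi = state & MASK64, state >> 64
    prod_lo = (lo * MULT) & MASK64  # multiplication 1: low half of lo * MULT
    prod_hi = (lo * MULT) >> 64     # multiplication 2: high half of lo * MULT
    cross = (hi * MULT) & MASK64    # multiplication 3: low half of hi * MULT
    new_hi = (prod_hi + cross) & MASK64
    return (new_hi << 64) | prod_lo, new_hi


# Check that both formulations agree over a few steps.
state = 0x0123456789ABCDEF_FEDCBA9876543210
for _ in range(5):
    expected = lehmer64_reference(state)
    assert lehmer64_three_muls(state) == expected
    state = expected[0]
```

None of the three products takes another product as input, so a core with two integer multipliers could start two of them in the same cycle, which matches the point made in the comment.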

Dominic Amann says:

December 12, 2020 at 3:57 pm

It would be interesting to see similar benchmarks for Risc V.

-.- says:

December 12, 2020 at 10:57 pm

I don’t believe any RISC-V processor is even remotely close to the level of performance of current top-end x86/ARM cores.

Maynard Handley says:

December 13, 2020 at 1:12 am

“I do not yet understand why the fast_float library is so much faster on the Apple M1. It contains no ARM-specific optimization.”

It’s far from perfect but XCode/Instruments gives you access to performance counters on M1. You could start by looking at the usual suspects – number of instructions executed and retired and number of branches and branch mispredicts.
(I assume both the instruction flow and data memory flow are trivial enough that they aren’t blocking. So it boils down to
– CPU width
– branch mispredicts
– ability to look ahead past shallow-ish dependency chains (ie deep issue queue)
I’m not sure how you could get at this third one. x86 probably has a perf counter that gives the average depth of the I queue, but M1 may not make such a counter user-visible — though I expect it is there)

Daniel Lemire says:

December 13, 2020 at 5:52 pm

“You could start by looking at the usual suspects – number of instructions executed and retired and number of branches and branch mispredicts.”

Because I have studied this code a bit (with performance counters), I know that the fast_float code has very few branch mispredictions. So I do not think that branch prediction is important in the sense that I expect both processors to predict the branch very well. Of course, from that point forward, if both have eliminated the branch misprediction bottleneck, one might do better than the other at pipelining the code.

Given that I expect relatively few mispredictions, I expect that the number of instructions retired is going to be roughly the same as it would be on any other ARM processor. It is possible that Apple has some neat optimizer tricks in its version of LLVM, but this code is quite generic and boring. There is only so much Apple could do.
1. Maynard Handley says:
  
  December 15, 2020 at 2:21 am
  
  Well that’s the point isn’t it? Clarify the obvious basic things
  – same number of instructions?
  – same number of mispredicts?
  – but 1.8x the performance so more than 2x the IPC. Where’s that coming from?
  
  IF you insist on the two points stipulated above, what’s left?
  The only three issues remaining that I can see are
  – memory aliasing/forwarding. I don’t know how important that is with this type of code. Is there a lot of writing to a location then immediately reading back from that location?
  – dependency chains. If the most common dependency chains are (to guess numbers) around 150 instructions long, and x86’s issue queue is 100 instructions long while Apple’s is 200 long, then Apple can always be running two dependency chains in parallel, while most of the time Intel is operating on only one of them.
  – (the opposite of the above; dependency chains are very unimportant) ie the code does a lot of “parallel” work (many independent operations at every stage) so that Apple’s 8-wide decode and extreme flexibility in wide issue are no match for Intel’s 4 (or 5 or whatever depending on the precise details) decode width and less flexible issue.
  
  Basically where I’m coming from is that this stuff isn’t magic; there are reasons Apple achieve their 2+x IPC. But we won’t discover them if (as so much of the internet insists) every time any particular aspect of the M1 is suggested as being better than x86 (better branch prediction, better memory aliasing support, …) the immediate assumption is that either Apple is not better along that dimension or, “so what if they are, it doesn’t matter”.
  
  At the very least I think it’s important to validate assumptions like “of course they have more or less the same number of instructions executed”. Intel and ARMv8 both have “rich” instructions, ie instructions that do two things in one (eg on ARM shift-and-add, on Intel load-and-add). They then both crack these in different ways, then fuse the pieces in different ways.
  My guess is that the ARM rich instructions are a better match to current technology (ie most of the ARM rich instructions can execute as a single cycle, whereas most of the Intel ones land up being cracked to two different types of operations and can’t benefit from any sort of single-cycle “lots of ALU’ing”.) I’m not sure quite how one could test that claim, given that I don’t even know what performance counters Apple provides to us. But certainly on the Intel side we could learn (?)
  – instruction count
  – micro-ops counts
  – fused ops count?
  Which gives us info on that side, which we can then compare with as much as Apple tells us. Even knowing the Intel IPC (close to 1? close to 4?) gives one a start in asking what’s limiting performance.
  1. Daniel Lemire says:
    
    December 15, 2020 at 7:08 pm
    
    @Maynard
    
    For some context, I have not given this issue any time at all. It is not that I do not appreciate the question, and I will try to answer it, but these things take more than 30 seconds.
    
    but 1.8x the performance so more than 2x the IPC.
    
    I do not know this for a fact but it is how it looks. It must be wrong, however. I honestly do not know what to think at this point.
    
    Where’s that coming from?
    
    memory aliasing/forwarding. I don’t know how important that is with this type of code. Is there a lot of writing to a location then immediately reading back from that location?
    
    No. There is no (substantial) memory writes in the hot loops being benchmarked. You just read strings and compare the results with a min/max threshold.
    
    dependency chains. If the most common dependency chains are (to guess numbers) around 150 instructions long, and x86’s issue queue is 100 instructions long while Apple’s is 200 long, then Apple can always be running two dependency chains in parallel, while most of the time Intel is operating on only one of them. – (the opposite of the above; dependency chains are very unimportant) ie the code does a lot of “parallel” work (many independent operations at every stage) so that Apple’s 8-wide decode and extreme flexibility in wide issue are no match for Intel’s 4 (or 5 or whatever depending on the precise details) decode width and less flexible issue.
    
    The M1 could retire more instructions per cycle but could it retire 2x the number of instructions?
    
    It would need to retire something like 8 instructions per cycle. I am not kidding.
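The arithmetic behind the "8 instructions per cycle" remark can be sketched as follows. Run time is instructions / (IPC × clock), so a given speedup fixes the IPC the faster chip must sustain. The function name and the clock values in the example are hypothetical; only the baseline IPC of about 4 and the roughly 2x IPC ratio come from the discussion.

```python
def required_ipc(baseline_ipc, speedup, baseline_ghz, target_ghz, instr_ratio=1.0):
    """IPC the target chip must sustain to be `speedup` times faster than the baseline.

    instr_ratio is (target instruction count) / (baseline instruction count);
    1.0 means both binaries retire the same number of instructions.
    """
    return baseline_ipc * speedup * instr_ratio * baseline_ghz / target_ghz


# Hypothetical equal clocks (3.0 GHz each), baseline IPC of 4, 2x faster overall.
needed = required_ipc(baseline_ipc=4.0, speedup=2.0, baseline_ghz=3.0, target_ghz=3.0)
print(needed)
```

If the target chip ran at a higher clock or retired fewer instructions for the same work, the required IPC would drop accordingly, which is why checking the instruction counts first matters.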
    
    Basically where I’m coming from is that this stuff isn’t magic; there are reasons Apple achieve their 2+x IPC. But we won’t discover them if (as so much of the internet insists) every time any particular aspect of the M1 is suggested as being better than x86 (better branch prediction, better memory aliasing support, …) the immediate assumption is that either Apple is not better along that dimension or, “so what if they are, it doesn’t matter”.
    
    I did not imply that your question did not matter. In fact, I raised the question in my blog post because I think it is interesting.
    
    At the very least I think it’s important to validate assumptions like “of course they have more or less the same number of instructions executed”. Intel and ARMv8 both have “rich” instructions, ie instructions that do two things in one (eg on ARM shift-and-add, on Intel load-and-add). They then both crack these in different ways, then fuse the pieces in different ways.
    
    I have benchmarked this code on ARM processors before… just not on the M1. I am not new to ARM… I had an AMD ARM server…
    
    I have strong reasons to expect that the numbers of instructions retired on different ARM processors are going to be the same because (1) I expect the compiled binaries to be similar (2) I expect that there are few mispredicted branches.
    
    Then, of course, the M1 could do all sorts of fusion and stuff…
    
    My guess is that the ARM rich instructions are a better match to current technology (ie most of the ARM rich instructions can execute as a single cycle, whereas most of the Intel ones land up being cracked to two different types of operations and can’t benefit from any sort of single-cycle “lots of ALU’ing”.) I’m not sure quite how one could test that claim, given that I don’t even know what performance counters Apple provides to us. But certainly on the Intel side we could learn (?) – instruction count – micro-ops counts – fused ops count? Which gives us info on that side, which we can then compare with as much as Apple tells us. Even knowing the Intel IPC (close to 1? close to 4?) gives one a start in asking what’s limiting performance.
    
    The AMD Zen 2 IPC is 4 or even slightly better than 4.
    
    I have all the numbers for these… Just run my benchmark under Linux, it is instrumented and will give you straight back (without calling perf) the counter values.
    
    It is all there… 🙂
    
    It is not that I don’t care about the questions you are asking. I do care. But like all of us, I have only 26 hours per day. 🙂

Maynard Handley says:

December 15, 2020 at 8:59 pm

M1 probably CAN retire 8 instructions per cycle… It can certainly decode 8 per cycle so if anything retire will be 8 or higher. Issue is of course way higher, but the important number is 6-wide fixed-point issue. Throw in some load/stores and branches and you’re easily also at 8-wide issue.
A7 started at 6 wide, and around A11 bumped that to 8.

Maybe it is as simple as — this is VERY ILP friendly code, and Apple can execute it at IPC of 8.
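As a rough illustration of the "ILP friendly" hypothesis, here is a toy model in Python, not a description of either processor. It assumes single-cycle instructions, perfect prediction, and a fixed number of independent dependency chains, where each chain can advance by at most one instruction per cycle. The function name and all parameters are hypothetical.

```python
def simulate_ipc(num_chains, chain_length, issue_width):
    """Average instructions per cycle for independent chains on a core of given width."""
    remaining = [chain_length] * num_chains
    cycles = 0
    retired = 0
    while any(remaining):
        issued = 0
        for i in range(num_chains):
            # A chain contributes at most one instruction per cycle (serial dependency).
            if remaining[i] and issued < issue_width:
                remaining[i] -= 1
                issued += 1
        retired += issued
        cycles += 1
    return retired / cycles


print(simulate_ipc(num_chains=8, chain_length=100, issue_width=4))
print(simulate_ipc(num_chains=8, chain_length=100, issue_width=8))
print(simulate_ipc(num_chains=2, chain_length=100, issue_width=8))
```

In this model the sustained IPC is the smaller of the issue width and the number of independent chains, so a wider core only pays off when the code exposes enough parallel work.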