Daniel Lemire's blog

ARM MacBook vs Intel MacBook

36 thoughts on “ARM MacBook vs Intel MacBook”

  1. Leif says:

    I would try to use debug tools to generate flame graphs, or river diagrams, of where each algorithm is spending its time. Something like this example.

    That might provide some insight into commonalities and differences in the underlying libraries and functions.

  2. Michael says:

    You write that “[t]he Intel processor has nifty 256-bit SIMD instructions. The Apple chip has nothing of the sort as part of its main CPU.”

    The M1, like most modern ARMv8 CPUs, uses the NEON SIMD extension. The M1 has four 128-bit NEON pipelines, see the AnandTech overview.

    So the SIMD unit in the M1 is only half as wide as on current x86-64 CPUs, but “nothing of the sort” sounds a bit extreme…

    1. I am aware of NEON, but it is no match for AVX2 in general. Doubling the register width makes a big difference, at least in some cases.

      1. Royi says:

        I think in that regard they are on par.
        Per core, Intel CPUs usually have 2 ports for 256-bit operations, so in total a core works on 512 bits of data (I am not talking about the CPUs with AVX-512, I’m talking about the Skylake-derived CPUs).

        The M1 has 4 units of 128 bits each. In total that is also 512.
        Since it has a much wider decode front end, it won’t be hurt by not having a 256-bit operation in a single op.

        1. Do you have benchmark numbers of a comparison between AVX2 on a recent x64 processor (Intel/AMD) and the equivalent on ARM NEON?

          1. Royi says:

            What about the SpecFP in the Anandtech review?
            I’d guess Clang will generate in many cases vectorized code so you’ll be able to see.

            But since you have the hardware, why not give it a try?

            1. Andrei F says:

              Daniel’s background stance on this type of benchmarking surrounds software with heavy usage of intrinsics and optimised routines.

              While the compiler will spit out some SIMD here and there where it can, SPECfp uses general use-case code without such hand-crafted vectorisation, and as such the performance uplift and impact is very minor.

          2. Andrei F says:

            How can you claim NEON is no match for AVX2 and then ask for performance numbers? That’s a pretty irresponsible stance.

            Vector size is irrelevant to the performance discussion because each µarch will be optimised around their particular setup. The total execution throughput of the M1 isn’t any less than that of your Kaby Lake chip – which is what matters.

            As others have noted, there’s plenty of NEON-optimised software out there and it runs perfectly fine.

            You can even try something as simple as a portability layer to run your own benchmarks of your own AVX2 packages:

            https://simd-everywhere.github.io/blog/2020/06/22/transitioning-to-arm-with-simde.html

            For the vast majority of cases NEON should be functionally equivalent to AVX.

            1. “How can you claim NEON is no match for AVX2 and then ask for performance numbers? That’s a pretty irresponsible stance.”

              I don’t think it is irresponsible to ask for performance numbers. I do not like to argue in the abstract.

                1. BTW I was wrong. Not wrong to ask for benchmarks, but wrong in the belief that the M1 would not match AVX2.

        2. -.- says:

          Intel CPUs have 3x 256-bit ports, not 2x. Take note that wider SIMD doesn’t only affect the EUs, it’ll help with increasing effective PRF size, load/store etc.

          Of course, not all EUs support all operations, but I have no clue what the distribution is like on M1.

          1. Royi says:

            Intel Skylake, as far as I can tell from the WikiChip page for Skylake, has port for floating-point operations with 256-bit width.

            The server variation of Skylake has 2 x 512 Bit.
            Later architectures have some other configurations.

            1. Royi says:

              A typo, I meant has 2 ports for Floating Point operations. Each port is capable of 256 Bit operations (AVX2).

              1. -.- says:

                There are 3x 256-bit ports (0, 1, 5) on Skylake. For example, Skylake can perform 3x 256b VPADDB per clock.
                If you silo yourself to FP operations only, then only ports 0 and 1 can execute them (though stuff like bitwise logic, e.g. VXORPS, can run on port 5).

                Note that 256b FP operations were added in AVX. AVX2 adds 256b integer operations.

                1. Royi says:

                  Have you looked at the WikiChip architecture page?

                  For Floating Point operations there are only 2 ports.

                  1. -.- says:

                    Yes, I’ve read that page, several times in fact.

                    Have you read and understood my previous comment? I’m guessing no, as you seem to be completely ignoring it.

                    1. You guys are saying the same thing.
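The port counts argued over in this sub-thread can be put side by side with a little arithmetic. The sketch below only tallies the figures quoted by the commenters (3 × 256-bit integer ports and 2 × 256-bit floating-point ports on Skylake, 4 × 128-bit NEON units on the M1). The function and dictionary names are made up for illustration, and peak width says nothing about which operations each unit actually supports.

```python
def peak_simd_bits_per_cycle(units: int, width_bits: int) -> int:
    """Upper bound on SIMD bits processed per cycle: units times register width."""
    return units * width_bits


# Figures as quoted in the comments above, not measured values.
CONFIGS = {
    "Skylake, 256-bit integer (ports 0, 1, 5)": (3, 256),
    "Skylake, 256-bit floating point (ports 0, 1)": (2, 256),
    "Apple M1, 128-bit NEON": (4, 128),
}

# Peak bits per cycle for every configuration in the table.
PEAK_BITS = {name: peak_simd_bits_per_cycle(*cfg) for name, cfg in CONFIGS.items()}

if __name__ == "__main__":
    for name, bits in PEAK_BITS.items():
        print(f"{name}: {bits} bits/cycle")
```

On these numbers the M1 and Skylake are level for floating-point work, while Skylake keeps an edge for integer operations that can use all three ports.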

  • Maynard Handley says:

    You (and other commenters) are aware of NEON, but apparently not of AMX.
    AMX may not work for the sorts of JSON parsing weirdness for which you use AVX256 (that’ll have to wait for SVE/2, probably next year) but it does solve the problem of “I want to execute dense linear algebra fast”.

    You might want to run some comparisons of that for your M1 vs Intel MacBooks… The APIs to look at are in Accelerate:
    https://developer.apple.com/documentation/accelerate

    1. I am aware of the Neural Engine but I considered it to be outside of the scope of this blog post.

      1. Maynard Handley says:

        Apple AMX (not Intel AMX) is not the Neural Engine, it is on-CPU, no different conceptually from NEON.

        1. I stand corrected but it would still be outside the scope of the blog post. No matrix multiplication in sight.

  • In my basic tests, I general random

    “generate”

    1. Thank you.

  • me says:

    Can you do an I/O-bound benchmark as a reference?
    How long does it take to count the number of 1’s in the input files?

    Don’t you have concerns about Apple taxing all software on OSX via the App Store at 30%?

    1. IO benchmarks are methodologically much more difficult.
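The reference workload proposed above, counting the 1 bits in the input files, can be sketched in a few lines. This is a minimal illustration rather than anything taken from the blog post's benchmark: the function names, the lookup-table approach, and the 1 MiB chunk size are all assumptions.

```python
# Lookup table: POPCOUNT[b] is the number of 1 bits in byte value b.
POPCOUNT = bytes(bin(b).count("1") for b in range(256))


def count_ones_in_bytes(data: bytes) -> int:
    """Return the number of 1 bits in a bytes object."""
    # translate() maps every byte to its bit count; sum() adds them up.
    return sum(data.translate(POPCOUNT))


def count_ones_in_file(path: str, chunk_size: int = 1 << 20) -> int:
    """Stream a file in fixed-size chunks and count its 1 bits."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            total += count_ones_in_bytes(chunk)
    return total
```

Timing `count_ones_in_file` with `time.perf_counter()` gives a rough figure, but whether the file is already in the operating system's cache changes the result considerably, which is one reason such measurements are hard to get right.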

  • Cyril says:

    It would be interesting to compare SIMD performance too. The M1 has 128-bit NEON registers, but 4 SIMD execution units, all with mul support, compared to 2+1 in Kaby Lake.

    1. I have added a SIMD benchmark.

      1. Cyril says:

        Cool, thanks, looks very interesting. Another curious test is the Lemire random number generator. The M1 has 2 mul execution units for the integer pipeline, so it can do 2 of the 3 required multiplications in parallel. Probably it’s time for me to order a device with an M1…

  • It would be interesting to see similar benchmarks for RISC-V.

    1. -.- says:

      I don’t believe any RISC-V processor is even remotely close to the level of performance of current top-end x86/ARM cores.

  • Maynard Handley says:

    “I do not yet understand why the fast_float library is so much faster on the Apple M1. It contains no ARM-specific optimization.”

    It’s far from perfect but Xcode/Instruments gives you access to performance counters on M1. You could start by looking at the usual suspects – number of instructions executed and retired and number of branches and branch mispredicts.
    (I assume both the instruction flow and data memory flow are trivial enough that they aren’t blocking. So it boils down to
    – CPU width
    – branch mispredicts
    – ability to look ahead past shallow-ish dependency chains (ie deep issue queue)
    I’m not sure how you could get at this third one. x86 probably has a perf counter that gives the average depth of the issue queue, but M1 may not make such a counter user-visible, though I expect it is there)

    1. “You could start by looking at the usual suspects – number of instructions executed and retired and number of branches and branch mispredicts.”

      Because I have studied this code a bit (with performance counters), I know that the fast_float code has very few branch mispredictions. So I do not think that branch prediction is important here, in the sense that I expect both processors to predict the branches very well. Of course, from that point forward, if both have eliminated the branch misprediction bottleneck, one might do better than the other at pipelining the code.

      Given that I expect relatively few mispredictions, I expect that the number of instructions retired is going to be roughly the same as it would be on any other ARM processor. It is possible that Apple has some neat optimizer tricks in its version of LLVM, but this code is quite generic and boring. There is only so much Apple could do.
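The "usual suspects" discussed here reduce to a handful of ratios once the raw counters are in hand. The sketch below shows that arithmetic only; the function name is hypothetical and the counter values are invented for illustration, not measurements of fast_float on either machine.

```python
def derived_metrics(cycles: int, instructions: int,
                    branches: int, branch_misses: int) -> dict:
    """Turn four raw performance-counter readings into comparable ratios."""
    return {
        # Instructions retired per cycle.
        "ipc": instructions / cycles,
        # Fraction of branches that were mispredicted.
        "branch_miss_rate": branch_misses / branches,
        # Mispredictions per 1000 retired instructions.
        "branch_misses_per_kilo_instruction": 1000 * branch_misses / instructions,
    }


# Invented counter readings, for illustration only.
SAMPLE = derived_metrics(
    cycles=1_000_000,
    instructions=3_500_000,
    branches=500_000,
    branch_misses=1_000,
)

if __name__ == "__main__":
    for name, value in SAMPLE.items():
        print(f"{name}: {value:.4f}")
```

Comparing these ratios across the two machines separates "fewer instructions" from "more instructions per cycle" from "fewer mispredictions", which is the distinction the thread is after.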

      1. Maynard Handley says:

        Well that’s the point isn’t it? Clarify the obvious basic things
        – same number of instructions?
        – same number of mispredicts?
        – but 1.8x the performance so more than 2x the IPC. Where’s that coming from?

        If you insist on the two points stipulated above, what’s left?
        The only three issues remaining that I can see are
        – memory aliasing/forwarding. I don’t know how important that is with this type of code. Is there a lot of writing to a location then immediately reading back from that location?
        – dependency chains. If the most common dependency chains are (to guess numbers) around 150 instructions long, and x86’s issue queue is 100 instructions long while Apple’s is 200 long, then Apple can always be running two dependency chains in parallel, while most of the time Intel is operating on only one of them.
        – (the opposite of the above; dependency chains are very unimportant) ie the code does a lot of “parallel” work (many independent operations at every stage) so that Intel’s 4 (or 5 or whatever depending on the precise details) decode width and less flexible issue are no match for Apple’s 8-wide decode and extreme flexibility in wide issue.

        Basically where I’m coming from is that this stuff isn’t magic; there are reasons Apple achieve their 2+x IPC. But we won’t discover them if (as so much of the internet insists) every time any particular aspect of the M1 is suggested as being better than x86 (better branch prediction, better memory aliasing support, …) the immediate assumption is that either Apple is not better along that dimension or, “so what if they are, it doesn’t matter”.

        At the very least I think it’s important to validate assumptions like “of course they have more or less the same number of instructions executed”. Intel and ARMv8 both have “rich” instructions, ie instructions that do two things in one (eg on ARM shift-and-add, on Intel load-and-add). They then both crack these in different ways, then fuse the pieces in different ways.
        My guess is that the ARM rich instructions are a better match to current technology (ie most of the ARM rich instructions can execute as a single cycle, whereas most of the Intel ones land up being cracked to two different types of operations and can’t benefit from any sort of single-cycle “lots of ALU’ing”.) I’m not sure quite how one could test that claim, given that I don’t even know what performance counters Apple provides to us. But certainly on the Intel side we could learn (?)
        – instruction count
        – micro-ops counts
        – fused ops count?
        Which gives us info on that side, which we can then compare with as much as Apple tells us. Even knowing the Intel IPC (close to 1? close to 4?) gives one a start in asking what’s limiting performance.
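The step from "1.8x the performance" to "more than 2x the IPC" follows from run time = instructions / (IPC × clock frequency). A minimal sketch of that conversion is below. The 1.8 speedup comes from the comment above, but the two clock frequencies are placeholder assumptions (the thread does not state them), and the function name is hypothetical.

```python
def ipc_ratio(speedup: float, base_clock_ghz: float, new_clock_ghz: float,
              instruction_ratio: float = 1.0) -> float:
    """IPC of the new machine divided by IPC of the baseline machine.

    From time = instructions / (IPC * clock):
        IPC_new / IPC_base = speedup * (f_base / f_new) * (N_new / N_base)
    instruction_ratio is N_new / N_base; 1.0 assumes both machines
    retire the same number of instructions, which is itself unverified.
    """
    return speedup * (base_clock_ghz / new_clock_ghz) * instruction_ratio


# Speedup from the thread; both clock speeds are illustrative assumptions.
RATIO = ipc_ratio(speedup=1.8, base_clock_ghz=3.6, new_clock_ghz=3.2)

if __name__ == "__main__":
    print(f"IPC ratio: {RATIO:.3f}")
```

With these placeholder clocks the ratio comes out just above 2. The `instruction_ratio` parameter is there because the "same number of instructions" assumption is exactly what the comment asks to have validated.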

        1. @Maynard

          For some context, I have not given this issue any time at all. It is not that I do not appreciate the question, and I will try to answer it, but these things take more than 30 seconds.

          “but 1.8x the performance so more than 2x the IPC.”

          I do not know this for a fact but it is how it looks. It must be wrong, however. I honestly do not know what to think at this point.

          “Where’s that coming from?”

          “memory aliasing/forwarding. I don’t know how important that is with this type of code. Is there a lot of writing to a location then immediately reading back from that location?”

          No. There are no (substantial) memory writes in the hot loops being benchmarked. You just read strings and compare the results with a min/max threshold.

          “– dependency chains. If the most common dependency chains are (to guess numbers) around 150 instructions long, and x86’s issue queue is 100 instructions long while Apple’s is 200 long, then Apple can always be running two dependency chains in parallel, while most of the time Intel is operating on only one of them.
          – (the opposite of the above; dependency chains are very unimportant) ie the code does a lot of “parallel” work (many independent operations at every stage) so that Intel’s 4 (or 5 or whatever depending on the precise details) decode width and less flexible issue are no match for Apple’s 8-wide decode and extreme flexibility in wide issue.”

          The M1 could retire more instructions per cycle but could it retire 2x the number of instructions?

          It would need to retire something like 8 instructions per cycle. I am not kidding.

          “Basically where I’m coming from is that this stuff isn’t magic; there are reasons Apple achieve their 2+x IPC. But we won’t discover them if (as so much of the internet insists) every time any particular aspect of the M1 is suggested as being better than x86 (better branch prediction, better memory aliasing support, …) the immediate assumption is that either Apple is not better along that dimension or, ‘so what if they are, it doesn’t matter’.”

          I did not imply that your question did not matter. In fact, I raised the question in my blog post because I think it is interesting.

          “At the very least I think it’s important to validate assumptions like ‘of course they have more or less the same number of instructions executed’. Intel and ARMv8 both have ‘rich’ instructions, ie instructions that do two things in one (eg on ARM shift-and-add, on Intel load-and-add). They then both crack these in different ways, then fuse the pieces in different ways.”

          I have benchmarked this code on ARM processors before… just not on the M1. I am not new to ARM… I had an AMD ARM server…

          I have strong reasons to expect that the numbers of instructions retired on different ARM processors are going to be the same because (1) I expect the compiled binaries to be similar (2) I expect that there are few mispredicted branches.

          Then, of course, the M1 could do all sorts of fusion and stuff…

          “My guess is that the ARM rich instructions are a better match to current technology (ie most of the ARM rich instructions can execute as a single cycle, whereas most of the Intel ones land up being cracked to two different types of operations and can’t benefit from any sort of single-cycle ‘lots of ALU’ing’.) I’m not sure quite how one could test that claim, given that I don’t even know what performance counters Apple provides to us. But certainly on the Intel side we could learn (?)
          – instruction count
          – micro-ops counts
          – fused ops count?
          Which gives us info on that side, which we can then compare with as much as Apple tells us. Even knowing the Intel IPC (close to 1? close to 4?) gives one a start in asking what’s limiting performance.”

          The AMD Zen 2 IPC is 4 or even slightly better than 4.

          I have all the numbers for these… Just run my benchmark under Linux, it is instrumented and will give you straight back (without calling perf) the counter values.

          It is all there… 🙂

          It is not that I don’t care about the questions you are asking. I do care. But like all of us, I have only 26 hours per day. 🙂

  • Maynard Handley says:

    M1 probably CAN retire 8 instructions per cycle… It can certainly decode 8 per cycle, so if anything retire will be 8 or higher. Issue is of course way higher, but the important number is 6-wide fixed-point issue. Throw in some load/stores and branches and you’re easily also at 8-wide issue.
    A7 started at 6 wide, and around A11 bumped that to 8.

    Maybe it is as simple as — this is VERY ILP friendly code, and Apple can execute it at IPC of 8.
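The "8 instructions per cycle" figure can be checked against the pipeline widths mentioned in this comment. The sketch below assumes an x86 baseline of about 4 instructions per cycle (the figure quoted earlier in the thread for Zen 2; the Intel figure is not stated) and a 2x IPC ratio. The retire width of 8 is the lower bound suggested above, not a documented number, and all names are hypothetical.

```python
def ipc_ceiling(*stage_widths: int) -> int:
    """Sustained IPC cannot exceed the narrowest pipeline stage."""
    return min(stage_widths)


def required_ipc(baseline_ipc: float, ipc_ratio: float) -> float:
    """IPC the faster machine must sustain to reach the given IPC ratio."""
    return baseline_ipc * ipc_ratio


X86_BASELINE_IPC = 4.0  # assumption: figure quoted in the thread for Zen 2
IPC_RATIO = 2.0         # "more than 2x the IPC", rounded down
M1_DECODE_WIDTH = 8     # from the comment above
M1_RETIRE_WIDTH = 8     # assumption: the comment says "8 or higher"

NEEDED = required_ipc(X86_BASELINE_IPC, IPC_RATIO)
CEILING = ipc_ceiling(M1_DECODE_WIDTH, M1_RETIRE_WIDTH)
FEASIBLE = NEEDED <= CEILING

if __name__ == "__main__":
    print(f"needed IPC: {NEEDED}, ceiling: {CEILING}, feasible: {FEASIBLE}")
```

Under these assumptions the required IPC sits right at the ceiling, so the explanation only holds if the code keeps every decode slot busy on nearly every cycle.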