Daniel Lemire's blog


Apple’s M1 processor and the full 128-bit integer product

17 thoughts on “Apple’s M1 processor and the full 128-bit integer product”

  1. Apple M1 chip is a warehouse/workshop model

    Copyright © 2018 Lin Pengcheng. All rights reserved.

    Introduction

    Computer hardware is also a factory that produces data, so the “warehouse/workshop model” can be applied to it as well. The model uses memory as the core, not the CPU. In the end, we can achieve the grand unification of all IT fields such as hardware, software, Internet, and Internet of Things.

    Warehouse: Memory
    Workshop: CPU, graphics card, sound card, etc.
    Standardized data: data transmitted between hardware that conforms to industry standard interfaces
    Acceptance: Motherboard with standardized interfaces such as PCI, SATA, USB, etc.
    External standardized data: hard disk, flash drive, etc.

    The out-of-order execution technology of modern CPUs is a mistake (February 16, 2021)

    Out-of-order execution is a product of wrong programming methodology, wrong computer architecture and weak compiler conditions.

    In the “warehouse/workshop model”, the workshop is an orderly and high-speed ray (pipeline). The warehouse scheduling function performs dynamic planning and unified scheduling for all workshops and resources, without conflict and competition, and runs in the optimal order and efficiency.

    Follower Case

    My computer hardware architecture design was published on February 06, 2019. One or two years later, the Apple M1 chip adopted the “warehouse/workshop model” design and was released on November 11, 2020.

    Warehouse: unified memory
    Workshop: CPU, GPU and other cores
    Products (raw materials): information, data

    there’s also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance. reference: Developer Delves Into Reasons Why Apple’s M1 Chip is So Fast

    From the introduction

    Apple M1 has not done global optimization of various core (workshop) scheduling.
    Apple M1 only optimizes the access to memory data (materials and products in the warehouse).
    Apple needs to further improve the programming language and compiler to support and promote my programming methodology.
    My architecture supports a wider range of workshop types than Apple M1, with greater efficiency, scalability and flexibility.

    Conclusion

    The Apple M1 chip still needs a lot of optimization work. Its optimization level is still very simple; after all, it is only the first generation, released in stages.

    Forecast(2021-01-19): I think Intel, AMD, ARM, supercomputer, etc. will adopt the “warehouse/workshop model”

    In the past, the performance of the CPU played a decisive role in the performance of the computer. There were few CPU cores, and peripherals were few in number and type. Therefore, the CPU became the center of the computer hardware architecture.

    Now, with more and more CPU and GPU cores, and a growing number and variety of peripherals, the communication, coordination, and management of cores (or components, peripherals) have become more and more important. They have become a key factor in computer performance.

    The core views of management science and computer science are the same: Use all available resources to complete the goal with the highest efficiency. It is the best field of management science to accomplish production goals through communication, coordination, and management of various available resources. The most effective, reliable, and absolutely mainstream way is the “warehouse/workshop model”.

    Changing only the architecture, while leaving the CPU instruction set unchanged or merely extending it, will not affect CPU compatibility and will open up huge room for optimization.

    So I think Intel, AMD, ARM, supercomputing, etc. will adopt the “warehouse/workshop model”, which is an inevitable trend in the development of computer hardware. My unified architecture and programming methodology will be vigorously promoted by these CPU companies, sweeping the world from the bottom up.

    Finally, the “warehouse/workshop model” will surely replace the “von Neumann architecture” and become the first architecture in the computer field, and it is the first architecture to achieve a unified software and hardware.

    link: The Grand Unified Programming Theory: The Pure Function Pipeline Data Flow with principle-based Warehouse/Workshop Model

    1. Andrew says:

      Did you just post an essay as a comment on someone else’s blog?

  2. Maynard Handley says:

    For people who are interested in the details: what suggested to me that the pair UMULH+MUL might be fused is the following patent:

    https://patents.google.com/patent/US9223577B2/

    This patent is interesting because traditional fused instructions were restricted to zero or one output register (think e.g. the traditional cmp+branch pair, or the ARM crypto fused pairs). The reason for this is that the traditional pipeline has a destination register allocation stage (usually called Rename) which (for various good reasons) is set up to allocate one destination register per instruction.

    What the patent describes is a very neat scheme whereby an instruction can be split at decode into multiple sub-instructions that pass through rename separately but then (and this is the novel part) they are recombined for the purposes of scheduling and execution.

    This is actually a remarkably nice idea. It allows Apple to treat the common ARM instruction Load Pair as a single unit for most purposes, even though it overwrites two registers. And it allows various interesting instruction pairs (like UMULH and MUL) to be fused into one operation where that makes sense, even if that combined operation generates two destination registers.

    It would be interesting if anyone reading this can think of further instruction pairs that are likely fused.
    Dougall Johnson’s initial M1 explorations at

    https://dougallj.github.io/applecpu/firestorm.html

    lists the known fused instructions (some crypto, and what are essentially compare+branch).
    His list omits

    the obvious case of ADR+ADRP (done since the first ARMv8 cores)
    arithmetic followed by a branch (but without setting a flag, eg ADD+CBZ)
    the probable case (but not yet tested) of constructing large immediates via MOV+MOVK

    All those are obvious. Not obvious (but the obvious next interesting case to test for the purposes of this blog!) is the reverse of multiplication, namely division.
    Suppose you want both the quotient and the remainder of a division. This is another situation that is generally tricky to handle optimally because, once again, there are two result registers.
    The ARMv8 solution is a UDIV (to generate the divide result) followed by MSUB (to generate the remainder). This is another obvious fusion candidate if you have the ability for one fused instruction to overwrite two registers.
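
For readers who want to see the pattern in source form, here is a minimal C sketch of the quotient-plus-remainder computation described above. The names `udivmod_t` and `udivmod64` are hypothetical, chosen only for illustration. On AArch64, compilers typically lower the division to UDIV and the remainder to MSUB, but the exact output depends on the compiler and flags.

```c
#include <stdint.h>

/* Hypothetical helper type: holds both results of one division. */
typedef struct {
    uint64_t quot;
    uint64_t rem;
} udivmod_t;

/* Computes n / d and n % d for d != 0.
 * The remainder is written as n - quot * d, which is the shape a
 * multiply-subtract instruction (MSUB on ARMv8) computes directly. */
static udivmod_t udivmod64(uint64_t n, uint64_t d) {
    udivmod_t r;
    r.quot = n / d;          /* typically UDIV on AArch64 */
    r.rem = n - r.quot * d;  /* typically MSUB on AArch64 */
    return r;
}
```

The second statement depends on the first, which is why the pair reads as a natural fusion candidate: two destination registers, one logical operation.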

  3. Laurent says:

    Apple M1 has two fully pipelined integer pipes that can do MUL or MULH. This means it can produce one full 64-bit x 64-bit -> 128-bit product per cycle. (More information is available on Dougall Johnson's site.)

    The GMP library's assembly loop results show how well it does on such code. For instance, the multi-precision multiplication loop mul_1 produces 64 bits of result each cycle, while the fastest x86 chip needs 1.5 cycles.

    Of course the number of cycles is not the full story, but the M1 is quite good at crunching integer numbers 🙂
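
As a rough illustration of the kind of loop discussed above, here is a C sketch that multiplies a multi-word number by a single 64-bit word. It is not GMP's code (GMP uses hand-written assembly); the name `mul_1_sketch` is hypothetical, and the code assumes a compiler that supports the `__uint128_t` extension (GCC and Clang do on 64-bit targets). Each iteration needs both halves of a 64-bit x 64-bit product, which is where MUL and UMULH come in on ARM.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiplies the n-word number up[0..n-1] (least significant word first)
 * by the single word v, stores the low n words in rp, and returns the
 * final carry word. Hypothetical sketch, not GMP's implementation. */
static uint64_t mul_1_sketch(uint64_t *rp, const uint64_t *up, size_t n,
                             uint64_t v) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Full 128-bit product plus carry; this sum cannot overflow
         * 128 bits because (2^64-1)^2 + (2^64-1) < 2^128. */
        __uint128_t p = (__uint128_t)up[i] * v + carry;
        rp[i] = (uint64_t)p;           /* low 64 bits  */
        carry = (uint64_t)(p >> 64);   /* high 64 bits */
    }
    return carry;
}
```

The loop emits one 64-bit result word per iteration, so a core that can sustain one full product per cycle can, in principle, approach one word per cycle.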

  4. rando says:

    Why don't you ever post/look at the generated assembly in such small benchmarks? Especially in profiler tools like Intel VTune and AMD uProf, which would provide instruction-level insight on modern CPUs and give potentially true answers to architecture questions, e.g. whether a bottleneck is the number of ALUs, the pipeline width, dependencies, etc.

    For example, I looked at Lehmer's RNG on godbolt with Clang and GCC, and they generate a slightly different order of imul/mul/add.

    [imul mul add] or [mul imul add]: imul and add carry the dependency. In the second ordering they are in direct sequence, while in the first there is one instruction between them. Still, one can't really tell whether that would be helpful on modern out-of-order machines. A profiler would tell.

    Are there profiler tools like that for the Apple M1?

    I think posting asm/profiles would be more enlightening (and entertaining…) than just counting the number of potentially expensive instructions. Though more work involved…
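
For context, the Lehmer generator mentioned above is commonly written as below. This is a sketch assuming the `__uint128_t` extension; the multiplier constant is the one usually published with this generator and is an assumption here, since it does not appear in this thread. The state-pointer signature is also a choice made for illustration.

```c
#include <stdint.h>

/* Lehmer-style generator with a 128-bit state (the state must be odd).
 * Multiplying a 128-bit state by a 64-bit constant takes:
 *   - a full 64x64 -> 128 product of the low half (mul on x86-64),
 *   - a 64-bit-only product of the high half (imul on x86-64),
 *   - an add to combine them,
 * which is the imul/mul/add mix discussed above. */
static uint64_t lehmer64(__uint128_t *state) {
    *state *= 0xda942042e4dd58b5ULL;   /* assumed multiplier constant */
    return (uint64_t)(*state >> 64);   /* output is the high 64 bits */
}
```

Pasting a function like this into godbolt.org with different compilers is an easy way to compare the instruction orderings.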

    1. Why dont you ever post/look at the generated assembly in such small
      benchmarks?

      I recommend using godbolt.org.

      Especially in profiler tools like Intel Vtune and Amd uProf, which
      would provide instruction level insight on modern CPUs and give
      potentially true answers to architecture questions. Eg if a bottleneck
      is the number of ALUs or pipeline width or dependencies , etc.

      I do not expect Intel VTune and AMD uProf to work on Apple M1 MacBooks. One can use Apple’s Instruments, however.

      1. I posted assembly code and I have updated the benchmark with instrumented code which records the cycles and instructions retired.

    2. Eddy Current says:

      Why don’t you post your source code plus the URLs you got from compiler explorer?

  5. Antoine says:

    Or simply, the M1 has enough execution units and reordering capacity to compute two or three 64-bit multiplications at once?

    Note that the way wyhash() is implemented, computing the next state is just a trivial addition from the current state, so a modern CPU would be able to overlap several consecutive calls to wyhash(). You don’t need a 128-bit multiplier to explain the timings, IMHO (which doesn’t mean the 128-bit multiplier doesn’t exist, of course :-)).

    1. Antoine says:

      Uh, by the way, this page (at the time I’m writing this) and the RSS feed show different numbers. The RSS feed says 0.30ns vs 0.45ns for wyhash vs splitmix, this page says 0.60ns vs 0.85ns. What are the right values?

      1. The blog post was updated because I was dividing by half the number of integers. This is explained in the blog post (see at the bottom).

        Both the page and the RSS feed are updated in sync.

    2. Or simply, the M1 has enough execution units and reordering capacity
      to compute two or three 64-bit multiplications at once?

      My tests are not thorough enough to enable me to conclude one way or the other about the underlying technology, so my conclusion is merely stated as “Apple Silicon is efficient at computing the full 128-bit product of two 64-bit integers”.

    3. George Spelvin says:

      Note that the way wyhash() is implemented, computing the next state is just a trivial addition from the current state, so a modern CPU would be able to overlap several consecutive calls to wyhash().

      Er, both functions are of this type. splitmix64 uses an additive constant of 0x9E3779B97F4A7C15, while wyhash uses an additive constant of 0x60bee2bee120fc15. Thus, there should be enough ILP in either function to saturate the processor’s functional units.

      64×64 multipliers are large and expensive, and integer multiply isn’t that common an operation. It’s hard to imagine equipping a non-DSP processor with more than one.
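
To make the structure of the two generators concrete, here is a C sketch of both. The additive constants are the ones quoted above; the multiplier constants and shift amounts are taken from commonly published versions of these generators and should be treated as assumptions, as should the state-pointer signatures. The code assumes the `__uint128_t` extension.

```c
#include <stdint.h>

/* splitmix64: additive state update, then 64-bit multiplies and shifts. */
static uint64_t splitmix64(uint64_t *state) {
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;  /* assumed constant */
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;  /* assumed constant */
    return z ^ (z >> 31);
}

/* wyhash-style generator: additive state update, then two full
 * 64x64 -> 128 products whose halves are XORed together. */
static uint64_t wyhash64(uint64_t *state) {
    *state += 0x60bee2bee120fc15ULL;
    __uint128_t tmp = (__uint128_t)*state * 0xa3b195354a39b70dULL; /* assumed */
    uint64_t m1 = (uint64_t)(tmp >> 64) ^ (uint64_t)tmp;
    tmp = (__uint128_t)m1 * 0x1b03738712fad5c9ULL;                 /* assumed */
    return (uint64_t)(tmp >> 64) ^ (uint64_t)tmp;
}
```

In both functions the next state depends only on the previous state and a constant, never on the mixed output, so an out-of-order core is free to overlap the multiplications of consecutive calls.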

      1. It is correct, both functions can interleave their operations so that the throughput differs from the reciprocal of the latency.

        1. But it is absolutely true that there is no need for the mul/mulh to be fused, and my blog post does not claim that they are.

  6. Cyril says:

    This is not M1-specific and started somewhere around the Apple A10 SoC. Unfortunately I do not have a device with an A10 to test, but here are results for devices from A9 to A14. And referring to our discussion, I would expect similar performance results from the Cortex-A75.

    A14
    splitmix 0.45 ns/value (5.65 %)
    wyrng 0.3 ns/value (13.9 %)

    A13
    splitmix 0.5 ns/value (3.76 %)
    wyrng 0.35 ns/value (11.6 %)

    A12X
    splitmix 0.55 ns/value (8.7 %)
    wyrng 0.4 ns/value (3.67 %)

    A11
    splitmix 0.6 ns/value (4.24 %)
    wyrng 0.4 ns/value (7.36 %)

    A9
    splitmix 1.2 ns/value (29.7 %)
    wyrng 1.5 ns/value (31.1 %)

    1. Yes Cyril, thank you.