Daniel Lemire's blog

, 3 min read

Apple´s M1 processor and the full 128-bit integer product

If I multiply two 64-bit integers (having values in [0, 264)), the product requires 128 bits. Intel and AMD processors (x64) can compute the full (128-bit) product of two 64-bit integers using a single instruction (mul). ARM processors, such as those found in your mobile phone, require two instructions to achieve the same result: mul computes the least significant 64 bits while mulh computes the most significant 64 bits.

I believe that it has often meant that computing the full 128-bit product was more expensive, everything else being equal, on ARM processors than on x64 (Intel) processors. However, the instruction set does not have to determine the performance. For example, ARM processors can recognize that I am calling both instructions (mul and mulh) and process them more efficiently. Or they may simply have very inexpensive multipliers.

To explore the problem, let us pick two pseudo-random number generators, splitmix and wyhash:

uint64_t <b>splitmix</b>() {
    uint64_t z = (state += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}
uint64_t <b>wyhash</b>() {
    state += 0x60bee2bee120fc15ull;
    __uint128_t tmp = (__uint128_t)(state)*0xa3b195354a39b70dull;
    uint64_t m1 = (tmp >> 64) ^ tmp;
    tmp = (__uint128_t)m1 * 0x1b03738712fad5c9ull;
    return (tmp >> 64) ^ tmp;
 }

As I reported earlier, wyhash should almost always be faster on an Intel or AMD processor as it is only two multiplications with an addition whereas the splitmix function is made of two multiplications with several other operations. However, wyhash requires two full multiplications whereas splitmix requires only two 64-bit products. If the two full multiplications in wyhash are equivalent two four multiplications, then wyhash becomes more expensive.

I wrote a small C++ benchmark to measure the time (in nanoseconds) that it takes to compute a random value using Apple’s new M1 processor (ARM, 3.2 GHz). The compiler is Apple clang version 12 which comes by default on the new Apple Silicon laptops.

  Apple M1
wyhash 0.60 ns/value
splitmix 0.85 ns/value

My test suggests that it takes a bit under 3 cycles to generate a number with splitmix and a bit under 2 cycles to generate a number with wyhash. The wyhash generator is much faster than splitmix on the Apple M1 processor (by about 40% to 50%) which suggests that Apple Silicon is efficient at computing the full 128-bit product of two 64-bit integers. It would be nicer to be able to report the number of instructions and cycles, but I do not know how to instrument code under macOS for such measures.

Further reading: Apple M1 Microarchitecture Research by Dougall Johnson

Credit: Maynard Handley suggested this blog post.

Update: The numbers were updated since they were off by a factor of two due to a typographical error in the code.

Update 2: An interested reader provided me with the means to instrument the code. The precise number of cycles per value is a bit over 2.8 for splitmix and exactly 2 for wyhash. Please see my repository for the corresponding code.

Appendix. Some readers are demanding assembly outputs. The splitmix function compiles to 9 assembly instructions

LBB7_2:                                 ; =>This Inner Loop Header: Depth=1
    eor	x12, x9, x9, lsr #30
    mul	x12, x12, x10
    eor	x12, x12, x12, lsr #27
    mul	x12, x12, x11
    eor	x12, x12, x12, lsr #31
    str	x12, [x0], #8
    add	x9, x9, x8
    subs	x1, x1, #1              ; =1
    b.ne	LBB7_2

while the wyhash function compiles to 10 assembly instructions

LBB8_2:                                 ; =>This Inner Loop Header: Depth=1
    mul	x12, x9, x10
    umulh	x13, x9, x10
    eor	x12, x13, x12
    mul	x13, x12, x11
    umulh	x12, x12, x11
    eor	x12, x12, x13
    str	x12, [x0], #8
    add	x9, x9, x8
    subs	x1, x1, #1              ; =1
    b.ne	LBB8_2