I don’t benchmark on laptops, but here is what I get on my Haswell server (i7-4770):
$ g++ --version
g++ (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
$ g++ -std=c++11 -O2 -fno-tree-vectorize -o fastestrng fastestrng.cpp && ./fastestrng
wyrng 0.000431 s
bogus:14643616649108139168
splitmix64 0.000587 s
bogus:18305447471597396837
lehmer64 0.000569 s
bogus:16285628012437095220
lehmer64 (3) 0.000392 s
bogus:15342908890590157271
lehmer64 (3) 0.000379 s
bogus:18372309517275774290
Next we do random number computations only, doing no work.
wyrng 0.000442 s
bogus:15649925860098344998
splitmix64 0.000567 s
bogus:15901732380406292985
lehmer64 0.000568 s
bogus:6253507633689833227
lehmer64 (2) 0.000459 s
bogus:17457190375316347997
lehmer64 (3) 0.000361 s
bogus:4305661330232405915
Email me if you want access to it.
How expensive is it to compute the high bits of a 64×64-bit product on this ARM server? I mean, there’s a 20× relative difference in performance…
What type of ARM? “Skylarke ARM” doesn’t turn up many hits – mostly stuff about a nice farm that does weddings.
You can find more info on that specific ARM implementation here: https://en.wikichip.org/wiki/apm/microarchitectures/skylark
There was a typo in my post. It is Skylark… https://en.wikichip.org/wiki/apm/microarchitectures/skylark
I can give you access to the box.
I don’t have exact numbers for Skylark. On a Cortex A57 processor, to compute the most significant 64 bits of the 128-bit product of two 64-bit values, you must use the multiply-high instructions (umulh and smulh), but they have a latency of six cycles and they prevent the execution of other multi-cycle instructions for an additional three cycles.
http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf
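For reference, the portable way to get those high bits in C++ is through the compiler’s 128-bit integer type; on AArch64, GCC and Clang compile this to a single umulh. A minimal sketch (`mulhi64` is a name I am introducing here, not something from the post):

```cpp
#include <cstdint>

// Most significant 64 bits of the 128-bit product of two 64-bit values.
// On AArch64 this compiles to a single umulh instruction.
uint64_t mulhi64(uint64_t a, uint64_t b) {
  return (uint64_t)(((__uint128_t)a * b) >> 64);
}
```

On x86-64 the same function compiles to a single mul (or mulx), where the high half is produced as a by-product in a second register.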
Daniel, if you have access to an M1, try the performance there, along with looking at the assembly.
Of course there is the basic “M1 is fast” stuff, that’s not interesting.
What’s interesting is that the 128-bit multiply should be coded as a UMULH and MUL instruction pair. Apple has a more or less generic facility for supporting instructions with multiple destination registers, which means that, in principle, these two multiplies could be fused and thus executed faster than two successive independent multiply-type operations.
Does Apple in fact do this? Is 128-bit multiplication considered a common enough operation to special-case? Who knows? But they do, of course, special-case and fuse several of the other obvious crypto instruction pairs.
See https://lemire.me/blog/2021/03/17/apples-m1-processor-and-the-full-128-bit-integer-product/
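One rough way to probe for such fusion (the methodology below is my own sketch, not taken from the linked post): time a long dependent chain of full 128-bit products against a dependent chain of plain 64-bit multiplies. If the MUL/UMULH pair were fused, the per-step cost of the 128-bit chain should come out close to that of the 64-bit chain.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Dependent chain of full 128-bit products: each step consumes the
// previous result, so per-step time approximates the latency of the
// MUL + UMULH pair (plus one xor to fold the halves together).
uint64_t chain128(uint64_t x, uint64_t n) {
  for (uint64_t i = 0; i < n; i++) {
    __uint128_t p = (__uint128_t)x * (x | 1);  // |1 keeps the chain nonzero
    x = (uint64_t)p ^ (uint64_t)(p >> 64);
  }
  return x;
}

// Baseline: dependent chain of plain 64-bit multiplies.
uint64_t chain64(uint64_t x, uint64_t n) {
  for (uint64_t i = 0; i < n; i++) {
    x = x * (x | 1);
  }
  return x;
}

// Prints the approximate per-step latency of each chain.
void probe_fusion(uint64_t n) {
  auto t0 = std::chrono::steady_clock::now();
  uint64_t a = chain128(0x9E3779B97F4A7C15ULL, n);
  auto t1 = std::chrono::steady_clock::now();
  uint64_t b = chain64(0x9E3779B97F4A7C15ULL, n);
  auto t2 = std::chrono::steady_clock::now();
  double ns128 = std::chrono::duration<double, std::nano>(t1 - t0).count() / (double)n;
  double ns64  = std::chrono::duration<double, std::nano>(t2 - t1).count() / (double)n;
  std::printf("128-bit product chain: %.2f ns/step (check %llu)\n", ns128, (unsigned long long)a);
  std::printf(" 64-bit multiply chain: %.2f ns/step (check %llu)\n", ns64, (unsigned long long)b);
}
```

Call `probe_fusion(100000000)` from `main` and compile with `-O2`; the printed checksums keep the compiler from discarding the loops.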
Results from a Pine64 with a Cortex A53:
wyrng 0.013576 s
bogus:14643616649108139168
splitmix64 0.010964 s
bogus:18305447471597396837
And here are the numbers from my laptop, an Intel i5-4250U.
wyrng 0.000929 s
bogus:15649925860098344998
splitmix64 0.000842 s
bogus:15901732380406292985
I have enough hardware to test; next I want to try a 64-bit Atom. But my point here is that the performance of such things does not really depend on the instruction set (ARMv8 vs. amd64); it depends on the internal CPU microarchitecture. The Cortex A53 and the Apple A11 are both ARMv8 CPUs, but on the A11 wyrng is faster and on the A53 splitmix64 is faster.
I agree.
Another important point in such comparisons is the compiler. On my Cortex A53, lehmer64 (2) is fastest with gcc, and lehmer64 (3) is fastest with clang. It looks like gcc generates a full 128×128-bit multiplication, while clang generates a 128×64-bit one.
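The distinction matters because the lehmer64 step multiplies a 128-bit state by a 64-bit constant, so a compiler that keeps the multiplier in 64 bits needs only a 128×64-bit product. A sketch of the step (the multiplier is the one commonly used for lehmer64; the seed below is an arbitrary odd value, not from the post):

```cpp
#include <cstdint>

// Lehmer/MCG step: 128-bit state times a 64-bit odd constant; the output
// is the high half of the state. Because the multiplier fits in 64 bits,
// only a 128x64-bit product is required, not a full 128x128 one.
static __uint128_t lehmer64_state = 0x853c49e6748fea9bULL;  // arbitrary odd seed

uint64_t lehmer64() {
  lehmer64_state *= 0xda942042e4dd58b5ULL;
  return (uint64_t)(lehmer64_state >> 64);
}
```

On AArch64 the 128×64 form needs MUL + UMULH plus one MADD for the high limb (three multiply-type instructions), whereas a truncated 128×128 product needs four.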
And on an iPhone X with the Apple A11, wyrng is faster:
wyrng 0.000563 s
bogus:12179112671541558566
splitmix64 0.000728 s
bogus:808196752756138662