26th March 2019, 12 min read

Hasty comparison: Skylark (ARM) versus Skylake (Intel)

18 thoughts on “Hasty comparison: Skylark (ARM) versus Skylake (Intel)”

Cyril says:

March 27, 2019 at 8:19 am

On my Haswell Macbook the results are closer to your results form Skylark.

create(): 16.143000 ms bitset_count(b1): 1.414000 ms iterate(b1): 5.797000 ms iterate2(b1): 1.704000 ms iterate3(b1): 3.577000 ms iterateb(b1): 4.935000 ms iterate2b(b1): 1.632000 ms iterate3b(b1): 4.668000 ms

And profiler shows, that most of the time is spent in bzero (which is part of realloc I suppose)

564.40 ms 100.0% 0 s lemirebenchmark (8498) 559.40 ms 99.1% 0 s start 559.20 ms 99.0% 276.80 ms main 207.50 ms 36.7% 60.50 ms create 138.90 ms 24.6% 138.90 ms _platform_bzero$VARIANT$Haswell
1. Daniel Lemire says:
  
  March 27, 2019 at 12:09 pm
  
  Yes… memory allocations are slow and expensive under macOS compared to Linux. That’s a software issue (evidently).
  
  That’s why I am not convinced that the relative weakness of the Skylark processor that I find is related to the processor. It might how memory allocations are implemented under Linux for ARM.
  1. Cyril says:
    
    March 27, 2019 at 10:18 pm
    
    Yes, this looks like speed difference is in kernel mode page faults handling. Linux test on Ivy Bridge shows performance similar to Skylake.
    1. Daniel Lemire says:
      
      March 27, 2019 at 10:54 pm
      
      Further testing suggests that upgrading glibc might improve performance drastically.
      1. Cyril says:
        
        March 28, 2019 at 10:03 am
        
        My test platform on CortexA53 have glibc 2.28 and Linux alarm 5.0.4-1-ARCH. Results:
        
        create(): 45.415000 ms bitset_count(b1): 8.408000 ms iterate(b1): 25.324000 ms iterate2(b1): 11.455000 ms iterate3(b1): 30.555000 ms iterateb(b1): 25.781000 ms iterate2b(b1): 21.812000 ms iterate3b(b1): 32.944000 ms
        
        Daniel Lemire says:
        
        March 28, 2019 at 1:11 pm
        
        I need to find a way to upgrade my glibc somehow to run my own tests.
        
        Jörn Engel says:
        
        March 28, 2019 at 5:11 pm
        
        People writing their own malloc love to compare against glibc malloc, because it is such an easy target to beat.
        
        You can try LD_PRELOAD with jemalloc or tcmalloc.
        
        Cyril says:
        
        March 28, 2019 at 6:41 pm
        
        Probably kernel version is more important here, memset spends most time in kernel page fault function.
        
        Daniel Lemire says:
        
        March 28, 2019 at 7:16 pm
        
        See my “update 2”. I was able to drastically improve speed by switching to a new memory allocation library.
        
        Cyril says:
        
        March 29, 2019 at 10:55 am
        
        Interesting. I tried to build with jmalloc on my CortexA53 and create test is slower, than with glibc:
        
        45 ms for glibc vs 66 for jmalloc
        Here is straces for both cases, syscalls used for memory allocations are different: https://gist.github.com/notorca/b8ab4ef1ef7780db8fa911b83aedac6f