Daniel Lemire's blog


Hasty comparison: Skylark (ARM) versus Skylake (Intel)

18 thoughts on “Hasty comparison: Skylark (ARM) versus Skylake (Intel)”

  1. Cyril says:

    On my Haswell MacBook, the results are closer to your Skylark results.

    create(): 16.143000 ms
    bitset_count(b1): 1.414000 ms
    iterate(b1): 5.797000 ms
    iterate2(b1): 1.704000 ms
    iterate3(b1): 3.577000 ms
    iterateb(b1): 4.935000 ms
    iterate2b(b1): 1.632000 ms
    iterate3b(b1): 4.668000 ms

    And the profiler shows that most of the time is spent in bzero (which I suppose is part of realloc):

    564.40 ms 100.0% 0 s lemirebenchmark (8498)
    559.40 ms 99.1% 0 s start
    559.20 ms 99.0% 276.80 ms main
    207.50 ms 36.7% 60.50 ms create
    138.90 ms 24.6% 138.90 ms _platform_bzero$VARIANT$Haswell
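
    Here is a minimal sketch of what I think is happening (not the actual benchmark code): the first pass over freshly allocated memory faults every page into the kernel, which must hand back zeroed pages, while a second pass over the same memory runs at plain memory speed.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define N (64 * 1024 * 1024) /* 64 MB: large enough to be served by mmap */

    static double now_ms(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    int main(void) {
      unsigned char *p = malloc(N);
      double t0 = now_ms();
      memset(p, 1, N); /* first touch: every page faults into the kernel */
      double t1 = now_ms();
      memset(p, 2, N); /* second touch: pages are already mapped */
      double t2 = now_ms();
      printf("first pass:  %.3f ms\n", t1 - t0);
      printf("second pass: %.3f ms\n", t2 - t1);
      free(p);
      return 0;
    }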

    1. Yes… memory allocations are slow and expensive under macOS compared to Linux. That’s a software issue (evidently).

      That’s why I am not convinced that the relative weakness I find in the Skylark is related to the processor itself. It might be how memory allocations are implemented under Linux for ARM.

      1. Cyril says:

        Yes, it looks like the speed difference is in kernel-mode page-fault handling. A Linux test on Ivy Bridge shows performance similar to Skylake.
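
        One way to check (a Linux-specific sketch, not from the benchmark): prefault the pages with MAP_POPULATE. If page-fault handling dominates, the cost should move out of the first memset and into the mmap call itself.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <time.h>

        #define N (64 * 1024 * 1024)

        static double now_ms(void) {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
        }

        int main(void) {
          /* demand paging: pages are faulted in on first touch */
          char *a = mmap(NULL, N, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          double t0 = now_ms();
          memset(a, 1, N);
          double t1 = now_ms();
          /* MAP_POPULATE prefaults all pages inside the mmap call */
          char *b = mmap(NULL, N, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
          double t2 = now_ms();
          memset(b, 1, N);
          double t3 = now_ms();
          printf("memset after plain mmap:   %.3f ms\n", t1 - t0);
          printf("memset after MAP_POPULATE: %.3f ms\n", t3 - t2);
          munmap(a, N);
          munmap(b, N);
          return 0;
        }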

        1. Further testing suggests that upgrading glibc might improve performance drastically.

          1. Cyril says:

            My Cortex-A53 test platform has glibc 2.28 and Linux 5.0.4-1-ARCH (Arch Linux ARM). Results:

            create(): 45.415000 ms
            bitset_count(b1): 8.408000 ms
            iterate(b1): 25.324000 ms
            iterate2(b1): 11.455000 ms
            iterate3(b1): 30.555000 ms
            iterateb(b1): 25.781000 ms
            iterate2b(b1): 21.812000 ms
            iterate3b(b1): 32.944000 ms

            1. I need to find a way to upgrade my glibc so I can run my own tests.

              1. Jörn Engel says:

                People writing their own malloc love to compare against glibc malloc, because it is such an easy target to beat.

                You can try LD_PRELOAD with jemalloc or tcmalloc.
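
                For example, with a small malloc-heavy test program (library paths below are illustrative; adjust them for your distro):

                /* alloc_bench.c -- compare allocators without recompiling:
                 *   gcc -O2 alloc_bench.c -o alloc_bench
                 *   ./alloc_bench                                      (glibc malloc)
                 *   LD_PRELOAD=/usr/lib/libjemalloc.so ./alloc_bench   (jemalloc)
                 *   LD_PRELOAD=/usr/lib/libtcmalloc.so ./alloc_bench   (tcmalloc)
                 */
                #include <stdlib.h>

                int main(void) {
                  for (int i = 0; i < 10 * 1000 * 1000; i++) {
                    char *p = malloc(64);
                    *(volatile char *)p = 1; /* keep the pair from being optimized away */
                    free(p);
                  }
                  return 0;
                }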

                1. Cyril says:

                  Probably the kernel version matters more here; memset spends most of its time in the kernel’s page-fault handler.

                  1. See my “update 2”. I was able to drastically improve speed by switching to a new memory allocation library.

                    1. Cyril says:

                      Interesting. I tried to build with jemalloc on my Cortex-A53 and the create test is slower than with glibc:

                      45 ms for glibc vs. 66 ms for jemalloc.
                      Here are straces for both cases; the syscalls used for memory allocation differ: https://gist.github.com/notorca/b8ab4ef1ef7780db8fa911b83aedac6f

  • My own results are similar in the sense that jemalloc seems to issue many more system calls (which I find surprising):

    https://gist.github.com/lemire/7ca46ac9a28acce3f2654b9ce7a2350e

  • Ed Vielmetti says:

    As I noted on Twitter, I think one way of getting more reproducible results is to use a container setup to make your dependencies more specific.

    1. Yes, you are probably right.

  • Isaac Gouy says:

    Seems like you may have removed #pragma omp parallel for from the mandelbrot program?

    Some people are cross-checking your code against the benchmarks game website and becoming a little confused by that difference, so it may help to state in the blog post what you did or did not change.

    1. Isaac: my code is available. You are correct that I did not go into the details, but I encourage you to review my code. It is implicit that the benchmarks are single-threaded. We are interested in the performance of each core, not of the whole system.

  • Wilco says:

    Yes, it’s unfortunate that distros don’t install an up-to-date GCC/GLIBC. Worse, both have many useless security features enabled which can severely impact performance. However, it’s relatively easy to build your own GCC and GLIBC, so that’s what I strongly recommend for benchmarking. Use the newly built GCC to build GLIBC. You can statically link any application with GLIBC; this works without needing schroot/docker and avoids dynamic-linking overheads.

    GLIBC malloc has been improved significantly in the last few years: a fast path was added for small blocks, and the single-threaded paths avoid all atomic operations. I’ve seen the latter speed up malloc-intensive code like binarytrees by 3-5 times on some systems. Note that GLIBC has a low-level hack for x86 which literally jumps over the lock-prefix byte of atomic instructions. So the gain is smaller on x86, but it avoids nasty predecode conflicts which appear expensive.
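
    As a rough illustration of the kind of malloc-intensive code I mean, here is a stripped-down binary-tree kernel in the spirit of that benchmark (a sketch, not the benchmarks-game code):

    #include <stdlib.h>

    typedef struct node {
      struct node *left, *right;
    } node;

    static node *build(int depth) {
      node *n = malloc(sizeof(node)); /* one small allocation per node */
      if (depth > 0) {
        n->left = build(depth - 1);
        n->right = build(depth - 1);
      } else {
        n->left = n->right = NULL;
      }
      return n;
    }

    static void destroy(node *n) {
      if (n == NULL) return;
      destroy(n->left);
      destroy(n->right);
      free(n);
    }

    int main(void) {
      /* roughly 2 million allocations and frees per iteration */
      for (int i = 0; i < 20; i++) {
        node *root = build(20);
        destroy(root);
      }
      return 0;
    }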

    1. Wilco says:

      By the way, just to add: binarytrees runs in under 20 seconds on an Arm server with a recent GLIBC. GLIBC is faster than jemalloc on this benchmark.

      1. “GLIBC is faster than jemalloc on this benchmark.”

        That’s good to know. My intuition is that, at least on Linux, the GCC stack has great memory allocation.