Hasty comparison: Skylark (ARM) versus Skylake (Intel)
18 thoughts on “Hasty comparison: Skylark (ARM) versus Skylake (Intel)”
My own results are similar in the sense that jemalloc seems to issue many more system calls (which I find surprising):
https://gist.github.com/lemire/7ca46ac9a28acce3f2654b9ce7a2350e
As I noted on Twitter, I think one way of getting more reproducible results is to use a container setup to pin down your dependencies more precisely.
Yes, you are probably right.
Seems like you may have removed #pragma omp parallel for from the mandelbrot program?
Some people are cross-checking your code against the benchmarks game website and becoming a little confused by that difference, so it may help to say in the blog post what you did or did not change.
Isaac: my code is available. You are correct that I did not go into the details, but I encourage you to review my code. It is implicit that the benchmarks are single-threaded. We are interested in the performance of each core, not of the whole system.
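For readers puzzled by the pragma question above, here is a minimal sketch in C (my own illustration, not the actual mandelbrot code from the benchmarks game) of what removing #pragma omp parallel for changes. With the pragma and -fopenmp, the outer loop is spread across all cores; without it, the same loop runs on one core, which is what a per-core comparison wants to measure.

#include <stdio.h>

int main(void) {
    static double sum[2048];
    /* Deleting the next line (or compiling without -fopenmp) makes the
       program single-threaded; with it, the loop is split across cores. */
    #pragma omp parallel for
    for (int i = 0; i < 2048; i++) {
        double s = 0.0;
        for (int j = 0; j < 200000; j++)
            s += ((i ^ j) & 1) ? 0.25 : -0.25;
        sum[i] = s;
    }
    printf("%f\n", sum[2047]);
    return 0;
}

Compile with gcc -O2 -fopenmp to get the parallel version; dropping either the flag or the pragma gives the single-threaded one.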
Yes, it’s unfortunate that distros don’t install an up-to-date GCC/GLIBC. Worse, both have many useless security features enabled which can severely impact performance. However, it’s relatively easy to build your own GCC and GLIBC, so that’s what I strongly recommend for benchmarking. Use the newly built GCC for building GLIBC. You can statically link any application with GLIBC – this works without needing schroot/docker and avoids dynamic linking overheads.
GLIBC malloc has been improved significantly in the last few years: a fast path was added for small-block handling, and single-threaded paths avoid all atomic operations. I’ve seen the latter speed up malloc-intensive code like Binarytree by 3-5 times on some systems. Note that GLIBC has a low-level hack for x86 which literally jumps over the lock prefix byte of atomic instructions. So the gain is smaller on x86, but it avoids nasty predecode conflicts which appear expensive.
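To give a concrete picture of the kind of workload those fast paths help, here is a crude single-threaded sketch (my own, not taken from the benchmarks game) that churns through small malloc/free pairs; the absolute numbers mean little, but differences between GLIBC versions and between allocators show up clearly on code shaped like this.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Single-threaded small-block malloc/free churn, a rough stand-in
       for allocation-heavy code like binarytrees. */
    enum { N = 1 << 20, ROUNDS = 32 };
    static void *blocks[N];
    clock_t start = clock();
    for (int r = 0; r < ROUNDS; r++) {
        for (int i = 0; i < N; i++)
            blocks[i] = malloc(16 + (i & 48)); /* sizes 16..64 bytes */
        for (int i = 0; i < N; i++)
            free(blocks[i]);
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%.3f s for %d million malloc/free pairs\n", secs, ROUNDS);
    return 0;
}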
Btw, just to add: binarytree shows sub-20-second results on an Arm server with a recent GLIBC. GLIBC is faster than Jemalloc on this benchmark.
GLIBC is faster than Jemalloc on this benchmark.
That’s good to know. My intuition is that, at least on Linux, the GCC stack has great memory allocation.
On my Haswell MacBook the results are closer to your results from Skylark.
create(): 16.143000 ms
bitset_count(b1): 1.414000 ms
iterate(b1): 5.797000 ms
iterate2(b1): 1.704000 ms
iterate3(b1): 3.577000 ms
iterateb(b1): 4.935000 ms
iterate2b(b1): 1.632000 ms
iterate3b(b1): 4.668000 ms
And the profiler shows that most of the time is spent in bzero (which is part of realloc, I suppose):
564.40 ms 100.0% 0 s lemirebenchmark (8498)
559.40 ms 99.1% 0 s start
559.20 ms 99.0% 276.80 ms main
207.50 ms 36.7% 60.50 ms create
138.90 ms 24.6% 138.90 ms _platform_bzero$VARIANT$Haswell
Yes… memory allocations are slow and expensive under macOS compared to Linux. That’s a software issue (evidently).
That’s why I am not convinced that the relative weakness of the Skylark processor that I find is related to the processor itself. It might be how memory allocations are implemented under Linux for ARM.
Yes, it looks like the speed difference is in kernel-mode page-fault handling. A Linux test on Ivy Bridge shows performance similar to Skylake.
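One way to check that hypothesis is to time the first touch of freshly allocated memory, where (on Linux) the pages are mapped lazily and the cost is mostly in the kernel’s fault handler rather than in user-space allocator code. A small sketch, assuming a Linux machine with a couple of GB of free RAM:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t size = (size_t)1 << 30;   /* 1 GiB; large requests are served by mmap */
    char *buf = malloc(size);
    if (buf == NULL) return 1;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memset(buf, 1, size);            /* first touch: roughly one page fault per page */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("first touch of 1 GiB: %.3f s (%.2f GiB/s)\n", secs, 1.0 / secs);
    free(buf);
    return 0;
}

If most of the create() time really is in the kernel, this number should track it and be largely insensitive to which malloc library is linked in.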
Further testing suggests that upgrading glibc might improve performance drastically.
My test platform (Cortex-A53) has glibc 2.28 and Linux alarm 5.0.4-1-ARCH. Results:
create(): 45.415000 ms
bitset_count(b1): 8.408000 ms
iterate(b1): 25.324000 ms
iterate2(b1): 11.455000 ms
iterate3(b1): 30.555000 ms
iterateb(b1): 25.781000 ms
iterate2b(b1): 21.812000 ms
iterate3b(b1): 32.944000 ms
I need to find a way to upgrade my glibc to run my own tests.
People writing their own malloc love to compare against glibc malloc, because it is such an easy target to beat.
You can try LD_PRELOAD with jemalloc or tcmalloc.
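To make the LD_PRELOAD suggestion concrete: you point LD_PRELOAD at libjemalloc.so or libtcmalloc.so and the dynamic linker resolves malloc/free from that library instead of glibc, with no rebuild needed. As an illustration of the mechanism (a sketch of mine, not part of the original benchmark; the file name count_malloc.c is hypothetical), here is a tiny preloadable shim that just counts malloc calls:

/* count_malloc.c
   Build: gcc -O2 -shared -fPIC count_malloc.c -o count_malloc.so -ldl
   Run:   LD_PRELOAD=./count_malloc.so ./your_program
   Preloading jemalloc or tcmalloc works the same way, with LD_PRELOAD
   pointing at their shared library instead of this one. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

static size_t calls; /* not thread-safe; fine for a single-threaded sketch */

void *malloc(size_t size) {
    static void *(*real_malloc)(size_t);
    if (real_malloc == NULL)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    calls++;
    return real_malloc(size);
}

__attribute__((destructor))
static void report(void) {
    fprintf(stderr, "malloc calls: %zu\n", calls);
}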
Probably the kernel version is more important here: memset spends most of its time in the kernel’s page-fault handler.
See my “update 2”. I was able to drastically improve speed by switching to a new memory allocation library.
Interesting. I tried to build with jemalloc on my Cortex-A53, and the create test is slower than with glibc:
45 ms for glibc vs 66 ms for jemalloc
Here are straces for both cases; the syscalls used for memory allocations are different: https://gist.github.com/notorca/b8ab4ef1ef7780db8fa911b83aedac6f