Daniel Lemire's blog


ARM vs Intel on Amazon’s cloud: A URL Parsing Benchmark

11 thoughts on “ARM vs Intel on Amazon’s cloud: A URL Parsing Benchmark”

  1. The Intel processors have the crazily good AVX-512 instructions: ARM processors have nothing close

    Graviton3 has Arm SVE, which includes predication (in a 2x 256b setup, so significantly less throughput, though).

    1. AVX-512 is far superior to SVE in terms of how powerful the instructions are. And though the Graviton 3 has 256-bit registers, it looks like most ARM designs are going back to 128-bit registers, which will leave x64 processors with significantly more powerful SIMD instructions than ARM processors.

  2. Laurent says:

    Hello,

    A point worth considering is that hyperthreading is likely enabled on the x86 instances, which will have a negative impact even on single-threaded workloads if the machine is fully used.

    For a fully loaded machine, 2 x86 vCPUs are likely worse than 2 Arm CPUs (until memory bandwidth hits the Arm machine with all its CPUs competing to access it :-).

    I think this might explain why you don’t see the same speedup as opdroid1234, who seems to be running heavily threaded tasks (compilation).

    1. One can always object that the comparison is biased in favour of Intel but in this instance, the x64 node is more expensive than the ARM node.

      Granted, it is possible that Amazon is subsidizing its ARM hardware.

  3. N says:

    Here is an output from an Oracle Cloud Neoverse-N1 Arm64 “Shared CPU” instance:

    [root@instance-20220729-0825 ada]# ./build/benchmarks/bench --benchmark_filter=Ada
    2023-03-02T14:44:17+00:00
    Running ./build/benchmarks/bench
    Run on (4 X 50 MHz CPU s)
    Load Average: 0.61, 0.27, 0.10
    ada spec: Ada follows whatwg/url
    bytes/URL: 73.454545
    curl : OMITTED
    input bytes: 808
    number of URLs: 11

    performance counters: Enabled

    Benchmark Time CPU Iterations UserCounters…

    BasicBench_AdaURL 4152 ns 4145 ns 167020 GHz=3.06742 cycle/byte=25.0557 cycles/url=1.84045k instructions/byte=41.4072 instructions/cycle=1.65261 instructions/ns=5.06924 instructions/url=3.04155k ns/url=600 speed=194.92M/s time/byte=5.1303ns time/url=376.844ns url/s=2.65362M/s

    1. N says:

      Sorry about unreadable copy/paste

      performance counters: Enabled

      Benchmark Time CPU Iterations UserCounters…

      BasicBench_AdaURL 4152 ns 4145 ns 167020

      GHz=3.06742 cycle/byte=25.0557 cycles/url=1.84045k instructions/byte=41.4072 instructions/cycle=1.65261 instructions/ns=5.06924 instructions/url=3.04155k ns/url=600 speed=194.92M/s time/byte=5.1303ns time/url=376.844ns url/s=2.65362M/s

      1. Jeffrey W. Baker says:

        That’s pretty useful: it demonstrates well that Graviton 2 is nothing but a vanilla implementation of Neoverse-N1, and that it can be replicated by any licensee.

        1. In this test, the Oracle system does a bit worse (376.844 ns/url vs 320 ns/url), but that’s a small difference. Furthermore, I used the Graviton 3, not the Graviton 2.

          No matter: your point is correct, I think. Others can no doubt compete against Amazon’s Graviton processors. It just happens that it is easy for me to have access to AWS, so that’s what I use.

          This makes Intel’s stance even more perilous, it seems to me.
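
As a sanity check on the figures in this thread, here is a minimal Python sketch that recomputes the derived counters in N's output from the raw ones, and the ratio between the two per-URL timings quoted in the reply. It assumes the reported CPU time (4145 ns) covers one pass over all 11 URLs, which the numbers are consistent with; the function names are made up for illustration and are not part of the benchmark.

```python
# Figures copied from N's benchmark output above (Oracle Cloud, Neoverse-N1).
BYTES_PER_URL = 73.454545
INPUT_BYTES = 808
NUM_URLS = 11
CPU_TIME_NS = 4145.0  # assumption: CPU time for one pass over all 11 URLs
CYCLES_PER_BYTE = 25.0557
INSTRUCTIONS_PER_BYTE = 41.4072

# Per-URL timings quoted in the reply above.
ORACLE_NS_PER_URL = 376.844
GRAVITON3_NS_PER_URL = 320.0


def derived_metrics():
    """Recompute the derived counters from the raw figures."""
    return {
        "cycles_per_url": CYCLES_PER_BYTE * BYTES_PER_URL,
        "instructions_per_url": INSTRUCTIONS_PER_BYTE * BYTES_PER_URL,
        "instructions_per_cycle": INSTRUCTIONS_PER_BYTE / CYCLES_PER_BYTE,
        "time_per_url_ns": CPU_TIME_NS / NUM_URLS,
        "time_per_byte_ns": CPU_TIME_NS / INPUT_BYTES,
        # 1 byte/ns equals 1000 MB/s.
        "speed_mb_per_s": INPUT_BYTES / CPU_TIME_NS * 1000.0,
    }


def relative_slowdown(slow_ns, fast_ns):
    """Return how many times longer the slower system takes per URL."""
    return slow_ns / fast_ns


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.4f}")
    ratio = relative_slowdown(ORACLE_NS_PER_URL, GRAVITON3_NS_PER_URL)
    print(f"Oracle N1 vs Graviton 3, time per URL ratio: {ratio:.3f}")
```

The recomputed values agree with the reported counters up to rounding of the printed CPU time, and the ratio of roughly 1.18 is the "bit worse" gap mentioned in the reply.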

  4. Alex Petrov says:

    The historical baggage of x86 (a complex instruction set (CISC), heavy logic, hard prediction, mode switches, workarounds for old bugs, and the extra logic and silicon that bring bigger size and latency) makes x86 less efficient.

    Check this article from Erik.
    https://erik-engheim.medium.com/arm-vs-risc-v-vector-extensions-992f201f402f

  5. Oliver Jones says:

    You know what metric would be super-interesting?

    watt hours / URL or maybe joules / URL.

    I understand it’s impossible to estimate VM power consumption. Nevertheless, energy cost is important.

    1. I agree.
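
The metric suggested above is simple to compute once a power figure is available: energy per URL is power divided by throughput. The Python sketch below shows only the unit conversion. The power value is a made-up placeholder (as noted above, VM power draw cannot be measured directly), and the throughput is the url/s figure from the Oracle output earlier in the thread.

```python
def joules_per_url(power_watts, urls_per_second):
    """Energy per parsed URL in joules (1 W = 1 J/s)."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return power_watts / urls_per_second


def watt_hours_per_url(power_watts, urls_per_second):
    """Energy per parsed URL in watt-hours (1 Wh = 3600 J)."""
    return joules_per_url(power_watts, urls_per_second) / 3600.0


if __name__ == "__main__":
    # Hypothetical placeholder: NOT a measured value for any instance type.
    HYPOTHETICAL_POWER_WATTS = 5.0
    # Throughput from the Oracle Neoverse-N1 output (url/s=2.65362M/s).
    URLS_PER_SECOND = 2.65362e6

    j = joules_per_url(HYPOTHETICAL_POWER_WATTS, URLS_PER_SECOND)
    wh = watt_hours_per_url(HYPOTHETICAL_POWER_WATTS, URLS_PER_SECOND)
    print(f"{j:.3e} J/URL, {wh:.3e} Wh/URL (placeholder power figure)")
```

With a real per-core power estimate for each instance type, the same two functions would allow a direct joules/URL comparison between the ARM and x64 nodes.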