Typo: "I am excited because I think it will drive other laptop makes to rethink their designs" ("makes" should presumably be "makers").
Bert says:
This is true: whether you are an x86 loyalist or indifferent, the old assumptions are all being turned on their heads. I think we will see even more progress from AMD and Intel now that Apple is here to shake up the rankings.
Bob says:
I wonder when we will see laptops supporting ARM SVE (the NEON successor).
https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
Are you familiar with Arm SVE2, Daniel? Are these SIMD instructions only available on Neoverse server cores like Amazon’s Graviton?
I do not think that SVE, let alone SVE2, is publicly available.
AFAIK, SVE is currently available on the Fugaku supercomputer. However, you can’t exactly get one at Newegg.
According to the roadmap published here, it appears the Neoverse-V1 and Neoverse-N2 will be the first two designs from ARM itself to sport SVE. This article from AnandTech corroborates what I just said.
SVE2 doesn’t explicitly show up on any of those public roadmap slides, so it’s probably a couple years out—at least in cores designed by ARM. Although, as AnandTech points out, “SVE” in the slide may actually refer to SVE2 in some cases.
ARM first disclosed SVE several years ago, but is only just now starting to make SVE-capable cores. I wouldn’t be surprised if we had to wait another few years to buy an end product that offers SVE2.
Even though the Neoverse-V1 is “available now,” that doesn’t mean I can go buy a machine sporting one. It means silicon vendors can license and start building chips around it. It’ll be some time before you see volume product.
Why such slow adoption? Wide SIMD in the CPU just wasn’t that important to cell phones. It’s too power hungry, and it was hard to keep the ARM CPUs fed. Dedicated accelerators were a better fit in that product space, particularly from an energy efficiency standpoint.
In a workstation or server, you have a different set of constraints. And now we have some decent interconnects.
Challenges remain: it’s one thing to plop down the functional units for these wide vectors. Managing power—both peak and transient—is another kettle of fish.
never_released says:
Neoverse-V1 is ARMv8.4-A + 2x 256-bit SVE (and was finished this year).
Neoverse-N2 is ARMv8.5-A + 2x 128-bit SVE2 (and will be in finished form next year).
Of course, that means finished on Arm’s side, so we should expect Neoverse-V1 designs in 2021 and Neoverse-N2 designs in 2022.
That is great news.
It looks like there is compiler and emulator support for SVE/SVE2 but the only available silicon is the Fujitsu A64FX (pdf) with SVE.
You have identified an area where Apple/Amazon Arm64 silicon is playing catch-up to x64 on both desktop and server: vectorized SIMD algorithms.
Maynard Handley says:
Calling this catch-up is misleading. SVE/SVE2 is not just wider NEON; it is a rethinking of how to design a vector ISA for much better compiler auto-vectorization. (A very rough figure of merit: 128-bit-wide SVE would run a “broad suite” of auto-vectorized code about 1.3x faster than NEON.)
If we want to use these sorts of terms, leapfrogging would be more appropriate.
I think you misunderstand RAD’s comment. My feeling is that he was basing his statement on my (erroneous) earlier results.
I think that there is wide agreement that SVE is exciting new tech.
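To make Maynard’s point concrete, here is a minimal sketch of a vector-length-agnostic SVE loop using the Arm C Language Extensions (ACLE) intrinsics. It is illustrative only (the function name and compile flags are mine, not from the post); the same source runs unchanged on 128-, 256- or 512-bit SVE hardware, which is exactly what fixed-width NEON code cannot do:

// Illustrative vector-length-agnostic SVE loop (ACLE intrinsics).
// Compile with something like: g++ -O3 -march=armv8-a+sve add.cpp
#include <arm_sve.h>
#include <cstdint>

void add_arrays(const float *a, const float *b, float *c, uint64_t n) {
  for (uint64_t i = 0; i < n; i += svcntw()) {      // svcntw() = floats per vector, unknown at compile time
    svbool_t pg = svwhilelt_b32_u64(i, n);          // predicate also covers the loop tail
    svfloat32_t va = svld1_f32(pg, a + i);          // predicated loads
    svfloat32_t vb = svld1_f32(pg, b + i);
    svst1_f32(pg, c + i, svadd_f32_m(pg, va, vb));  // predicated add and store
  }
}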
RAD says:
SVE2 looks great but we are not going to see it in mainstream silicon until the next generation of Apple and Amazon chips at best. In every other area, the Apple M1 and Amazon Graviton 2 seem to offer the best bang-for-the-buck over x64. Until Neoverse V1/N2 silicon is available, I don’t think we will see a business case for a scale-up in-memory column store like SAP HANA moving away from Intel.
Benchmarks using Daniel’s EWAH and/or Roaring Bitmap projects should be able to approximate when Arm ports make sense. We need more real-world SIMD-centric benchmarks; maybe Lucene/ElasticSearch, Apache Arrow, DuckDB, ClickHouse?
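As a rough sketch of the kind of micro-benchmark this suggests, here is a minimal CRoaring (Roaring Bitmap C API) intersection loop; the data, sizes and repetition count are placeholders of my own, not a finished benchmark:

// Rough Roaring-bitmap intersection micro-benchmark sketch (CRoaring C API).
// Link against CRoaring; the data and iteration counts below are placeholders.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include "roaring.h"  // CRoaring amalgamation header (or <roaring/roaring.h> with an installed package)

int main() {
  roaring_bitmap_t *a = roaring_bitmap_create();
  roaring_bitmap_t *b = roaring_bitmap_create();
  for (uint32_t i = 0; i < 10000000; i += 2) roaring_bitmap_add(a, i);
  for (uint32_t i = 0; i < 10000000; i += 3) roaring_bitmap_add(b, i);

  auto start = std::chrono::steady_clock::now();
  uint64_t card = 0;
  for (int run = 0; run < 100; run++) {
    card = roaring_bitmap_and_cardinality(a, b);  // intersection cardinality (vectorized internally)
  }
  double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
  std::printf("cardinality %llu, %.1f intersections/s\n", (unsigned long long)card, 100.0 / secs);

  roaring_bitmap_free(a);
  roaring_bitmap_free(b);
  return 0;
}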
A somewhat comparable system to DuckDB is Hyrise. We just compared the performance of the M1 chip. It’s impressive…
https://twitter.com/hyrise_db/status/1350179043375804420
Check this comment: https://news.ycombinator.com/item?id=25409535
“This article has a mistake. I actually ran the benchmark, and it doesn’t return a valid result on arm64 at all. The posted numbers match mine if I run it under Rosetta. Perhaps the author has been running their entire terminal in Rosetta and forgot.”
It seems that A64FX is now being sold, but not sure how feasible that is.
Given the fact that NVIDIA is buying ARM, there is a non-negligible chance that they will change licensing policies…
However, maybe the idea of successful ARM laptops will push somebody to try the same stunt with MIPS.
This could be an extremely interesting development.
Veedrac says:
You should read the HN comments on this post, which claim you made an error generating these numbers, and that the correct values for the M1 are 6.6 GB/s and 16.5 GB/s.
https://news.ycombinator.com/item?id=25408853
I have not personally verified, but that sounds more in line with what the hardware can do.
Veedrac: they were correct. I made a mistake.
Glenn says:
Some people over on Hacker News seem to think you ran your test with Rosetta on, the x86 emulation mode. Also, the library seems to have difficulty properly detecting ARM. When these two things are corrected, the benchmarks are pretty different.
Otherwise, a thoughtful benchmark!
https://news.ycombinator.com/item?id=25408853
Glenn: yes. They are correct. I made a mistake. The blog post has been updated.
Nathan Kurz says:
I know you don’t usually read it, and I don’t know why they didn’t leave a comment here, but there are a few comments on HN that suggest you might have a significant bug in your benchmark: https://news.ycombinator.com/item?id=25408853.
The summary would seem to be that ARM64 isn’t being properly detected by the macros in the simdjson code, resulting in the executable using the “generic fallback implementation”. The simple fix is to add an explicit “-DSIMDJSON_IMPLEMENTATION_ARM64=1” to the compilation. With this, one of the commenters got “minify” at 6.64796 GB/s, and “validate” at 16.4721 GB/s, concluding “That puts Intel at 1.17x and 1.15x for this specific test”.
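As an aside (not part of the original exchange): if you want to confirm at runtime which kernel simdjson actually selected, recent simdjson releases expose this via get_active_implementation(); in the v0.7 era the equivalent was, I believe, the simdjson::active_implementation global. A minimal check might look roughly like this:

// Quick check of which SIMD kernel simdjson picked at runtime.
#include <iostream>
#include "simdjson.h"

int main() {
  // Prints e.g. "arm64", "haswell", "westmere" or the scalar "fallback".
  // Seeing "fallback" on an M1 means the NEON kernel was not detected or compiled in.
  std::cout << "active implementation: "
            << simdjson::get_active_implementation()->name() << " ("
            << simdjson::get_active_implementation()->description() << ")\n";
  return 0;
}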
Carl says:
I believe that the big issue noted in the HN thread is that the Arm benchmarks appear to be using x86 code running under Rosetta. With real ARM64 code and more optimisation, this gets the benchmarks to minify: 6.73381 GB/s and validate: 17.8548 GB/s, so 1.16x and 1.06x.
That is correct. I mistakenly ran the M1 benchmarks under Rosetta 2. It is incredibly transparent…
Thanks Nate. I don’t follow hacker news, but the mistake was pointed out to me on Twitter.
I have revised the blog post. I was wrong.
I am happy to admit it.
Nathan Kurz says:
Hi Daniel,
I see now that you got lots of notifications besides me. Sorry for adding to the pile. To partially make up for it, I ran your benchmark on a MacBook Air with Ice Lake for a more direct comparison:
% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i5-1030NG7 CPU @ 1.10GHz
% ./benchmark
simdjson v0.7.0
Detected the best implementation for your machine: haswell (Intel/AMD AVX2)
loading twitter.json
minify : 7.47081 GB/s
validate: 34.4244 GB/s
I was hoping that we might be able to see the effect of AVX-512, but I see now that the simdjson code doesn’t yet support it. If you happen to have an unreleased version that has it, I’d be happy to test and report.
Sam says:
Comment from HN you might be interested in:
This article has a mistake. I actually ran the benchmark, and it doesn’t return a valid result on arm64 at all. The posted numbers match mine if I run it under Rosetta. Perhaps the author has been running their entire terminal in Rosetta and forgot.
As I write this comment, the article’s numbers are: (minify: 4.5 GB/s, validate: 5.4 GB/s). These almost exactly match my numbers under Rosetta (M1 Air, no system load):
% rm -f benchmark && make && file benchmark && ./benchmark
c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
benchmark: Mach-O 64-bit executable arm64
minify : 1.02483 GB/s
validate: inf GB/s
% rm -f benchmark && arch -x86_64 make && file benchmark && ./benchmark
c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
benchmark: Mach-O 64-bit executable x86_64
minify : 4.44489 GB/s
validate: 5.3981 GB/s
Maybe this article is a testament to Rosetta instead, which is churning out numbers reasonable enough you don’t suspect it’s running under an emulator.
Update, I re-ran with messe’s fix (from downthread):
% rm -f benchmark && make && file benchmark && ./benchmark
c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11 -DSIMDJSON_IMPLEMENTATION_ARM64=1
benchmark: Mach-O 64-bit executable arm64
minify : 6.64796 GB/s
validate: 16.4721 GB/s
That puts Intel at 1.17x and 1.15x for this specific test, not the 1.8x and 3.5x claimed in the article.
Also I looked at the generated NEON for validateUtf8 and it doesn’t look very well interleaved for four execution units at a glance. I bet there’s still M1 perf on the table here.
https://news.ycombinator.com/user?id=bacon_blood
In the discussion about this post at Hacker News, it has been pointed out that the stated numbers here appear to be based on running code compiled for x86 under Apple’s Rosetta translation. When compiled natively for ARM, the difference is apparently much smaller.
Jacob says:
Thanks for the quick update on a Sunday afternoon!
I’m looking forward to seeing how you can best make use of the new hardware.
Bob Dobbs says:
You’ve known for over an hour that your benchmark was grossly flawed, and that your results are farcical.
This is embarrassing. If you had any credibility at all you’d at least put a mea culpa at the top, or if you’re cowardly, just delete this, full stop.
The “critics”, it turns out, were absolutely right. You wrote some lazy nonsense, and when called on it, made even worse lazy nonsense. Ouch.
You’ve known for over an hour that your benchmark was grossly flawed, and that your results are farcical.
I edited my code inside Visual Studio Code. I opened a terminal within Visual Studio Code and compiled there, not realizing that Visual Studio Code itself was running under Rosetta 2. Whether it is “a gross” mistake is up for debate. I think it was an easy mistake to make…
It is Sunday here and I was with my family. I saw on Twitter that there was a mistake, and so I replied to the person that raised the issue that I would revisit the numbers. I did, a few hours later. Again: it is Sunday and I was with my family. The post was literally fixed the same day.
Yes. I made a mistake. I admit it. I also corrected it as quickly as possible.
This is embarrassing. If you had any credibility at all you’d at least put a mea culpa at the top, or if you’re cowardly, just delete this, full stop.
I added a paragraph in this blog post that says: “(This blog post has been updated after I corrected a methodological mistake. I was running the Apple M1 processor under x64 emulation.)”
The older blog post contains a note that describes how I was in error.
How am I being cowardly?
At no point did I try to hide that I made a mistake. In fact, I state it openly.
The “critics”, it turns out, were absolutely right. You wrote some lazy nonsense, and when called on it, made even worse lazy nonsense. Ouch.
I was wrong about SIMD performance on the Apple M1.
I get stuff wrong sometimes, especially when I write a quick blog post on a Sunday morning… But even when I am super careful, I sometimes make mistakes. That’s why I always encourage people to challenge me, to revisit my numbers and so forth.
Tim Miller says:
Daniel, thank you for making simdjson available to the world. I think others would share my opinion that while rude, aggressive, and accusatory posts are unfortunately to be expected on the internet, no response is required. I hope this won’t discourage you from posting in the future. Don’t let the trolls get you down!
Maynard Handley says:
Come on, dude, that’s not necessary. There are few enough academics investigating ALL aspects of performance across a range of real-life hardware.
Let he who is without sin cast the first stone; mote and beams; those remain wise words.
Nitsan Wakart says:
“As for literary criticism in general: I have long felt that any reviewer who expresses rage and loathing for a novel or a play or a poem is preposterous. He or she is like a person who has put on full armor and attacked a hot fudge sundae or a banana split.”
― Kurt Vonnegut
Paul says:
Wow Bob, or is it Karen? Talk about an over-the-top response, and rude AF…
Roman Gershman says:
Daniel, your work and research have an amazing impact on the engineering community.
You do not have to answer angry trolls that do not know you but gladly use any opportunity to humiliate and laugh at someone.
Anthony Cayetano says:
Because the M1 (and, I’m assuming, its future iterations) is a SoC, data processing that needs SIMD (matrices, vectors, etc.) is delegated to other specialised units such as the on-die GPU and Neural Processor. IIRC, the M1 features new SIMD instructions that complement both the GPU and the Neural Unit; to what degree they are purposed for that depends on how Apple uses them in their Metal 2 API. This type of distributed processing is the new modern approach, and I believe it’s the way to go.
Paul says:
To be completely fair, Kaby Lake processors were released in August 2016, making it a roughly four-year-old design compared to the just-released M1. Ice Lake processors are also a year old, and as the test done by Nathan Kurz above shows, the Ice Lake processor does a much better job: its validate result of roughly 34.4 GB/s exceeds the latest M1 numbers.
I wonder how much the Intel performance is impacted by the Meltdown and Spectre patches. Ice Lake solved some, but not all, of those issues.
That’s why I stress the dates of the MacBook. It is not “fair”.
But it is more complicated even than that because the M1 uses less power than comparable Intel processors. So you’d want to account for energy use as well… something I do not do.
Matthaus Woolard says:
Since the GPU and CPU share the same memory, what is the latency impact of dispatching SIMD-heavy work to the GPU on the M1?
Brandon says:
The M1 performed much better than I expected in SIMD benchmarks, and the difference between 128-bit and 256-bit vector widths was the reason I was initially skeptical about Apple’s performance claims. But looking at the benchmarks makes me certain that Apple’s MacBooks are headed in the right direction.
I’m excited for the next iteration of Apple Silicon.
McD says:
A late comment, but did you use the Accelerate Framework? Apparently this taps additional SIMD units not available directly and can have a significant performance impact.
@McD I am not very familiar with the Accelerate Framework. It appears to be a set of specialized functions.
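For readers curious what McD is referring to, here is a minimal, illustrative use of Accelerate’s vDSP on macOS. This is my own sketch, not something used by the benchmark discussed above:

// Minimal illustration of Apple's Accelerate framework (vDSP) on macOS.
// Build with something like: clang++ -O3 vadd.cpp -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <cstdio>
#include <vector>

int main() {
  const vDSP_Length n = 1 << 20;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
  // c[i] = a[i] + b[i], handled by Accelerate's tuned SIMD kernels.
  vDSP_vadd(a.data(), 1, b.data(), 1, c.data(), 1, n);
  std::printf("c[0] = %.1f, c[n-1] = %.1f\n", c[0], c[n - 1]);
  return 0;
}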