Parsing JSON using SIMD instructions on the Apple A12 processor
17 thoughts on “Parsing JSON using SIMD instructions on the Apple A12 processor”
Why not run the tests with the Skylake clocked at 2.5 GHz instead? The test would still have caveats, but at least the numbers would be “real”.
You could rescale everything from 3.7 GHz to 2.5 GHz if you’d like. Given that we are essentially CPU bound, numbers scale with frequency linearly (to a good degree).
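For concreteness, the rescaling is just a linear adjustment; here is a trivial sketch (the 2.2 GB/s input is a placeholder, not a measurement from the article):

```cpp
// Sketch of a linear frequency rescaling, assuming the workload is CPU bound.
// The 2.2 GB/s figure is hypothetical, purely to show the arithmetic.
#include <cstdio>

int main() {
  double measured_at_3_7_ghz = 2.2;                              // hypothetical GB/s on Skylake at 3.7 GHz
  double predicted_at_2_5_ghz = measured_at_3_7_ghz * (2.5 / 3.7);
  std::printf("predicted at 2.5 GHz: %.2f GB/s\n", predicted_at_2_5_ghz); // about 1.49 GB/s
  return 0;
}
```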
Actually, the result as-is is rather nice.
Apple has 3 SIMD units in the more recent chips; Intel has 2 SSE or AVX units. So you would expect, naively (and assuming perfect independence of all the important instructions), Apple at 2.5 GHz to more or less match Skylake at 1.5 × 2.5 GHz…
The next step would be Apple with SVE, but SVE (like first round AVX) is primarily FP, you’d really want SVE2.
My guess is that this year we get SVE (as two units, 256 bits wide), with SVE2 in two years. But what do I know?
You could also try using Apple’s JSON libraries. One would hope those are optimized (including for SIMD) though who knows? And they may be optimized for error-handling or a particular programming model rather than absolute performance?
Recent Intel has three SIMD units, and AMD Zen (128-bit units) and Zen2 (256-bit units) have 4.
However, the Intel units are not symmetric: not all operations can occur on all units, although some, such as logical operations and some integer math, can. So depending on the mix of operations, an Intel chip might behave like it has 1, 2 or 3 SIMD units.
I don’t think all of simdjson is vectorized, so the vector related scaling only affects a portion of the algorithm: the scaling of the other parts will depend on scalar performance.
Thanks for the clarification.
Given your statements, I’m then really surprised at the gap. Of course Apple is wider, but this doesn’t seem like code for which that would help THAT much.
Is this a case where the NEON intrinsics are just a nicer fit to the problem? Or where certain high latency ops (at least lookups and permutes, for example) run in fewer cycles on Apple?
What gap exactly? You mean the part where the Intel SSE throughput doesn’t exceed the A12 performance by 3.7/2.5?
The implementation has a scalar part and an SIMD part, so the problem doesn’t scale linearly with SIMD width (note also the AVX performance not being double the SSE performance on the same chip). So you can’t apply your SIMD width calculation to the overall performance. We already know A12 usually does more scalar work per cycle, so this can explain it.
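A toy Amdahl-style calculation makes the point (the fraction and speedup below are made up for illustration, not measured from simdjson):

```cpp
// Toy Amdahl's-law model: if only part of the runtime is in SIMD code,
// the overall speedup from wider SIMD is limited by the scalar remainder.
// All numbers here are illustrative, not measurements.
#include <cstdio>

int main() {
  double f = 0.6;              // hypothetical fraction of the runtime spent in SIMD code
  double simd_speedup = 2.0;   // e.g. going from 128-bit SSE to 256-bit AVX
  double overall = 1.0 / ((1.0 - f) + f / simd_speedup);
  std::printf("overall speedup: %.2fx (not %.2fx)\n", overall, simd_speedup); // about 1.43x
  return 0;
}
```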
Also, you can’t just count the number of SIMD EUs, because they are highly asymmetric on Intel and perhaps on Apple chips. It doesn’t matter that you have three EUs if you are primarily bound by, say, shuffles, which only one EU supports.
“It doesn’t matter that you have three EUs if you are primarily bound by, say, shuffles, which only one EU supports.”
OK, that’s the sort of thing I was after.
As far as I can tell the Apple cores are extremely symmetric except for the usual weirdness around integer division and multiplication. I’ve never seen anything to suggest asymmetric NEON support.
“We already know A12 usually does more scalar work per cycle, so this can explain it.”
This I’m less convinced by, in that I find it hard to believe either core is hitting even an IPC of 4. I’d expect that, even in carefully tweaked hand assembly, this is a hard problem to scale to high IPC.
Maybe I’m wrong! That’s just a gut feeling…
What does IPC > 4 have to do with anything?
A12 gets higher IPC (and higher “work per cycle” which is what we are really talking about) in general, but running at an IPC > 4 is not in any way a prerequisite for that.
In general A12 does better than pure frequency scaling would suggest: both because the A12 is more of a brainiac (does more work per cycle), and because scaling distorts things like misses to L3 or DRAM, which are at least partly measured not in cycles but in real time (or DDR cycles or whatever, which doesn’t scale with CPU frequency).
So if you are expecting an Intel chip at 3.7 GHz and an A12 chip at 2.5 GHz to perform in a ratio of 3.7/2.5 you’ll be disappointed most of the time and I don’t see any reason for this code to be different.
Nice work. Can you clarify your note about not optimizing for the A12? First, what do you mean by “ARM vs. Apple”? Weren’t they the same thing in this case? And what sort of optimizations did you not do for the A12 code? You used SIMD so I’m not sure what else was on the table.
First, what do you mean by “ARM vs. Apple”?
It was a typo. It is Intel vs. Apple.
And what sort of optimizations did you not do for the A12 code? You used SIMD so I’m not sure what else was on the table.
There are many design choices; there are often 10 different ways to implement a given function.
The fact that we use SIMD instructions for part of the code is no guarantee that we are making full use of the processor. It is very likely that someone who knows ARM well could beat our implementation… by an untold margin.
The AVX implementation received more tuning so it is less likely that you could beat it by much.
For example, our UTF-8 validation on ARM is likely slower than it should be, and we even have better code samples (there is an issue about it in the project); we just did not have time to assess it.
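To give a flavor of what is on the table, here is a minimal sketch of a common trick (not the code currently in the project): much of typical JSON input is plain ASCII, and a NEON fast path can skip the full UTF-8 state machine for such blocks.

```cpp
// Minimal NEON (AArch64) sketch of an ASCII fast path for UTF-8 validation.
// Blocks that are pure ASCII need no further UTF-8 checking.
#include <arm_neon.h>
#include <cstdint>

// Returns true if the 16 bytes at 'p' are all ASCII (high bit clear).
static inline bool is_ascii_block(const uint8_t *p) {
  uint8x16_t block = vld1q_u8(p);   // load 16 bytes
  return vmaxvq_u8(block) < 0x80;   // horizontal max: any byte >= 0x80 means non-ASCII
}
```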
Great article! Maybe a small typo in the first section:
Shouldn’t it be:
Thanks.
In addition to Qualcomm/Apple you may also want to try the new ARM eMag core running at 3.3 GHz (32 cores).
Packet has a c2 type available with this CPU.
I actually own an Ampere box! (And I have covered it a few times on this blog.)
It does have lots of cores… but it is not really competitive in terms of single-threaded performance especially when NEON is involved.
(I am still a fan of the company and will find a way to buy their second generation systems.)
I did an experiment to reduce the amount of unnecessary work in stage 1. Rather than flatten the bitmap in flatten_bits, we can just write the whole bitmap as is. Stage 2 then decodes it in UPDATE_CHAR one bit at a time. A simplistic implementation shows the following speedups on an AArch64 server for the four JSON files: 0.8%, -2.4%, 5.1%, 7.1%. Branch mispredictions are higher of course, but it’s still faster overall.
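Roughly, the contrast between the two strategies looks like this (a simplified sketch; the real flatten_bits in simdjson is unrolled and written to avoid branch mispredictions):

```cpp
// Simplified sketch of the two strategies for consuming the structural bitmap.
#include <cstdint>

// Strategy A (current): stage 1 flattens the 64-bit mask into explicit indexes.
static inline uint32_t flatten_bits(uint32_t *out, uint32_t base, uint64_t bits) {
  uint32_t count = 0;
  while (bits != 0) {
    out[count++] = base + __builtin_ctzll(bits); // position of lowest set bit
    bits &= bits - 1;                            // clear lowest set bit
  }
  return count;
}

// Strategy B (the experiment): stage 1 stores the raw mask, and stage 2
// recovers one structural position per call, on demand (caller checks bits != 0).
static inline uint32_t next_structural(uint64_t &bits, uint32_t base) {
  uint32_t idx = base + __builtin_ctzll(bits);
  bits &= bits - 1;
  return idx;
}
```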
While stage 1 achieves a great IPC of ~3 with very few branch mispredictions, the work it performs doesn’t seem to be worthwhile enough to really help stage 2. Like I mentioned before, adding code to skip spaces in the parser should simplify stage 1 considerably and give larger speedups.
Thanks. I will investigate this possible optimization.
Apple’s processor is what makes it unique and popular. It is so well optimized in both the iPhone and the Mac.