Memory-level parallelism: Intel Skylake versus Apple A12/A12X
21 thoughts on “Memory-level parallelism: Intel Skylake versus Apple A12/A12X”
would it be equivalent to convert these findings to a concept of “memory bandwidth”?
At the margin, we are bound by the available bandwidth.
That 2x lift between Skylake and the A12 is a little suspicious. Is the benchmark bottlenecked by memory bandwidth? Skylake uses DDR4-2133 when not overclocked, and the A12 seems to be using LPDDR4x (4266?). I’m not too familiar with how DDR RAM works, but I think LPDDR4 has a minimum access size of 256 or 512 bits, so bandwidth could become a problem. Based on your numbers, the A12 takes about one second for reading 256 MB with only one lane, and the maximum speedup is 25x. So, the highest transfer rate observed is 6.4 GB/s. That’s not too far from the maximum bandwidth especially if 75% of the transferred bits are discarded.
I’m not sure what you think you are calculating so I can’t comment, but Geekbench4 reports the A12’s memory bandwidth as either
– 14.9 GB/s (copy), or
– 19.1 GB/s (“Bandwidth”).
(The former uses SYSTEM memcpy() and, more important, operates on a random mix of offsets and sizes that are supposed to match the distribution found across real programs; the latter is a balls-to-the-wall maximum “bandwidth”, though it’s unclear what the precise details are. I’m assuming equal reads and writes, with fairly large [multiple MB] between them, so minimal R/W turnaround cost.)
The DRAM used is Micron MT53D512M64D4SB-046.
The parts of that that matter are the initial 53 (meaning Mobile LPDDR4) and the trailing -046 (meaning 2133MHz). (So the best currently available LPDDR4, not some exotic Apple-exclusive higher speed.) This has a peak throughput of 17 GB/s per 64-bit interface.
I THINK, but I am not at all sure about this, that Apple runs two memory controllers, each controlling an independent 64-bit lane, which gives higher latency for an individual cache-line transfer but allows more independent pages to be open, since the pages on the two DRAM dies don’t have to be correlated. And the iPads do the same thing (two memory controllers, but each now controls a 128-bit-wide lane), so they get much better peak bandwidth (~31 GB/s for A12X) but not much difference once latency starts dominating your throughput (~18 GB/s for GB4’s memcpy on the A12X).
The minimum access size may indeed be 256 bits (32 bytes) or 512 bits (64 bytes), but this generally doesn’t matter for software since loads and stores to memory will occur in cache-line-sized chunks (64 bytes). Things would get weird if one day a memory technology had a minimum access size of more than 64 bytes, but that’s not likely for mainstream RAM, exactly because of the cache line size.
Anyways, Skylake systems generally have a bandwidth of 30 to 100 GB/s (depending basically on the number of channels), and this test probably only gets to 10 GB/s or so, so I wouldn’t expect a bandwidth limitation. Indeed, the performance is “exactly as expected” for a system with 10 line fill buffers, as is fairly widely documented, and the performance counters also back up this measurement.
In this context it is the Apple measurement that is surprising(ly good).
Not really. Spinning disks, for instance, can return 400,000 512-byte sectors a second, but only 100-150 or so if the reads are random. SSDs, on the other hand, can often return 50,000 or more random reads.
So random access for IO is often reported as IOPS, and spinning disks only handle 50-150, but they handle sequential access well. Sequential performance is often measured in MB/sec.
Same with memory, but there’s numerous issues. Does the memory access pattern fit in L1, L2, or L3 caches? Is it sequential? Random? On a few memory pages, or many?
Additionally, on most platforms a single core, or maybe two, can saturate the memory system with sequential accesses. But you need a significant number of independent request streams to saturate it with random requests, usually somewhere around twice as many streams as memory channels, although that can change significantly from platform to platform.
So, generally, converting memory latency (accesses per second) into a bandwidth figure is an oversimplification. Bandwidth generally implies sequential access, and differences in parallelism, access pattern, and the rest can make huge differences in performance. When graphing these results, the slowest configurations are often in the GB/sec range and the fastest in the TB/sec range.
“The fact that the A12 has higher timings when using a single lane suggests that its memory subsystem has higher latency than a Skylake-based PC. Why is that?”
(a) Apple may be using drowsy caching more aggressively than Intel?
(b) Apple may be using one or more of the various techniques to amplify the coverage of caches, at the cost of some latency. (For example, though Apple may not be doing this [yet?], you can add compression to your LLC to give it effectively about twice the coverage, ~8% average performance boost [best case 20%], ~20% energy reduction, at the cost of ~10 cycles or so in latency.)
“The fact that the A12 has higher timings when using a single lane suggests that its memory subsystem has higher latency than a Skylake-based PC. Why is that?”
Because the single lane test is pretty much the definition of memory latency.
Note that “memory latency” here refers to software-observed memory latency, not any particular thing you might be able to measure at the hardware level. So maybe some component of the memory subsystem has lower latency on the Apple chips, but then something else adds additional latency so that the observed figure is higher, who knows.
This isn’t news: most reports have shown the Apple A# chips to have longer latency than contemporary Intel parts. That’s not necessarily any kind of flaw; maybe it’s part of a smart tradeoff. Note, for example, that Intel server systems also have much longer latencies than their client systems, despite (and perhaps partly because of) their beefier “server class” memory subsystems.
In addition to “huge pages”, another factor that might be artificially limiting the Apple results is register spilling. The current test keeps track of each lane with a regular register, and the Apple processor supports 32 registers. As a result, because some additional registers are needed for testing, the compiled code starts spilling the lane registers to the stack once the number of lanes gets to the mid-twenties. Since this spillage involves extra stores and loads, it’s possible that the Apple chips have even more potential for parallelism than the testing so far shows. Does anyone have ideas for how to design a better test that would show this?
Starting from a simple pointer-chasing test, you could design one that accesses N additional regions “in lockstep” with the original by adding offsets N1, N2, N3, etc. to the original pointer. These offsets can often be added “for free” in the addressing mode (up to a certain size at least), but even if not, you just need to do the addition yourself, which needs only a single temporary register for the entire loop.
You want to keep the offsets unequal so they don’t trigger prefetching, or at least far enough apart, since accesses that are far apart also generally don’t trigger prefetching (for example, on Intel, accesses that are more than 2048 bytes apart don’t trigger prefetching, since the next access following that pattern would lie in another page anyway).
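Concretely, the loop might look something like this rough C sketch (the function and parameter names are mine, and the buffer is assumed to be allocated with enough slack for the largest offset):

#include <stddef.h>
#include <stdint.h>

/* Rough sketch of the "lockstep" idea: a single random chain of indexes
   is chased, and each extra lane loads the same position shifted by a
   fixed element offset. Only one index is live across iterations, so
   adding lanes does not cause register spilling. The offsets should be
   unequal and at least 2048 bytes (256 elements) apart so they do not
   wake up the prefetcher, and the allocation must be large enough that
   p + offsets[k] stays in bounds. */
uint64_t chase_lockstep(const uint64_t *index, const size_t *offsets,
                        size_t nlanes) {
    uint64_t sum = 0;
    size_t p = 0;
    do {
        p = (size_t)index[p];                 /* the latency-bound chain */
        for (size_t k = 0; k < nlanes; k++)   /* independent lockstep loads */
            sum += index[p + offsets[k]];
    } while (p != 0);
    return sum;   /* consumed so the compiler cannot drop the loads */
}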
Cool. I’ve been making these measurements on various Intel/AMD platforms for a while. One thing I wanted to mention is that you have to be careful to ensure that a random walk is a memory-latency tester and not a TLB tester. One easy hack that is very friendly to the TLB, but still just as hard on memory latency, is to use a sliding window. I use a Knuth shuffle modified to swap members of an array within a user-defined range. That way you can tweak the benchmark from a pure memory-latency test to a pure TLB test, or anywhere in between.
Oh and I also wanted to mention that the parallelism varies in L1, L2, L3, and of course per socket.
One easy hack that is very friendly to the TLB, but still just as hard on memory latency, is to use a sliding window.
I’d be interested in knowing more.
Presumably this means doing all the accesses randomly, but within a sliding window of say 4 KiB or 16 KiB or whatever; the window slides by N bytes after you access N bytes. So the number of TLB misses is as low as possible, because once a page is brought in it is used heavily while it is inside the sliding window and then never used again.
A related scheme that I use is what I usually see called “page random”, which is to randomly access every line within a page (or a range of pages), then move on to the next page. It’s the same as the sliding window except that the “sliding” is not fine-grained but jumps in page-sized quanta.
I thought it was simpler since you just do a “normal” shuffle on each page, but as Bill points out, the existing shuffle algorithm is pretty easy to adapt to a sliding window (perhaps this introduces some kind of slight bias near the edges, but that is fine for this purpose).
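For what it’s worth, a minimal C sketch of the “page random” order described above (the names and the 64-lines-per-4-KiB-page assumption are mine):

#include <stdlib.h>
#include <stddef.h>

/* Minimal sketch of "page random": shuffle the visit order of the cache
   lines within each 4 KiB page (64 lines of 64 bytes), but walk the
   pages themselves in order. order[i] is the i-th line index to touch. */
#define LINES_PER_PAGE 64

void page_random_order(size_t *order, size_t nlines) {
    for (size_t i = 0; i < nlines; i++)
        order[i] = i;
    for (size_t page = 0; page < nlines; page += LINES_PER_PAGE) {
        size_t len = nlines - page;
        if (len > LINES_PER_PAGE)
            len = LINES_PER_PAGE;
        for (size_t i = 0; i + 1 < len; i++) {   /* Fisher-Yates within the page */
            size_t j = i + (size_t)(rand() % (len - i));
            size_t tmp = order[page + i];
            order[page + i] = order[page + j];
            order[page + j] = tmp;
        }
    }
}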
Having too small a sliding window may bring up different problems: Intel documentation talks of prefetchers which heuristically fetch adjacent cache-line pairs under some circumstances. If the sliding window easily fits in the caches and the actual memory-bus load is too low, this prefetcher may be activated. Also, a sufficiently small window and low parallelism will affect the number of DRAM pages being opened, which can reduce the latency of read operations. All this has to be balanced against TLB misses, especially if huge pages can’t be employed.
Simply put: getting an unbiased picture of the memory subsystem’s operation through software benchmarking is not easy.
From the wiki page, a Knuth shuffle:
for i from 0 to n−2 do
j ← random integer such that i ≤ j < n
exchange a[i] and a[j]
I like that particular method because it guarantees I visit each cache line exactly once. But then I noticed strange behavior: looking at the CPU counters, I realized the bottleneck wasn’t cache misses but TLB misses.
TLBs aren’t nearly as well documented as caches, unfortunately. So I made a small modification.
j ← random integer such that i ≤ j < i + 2048
So if I allocate 1 GB, each member of the array is visited exactly once, but in the vast majority of cases the swaps stay within the same few pages. Sure, a particularly unlucky item might get bumped ahead by up to 2047 positions several times, but generally that won’t happen often.
Of course you’ll likely want to replace 2048 with a variable so you can control the TLB impact.
Oh, keep in mind that near the very end of the list the window might extend past the end of the array. I’d just replace 2048 with min(arraysize − i, 2048) or similar.
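In C, the windowed variant (including that min() fix) might look roughly like this; the function name and the use of rand() are just for illustration:

#include <stdlib.h>
#include <stddef.h>

/* Rough sketch of the windowed Knuth shuffle described above: element i
   is only ever swapped with an element at most 'window' (e.g. 2048)
   positions ahead, so the resulting walk mostly stays on the same few
   pages, and the window is clamped near the end of the array. */
void windowed_shuffle(size_t *a, size_t n, size_t window) {
    for (size_t i = 0; i + 1 < n; i++) {
        size_t span = n - i;              /* min(arraysize - i, window) */
        if (span > window)
            span = window;
        size_t j = i + (size_t)(rand() % span);
        size_t tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}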
A few other gotchas:
* Compilers are (to me, surprisingly) good at cheating; I had to add code that randomly checks the results so the walk isn’t optimized away and the timings stay accurate.
* I tried visiting every cache line exactly once, but noticed anomalous readings. Turns out some platforms with certain settings (sometimes exposed in BIOS) will prefetch the next cacheline. So I visit half the cache lines exactly once.
With the above changes I’m getting exactly the numbers I expect. What was surprising to me is that with an 8-memory-channel system the ideal parallelism for maximum throughput was 16 or so. In retrospect that makes sense: 8 threads cache-miss through L1, L2, and L3, then sit in the memory controller waiting for a memory channel to free up. There are small additional benefits up to 24 or so, presumably because even with 16 threads not all 8 memory channels are always busy; occasionally the requests pile up on fewer than 8 channels, leaving one or more channels idle.
What does your access code look like after the shuffle?
Knuth shuffle (aka Fisher-Yates) won’t guarantee that you visit each line exactly once unless you use a separate array to hold your indexes or something like that (but then you double the number of memory accesses). Usually you will fall into a short cycle and visit only a fraction of the lines.
You probably want Sattolo’s algorithm, which is conveniently described in the same article. The only difference is to replace
i ≤ j < n
with
i < j < n
but the effect is huge: you are guaranteed that the indexes in the array make up a single cycle of the maximum possible length.
I just use:
while (p != 0) {
p = a[p];
}
Ha, I don’t think I read that far, but indeed I stopped at i-1 to avoid problems. I didn’t realize that small tweak had a different name until you mentioned it.
I don’t remember the exact sequence, but I did compare the expected cache and TLB misses and they pretty much matched what I found with my code. That might well have been the reason I stopped at i-1.
It’s not the stopping point you need to adjust but the range you select your random number from.
While considering swapping element i out of a total array of size n, the Fisher-Yates shuffle chooses j in the range [i, n-1] (inclusive), while to get Sattolo’s algorithm you need to choose from the range [i+1, n-1]. Note the +1.
It’s possible you are using Fisher-Yates but “got lucky” and happened to get a large cycle! It’s not that improbable. A simple way to check your code is to start at position 0 and chase until you see 0 again (just like your loop above), and check that the number of steps equals n.
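Something along these lines (a quick C sketch of my own, not the exact code from the thread) does both the Sattolo construction and that check:

#include <stdlib.h>
#include <stddef.h>

/* Sattolo's algorithm: like Fisher-Yates, but j is drawn from
   [i + 1, n - 1], which guarantees a single cycle of length n. */
void sattolo(size_t *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = i;
    for (size_t i = 0; i + 1 < n; i++) {
        size_t j = i + 1 + (size_t)(rand() % (n - i - 1));
        size_t tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}

/* Chase from 0 until we see 0 again; the permutation is one maximal
   cycle if and only if this returns n. */
size_t cycle_length(const size_t *a) {
    size_t steps = 0, p = 0;
    do {
        p = a[p];
        steps++;
    } while (p != 0);
    return steps;
}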
Heh, yeah, I got it right. I ran it a few thousand times, counting the trips through the loop, to make sure.
https://github.com/google/multichase
I wrote an entry on Hacker News highlighting the difference in load/load, store/store, load/store, and store/load memory reordering between ARM and x86. My overall theory is that ARM’s weaker memory ordering allows it to better use large buffers.
Intel may take cache misses in an out-of-order manner, but must put all operations back in order before retirement. ARM can stay fully out of order from cache-miss all the way to retirement. This suggests that Intel’s buffers are doing more work and are more expensive than ARM’s.
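To make the ordering difference concrete, here is a hedged sketch (my own example, not from the linked post) of the classic message-passing litmus test using C11 relaxed atomics; on x86 the hardware keeps the two stores and the two loads in order, while ARM is allowed to reorder them, which is the extra freedom being described.

#include <stdatomic.h>

/* Message-passing litmus test. With relaxed atomics, both the compiler
   and the CPU may reorder these accesses; on ARM the hardware itself
   can make flag_ visible before data_, so consumer() may legally
   observe flag_ == 1 with data_ == 0. On x86 (TSO) the hardware
   preserves store/store and load/load order, so only compiler
   reordering could produce that outcome. */
static atomic_int data_;
static atomic_int flag_;

void producer(void) {
    atomic_store_explicit(&data_, 1, memory_order_relaxed);
    atomic_store_explicit(&flag_, 1, memory_order_relaxed);
}

int consumer(void) {   /* returns 1 if the flag was seen with stale data */
    int f = atomic_load_explicit(&flag_, memory_order_relaxed);
    int d = atomic_load_explicit(&data_, memory_order_relaxed);
    return f == 1 && d == 0;
}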