Daniel Lemire's blog


Memory-level parallelism: Intel Skylake versus Intel Cannonlake

8 thoughts on “Memory-level parallelism: Intel Skylake versus Intel Cannonlake”

  1. bizude says:

    What Skylake CPU was tested? What RAM types were used in both systems? Asking for /r/Intel

    1. What Skylake CPU was tested?

      Skylake is the last new microarchitecture we have had from Intel in a long time; all the recent Intel processors are based on it.

      What RAM types were used in both systems?

      The Skylake box has DDR4 (2133 MT/s); the Cannonlake box has LPDDR4 (3200 MT/s).

  2. bizude says:

    The Skylake box has DDR4 (2133 MT/s); the Cannonlake box has LPDDR4 (3200 MT/s).

    Is there any way you could test using the same memory types? I would imagine that the bus differences, etc., between LPDDR4 and desktop DDR4 could account for the latency difference.

  3. bizude says:

    I just realized Skylake doesn’t support LPDDR4 and Cannonlake doesn’t support normal DDR4. Welp, I guess we’ll be waiting for Sunny Cove before my question is answered.

  4. John says:

    If the BIOSes allow it, you can still set the bandwidth and latency numbers to the same values to get a more equal comparison. LP vs. non-LP doesn't really matter.

  5. Carl Nettelblad says:

    I went back to your previous stories, but it doesn’t seem this is addressed conclusively. Do you believe this is a feature of the core, or of the memory controller? I guess mainly the core at these levels, but it would for sure be interesting to know when one would saturate the higher-core-count Xeons in the pathological case of essentially random accesses into very large data structures. My empirical results on a “real” application so far indicate huge benefits from prefetching, huge pages and hyperthreading combined, but I would guess the code in practice is maybe able to maintain 2-4 actual independent lanes per core in this configuration.

    1. The number of outstanding requests is a feature of the processor.

    2. Travis Downs says:

      The short answer is “all of the above”.

      All parts of the path to memory play a part in the observed parallelism.

      For example, the core itself must have some number of “miss handling registers” to support multiple outstanding misses in L1 – otherwise, there could be no parallelism at the core level.

      Further along the path, the “uncore”, memory controller and DRAM itself support varying degrees of parallelism, all of which interact to produce the observed parallelism in this type of benchmark and also of course for real world code.

      Note that the observed parallelism isn’t simply that of the part of the path that supports the smallest parallel factor – it’s more complicated than that, since each part of the path has a different “occupancy time”: the shorter the time, the lower the required parallelism for a given occupancy level. For example, DRAM itself has fairly low intrinsic parallelism: after all, at the physical level there is only a single set of address and data busses etched on the motherboard per memory channel, and at most one thing can be passing over those busses at any given moment. Even there, however, you have a type of parallelism inside the DRAM chips, which can have multiple open pages, and accessing an open page is faster than accessing a closed one.

      Backing up to the memory controller: these generally support many parallel requests. I don’t have an exact figure, but manuals for older Xeons indicated 32 requests per controller, and I don’t think that figure has gotten any smaller. At the memory controller level, parallelism helps in two ways: (1) the obvious way, the same as for other parts of the path to memory, where more requests in flight increase total throughput via MLP; and (2) by having many requests visible to the controller at once, they can be rearranged so the DRAM is accessed in a pattern more friendly to the underlying hardware, e.g., hitting more open pages as discussed above.

      Finally, between the core and the memory controller you have the so-called “superqueue”, which covers the path approximately from L2 to L3 and is thought to have 16 or so entries (but it is likely more now on CNL, since we see a higher observed MLP factor).

      As far as multiple cores go, using many of them changes things dramatically. You can basically break the path above into the “core private” and “shared” components per socket. The superqueue and everything closer to the core is private, while the L3/CHA, memory controller and DRAM are shared. Usually the shared resources aren’t the bottleneck for a single core, since they are sized for multiple cores – but once you get a few cores running at maximum bandwidth, the shared components become the bottleneck and the achievable per-core MLP drops. The detailed behavior of the core-private stuff is usually the same within a micro-architecture, even across client and server parts, but the shared stuff varies a lot, all the way down to the specific characteristics and number of DIMMs you are using.
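      As a rough illustration of that core-private versus shared split, here is a hypothetical sketch that runs the same private pointer chase on several threads; timing it for 1, 2, 4, … threads shows where the shared L3, memory controller and DRAM start to limit the achievable per-core parallelism:

        #include <pthread.h>
        #include <stddef.h>
        #include <stdint.h>

        struct lane {
            const uint64_t *cycle; /* private random cycle for this thread */
            size_t steps;
            uint64_t result;
        };

        /* Each worker stresses only its own core-private resources; the
           shared path (L3/CHA, memory controller, DRAM) is common to all. */
        static void *worker(void *arg) {
            struct lane *l = (struct lane *)arg;
            uint64_t pos = 0;
            for (size_t i = 0; i < l->steps; i++) {
                pos = l->cycle[pos];
            }
            l->result = pos;
            return NULL;
        }

        /* Time this call for nthreads = 1, 2, 4, ... (assumes nthreads <= 64):
           once the shared components saturate, total throughput stops scaling
           and the effective per-core MLP drops. */
        void run_lanes(struct lane *lanes, int nthreads) {
            pthread_t tids[64];
            for (int i = 0; i < nthreads; i++)
                pthread_create(&tids[i], NULL, worker, &lanes[i]);
            for (int i = 0; i < nthreads; i++)
                pthread_join(tids[i], NULL);
        }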