Really interesting. Do you have an explanation for why there’s so much sensitivity to the “percache” tunable? I’d honestly not expect much difference between working with 1, 2, or 16 integers in the same cache line once the cache-miss penalty has been paid…
Wayne Scott says:
So your “law” is true. Even before your change to make each loop iteration dependent on the previous iteration, the latency for each cache line was around 300 cycles. But processors can fetch multiple cache lines in parallel, so an OOO processor is very good at hiding this latency. Basically the same as your conclusion.
The law is true, indeed, but it can be misinterpreted.
foobar says:
It is an understandable architectural optimization that cache line fills can provide data “early” instead of first reading the whole cache line (although I wonder how this goes with ECC memory). After all, SDRAM bandwidth, measured in whole cache lines, is sufficiently low to make this kind of a latency optimization worthwhile.
An even more interesting (or maybe just more esoteric) experiment would be to look at the effect of the read order within the filled cache line. SDRAM burst ordering can bring a specific “critical SDRAM word” to the front of the read, but may shuffle the rest of the burst in a somewhat unintuitive manner. On top of all this, modern processors may reorder reads, which may improve performance in this scenario but make it less obvious what is going on.
The clearest scenario is the following: reading bytes sequentially from the beginning of a large alignment boundary (such as a cache line boundary) onwards should result in a sequential burst ordering, providing data in an address-sequential manner. Starting the operation at some other alignment is likely not to do so, and data-dependent address chaining might expose this fact.
This information may be valuable to an extent when performing multiple, possibly interdependent accesses to a data structure that is unlikely to reside in the cache hierarchy. At least I can imagine some microbenchmarks to expose such implementation details.
foobar says:
One such microbenchmark would be to construct all circular linked lists of eight pointers, eight bytes each (both the native x64 pointer size and the most likely SDRAM word size), each list aligned on a cache line. Then simply traverse the off-cache linked lists, eight steps each; the assembly code couldn’t be simpler. I would expect that there are eight time-optimal traversal orders among the 40320 possible (taking the starting offset into account).
foobar says:
One more addition: reading just two pointers on such lists would likely expose most of the oddity involved. Say, lists which have their first read entry at the beginning of the cache line and the second entry at the end of it. It is important to keep these reads dependent; otherwise the DRAM controller might (I don’t really know whether they do this in reality) be able to find a burst ordering which brings down the latency of reading all the required words.
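For concreteness, a minimal sketch of the kind of setup these two comments describe (a sketch under my own assumptions; the struct and function names are illustrative, and the timing harness is omitted): each 64-byte line holds a little circular list of eight pointers, linked in a chosen visiting order, and the traversal is a dependent pointer chase so the hardware cannot overlap the reads within a line.
```cpp
#include <cstddef>
#include <cstdint>

struct alignas(64) Line {            // one cache line = eight 8-byte pointer slots
    void* slot[8];
};

// Link the eight slots of one line so a chase visits them in the given order.
void build_line(Line& line, const int (&order)[8]) {
    for (int i = 0; i < 8; i++)
        line.slot[order[i]] = &line.slot[order[(i + 1) % 8]];
}

// Dependent traversal: each load's address comes from the previous load.
void* chase(void* p, int steps) {
    for (int i = 0; i < steps; i++)
        p = *static_cast<void**>(p);
    return p;                        // returning p keeps the chain from being optimized away
}
```
Building an array of such lines much larger than the last-level cache and timing chase(&line.slot[order[0]], 8) for different orderings (e.g., address-sequential versus first-word-then-last-word) would be one way to probe the burst-order question; whether the effect is visible on a given machine is exactly what such a benchmark would have to show.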
Travis Downs says:
Interesting I hadn’t thought of critical word first stuff being a contributor here – does modern hardware still even use it? I had heard that they don’t: the high bandwidth and relatively long latency (and the wide busses which means a lot of data arrives at the same time anyways) would make the relative benefit quite small.
It would be an interesting test though. I’ll do it soon in uarch-bench.
FWIW my impression of the primary effect here, in the first set of “throughput” tests, is that the extra work done by the versions with more than one read ends up reducing the memory level parallelism by clogging both the ROB and load buffers with instructions and loads so that after the first miss the CPU can’t get far enough ahead to max out MLP.
GCC decided not to inline the test functions, but if you add the inline keyword it does, and the results are a lot closer, especially for 4 and 8, because the loop overhead disappears, cutting down the instructions clogging the ROB. The 16 case probably maxes out the load buffers and is still fairly slow.
foobar says:
Interesting I hadn’t thought of critical word first stuff being a contributor here – does modern hardware still even use it? I had heard that they don’t
Frankly, I’m not certain if they do. It should be checked out by benchmarking, but if any of the effects seen in the tests in this blog post actually result from latency caused by a simple bandwidth bottleneck on the memory bus (there are definitely other plausible explanations on modern hardware), the effect could be large enough to provide low-hanging fruit for optimizing purely random dependent reads, such as traversing large graphs or executing DFAs. Unless, of course, critical word ordering has an overhead which makes it impractical to be a benefit these days.
I quickly estimated that the performance effect of this could be up to 10–20% (< 8 clock cycles), assuming sequential burst reads and critical-word-first reads are equal in all other respects.
My knowledge on the subject is at least a decade old. Carefully constructed benchmarks could smash my assumptions…
Travis Downs says:
I started writing a test along the lines you suggested, but before I even got to the critical word (from RAM) part, I found another weird effect: when multiple loads hit the same L2 line (after missing in L1), only the first load gets the “expected” L2 latency of 12 cycles. The rest of the loads get a significantly longer latency of 19 cycles.
This almost sounds like “critical word first” at the L2 to L1 level, but I don’t think it is: 7 extra cycles makes no sense for a 64-byte bus that should transfer the whole cache line in one shot and that has 64-byte bandwidth, and more importantly the effect occurs even if the other reads are for the exact same location.
More likely what happens is something like an optimized path for waking up the load that triggered the L1 miss and providing it the value directly, while the other loads just get woken up without receiving a value and have to read from L1 (although the difference of 7 cycles is even a bit longer than the usual L1 latency).
I put more details at RWT.
This effect would also explain part of Daniel’s results: e.g., part of the final latency test: there would be at least an additional 7 cycles added on (which BTW exactly explains 310 cycles vs 300 for 4 adds vs 1 add: 7 extra cycles + 3 cycles of adds).
foobar says:
Uh-oh, that’s very curious! I think I have heard of such an effect before, but I can’t really recall where or when that would have been.
foobar says:
I think the Intel glitch Travis found on RWT is worth noting in this thread too. Essentially, a trivial linked-list sum, as compiled by all mainstream compilers, runs at an unexpected 19 cycles per iteration, while a similar but more convoluted piece of code does the same at 12 cycles per iteration when the data is in L2:
https://www.realworldtech.com/forum/?threadid=178902&curpostid=178969
Modern microarchitectures are occasionally strange indeed!
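For readers who haven’t followed the RWT link, the “trivial linked list sum” in question looks roughly like this (a sketch, not the exact code from that thread; the node layout is my assumption): each node holds the next pointer and a value in the same cache line, and the loop does one dependent pointer load plus one value load per iteration.
```cpp
#include <cstdint>

struct alignas(64) Node {        // pad each node to one cache line
    Node*    next;
    uint64_t value;
};

uint64_t sum_list(const Node* head) {
    uint64_t sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        sum += n->value;         // value and next sit in the same line
    return sum;
}
```
Attributing 12 versus 19 cycles per iteration to this loop requires a careful harness (uarch-bench-style cycle counting with the list resident in L2); a plain wall-clock measurement only shows the aggregate slowdown.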
foobar says:
Well, this was partially a restatement of what Travis described above, but with a concrete, real-world example that would probably surprise many people more than a synthetic and partially nonsensical benchmark written in assembly would.
Also, this must have been the case of “an effect” I had run into earlier on. Many others must have also run into it…
With the MSVC compiler, there’s a very small difference when testing with 8 GB vectors, only 8% between 1 and 16.
I think the problem is that the GCC optimizer doesn’t like that code for some reason.
Here’s my version, which should be portable across compilers, scalar-only, C++, without assembly, and slightly more optimized: https://github.com/Const-me/MemoryAccessCosts
The results are in result-gcc.txt and result-msvc.txt in that repo. They’re from the same PC; I’m using GCC 5.4.0 running in WSL on Windows 10.
If you want to build my code, run cmake ., then make, then ./memory-access-demo.
Thank you. It appears that MSVC is better at vectorizing this code than GCC. According to your assembly, MSVC relies on SIMD instructions. Note that my post does report what happens if you vectorize: a much smaller difference. The smaller difference is apparently related to the fact that we use far fewer instructions.
I dunno what’s going on.
I’d like to point out that even ignoring the caches, RAM delivers 128 bits per transfer on my PC. I have dual-channel DDR3 RAM, and each channel delivers 64 bits per transfer. 128 bits is exactly an SSE register.
One possible explanation is that Intel optimizes the performance of their chips for their own compiler; AFAIK ICC is even better at vectorizing than MSVC. Unfortunately I don’t own an ICC license to test that. So the performance of scalar RAM loads just wasn’t a priority for Intel (just like the performance of e.g. x87 floating-point math; modern compilers tend to use SSE & SSE2 instead of x87).
Another possible explanation is that it’s an unfortunate accident. The CPU decodes a memory load instruction, issues a load request, decodes another one, discovers it loads from the same line, repeats a few more times, and then hits a hardware limit on in-flight instructions, or in-flight RAM requests, and just stops and waits. Meanwhile, with SSE there are fewer instructions and fewer RAM loads, so maybe the CPU does deeper pipelining.
Another possible explanation is the prefetcher silicon screwing things up. The CPU tries to detect consecutive RAM reads and optimize them by prefetching the next memory addresses. Maybe when the CPU decoded 16 scalar load instructions reading consecutive addresses the prefetcher kicked in and started prefetching unneeded data, while 4 vector load instructions aren’t enough to trigger that.
Optimization is hard. I’m not in academia; I’m a software developer often working on performance-sensitive code. I’ve experienced many weird things, e.g. a couple of months ago I ported my relatively large C++ project from VS2015 to VS2017, and for some functions I saw up to a 2x performance degradation, just because the compiler produced slightly different code from the same C++ source. I had to refactor a couple of things to maintain the performance (for that kind of work, a good profiler helps a lot).
Travis Downs says:
It’s not the prefetcher: the results are more or less the same with prefetching turned off.
IMO the parallel results (the first set Daniel presents, with the large relative differences) are explained simply by the higher “percache” runs having less MLP. You can measure this with perf and see the difference: MLP drops as you move to accessing more elements per cache line. For example, I get MLP factors of 5.4, 4.0, 2.9, and 2.1 for percache values of 1, 4, 8, and 16, respectively. At least on my machine this largely explains the average performance difference of ~35, ~51, ~69, ~94 cycles per operation (i.e., the MLP is very nearly proportional to performance).
You can measure this yourself with perf or ocperf with the events l1d_pend_miss_pending and l1d_pend_miss_pending_cycles: dividing the former by the latter gives the observed MLP.
The MLP is lowered with higher percache values because of the extra work done. E.g., percache 16 executes 9 million instructions for my test versus 3 million for percache 1. It works out to about 90 instructions per cache line. Since the ROB (reorder buffer) is only ~200ish instructions, it means you can fit at most 224 / 90 ≈ 2.5 cache-line misses in the instruction window, compared to 3x that number for the percache 1 case. Note that 2.5 is approximately the observed MLP: in fact, the ROB-based max MLP approximately bounds from above all of the observed MLP numbers.
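The arithmetic behind that bound, spelled out (the ROB size is an approximation for a Skylake-class core; the instruction count is the one quoted above):
```cpp
// Maximum misses the instruction window can hold, assuming ~90 instructions
// per cache line for percache=16 and a ~224-entry reorder buffer.
constexpr double rob_entries        = 224.0;
constexpr double insns_per_line_16  = 90.0;
constexpr double max_mlp_percache16 = rob_entries / insns_per_line_16;  // ≈ 2.5
```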
This also explains why vectorization is helpful: not because it is inherently “faster” (the time to execute the actual sums is fairly small compared to the observed per-line runtime, with or without vectorization) – but because it uses fewer instructions so the instruction window is much larger. For example, sumrandom in Daniel’s code, which is vectorized, only executes ~21 instructions per cache line despite summing the entire line, so the instruction window can fit roughly 10 or 11 lines and the observed MLP is 7.1: much higher than even percache 1.
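To illustrate the instruction-count point (a generic sketch of a vectorized per-line sum, not Daniel’s actual sumrandom, and it walks lines sequentially only for brevity): with AVX2, two 32-byte loads and two adds cover a 64-byte line, so the per-line work is only a handful of instructions and the reorder buffer can hold many outstanding line misses.
```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sum 32-bit integers, 16 per 64-byte cache line, with ~5 instructions per line.
int32_t sum_lines_avx2(const int32_t* data, std::size_t nlines) {
    __m256i acc = _mm256_setzero_si256();
    for (std::size_t i = 0; i < nlines; i++) {
        const int32_t* line = data + 16 * i;
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i*)line));
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i*)(line + 8)));
    }
    // Reduce the eight partial sums once, outside the per-line loop.
    __m128i t = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    t = _mm_add_epi32(t, _mm_shuffle_epi32(t, _MM_SHUFFLE(1, 0, 3, 2)));
    t = _mm_add_epi32(t, _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(t);
}
```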
Yongkee says:
It may be a bit late to add comments here, but I would buy this interpretation reasoning about the realized MLP. My only concern is how to measure it precisely.
It seems the way Travis suggested, with the ocperf wrapper, is good enough to measure MLP for this problem, but I wonder if there is any direct way to measure the utilization of the (L1) MSHRs. I also wonder what happens if the L2 MSHRs are taken into account. Would they increase or constrain MLP at the core/L1 level, or not?
Could you add any comment on these?
alecco says:
Something related and very interesting is the knock-on effect of kicking data out of the cache. If your data is sparse, then every time you fetch a cache line wanting significantly less data (e.g., 4 bytes), you are kicking a previous cache line (e.g., 64 bytes) out of the cache.
For data structures benefiting from caching it’s imperative to have the minimum memory access units/nodes be as compact as possible within a cache line. If this is not possible and the effects of caching are negligible, consider using non-temporal memory access so you keep the existing cached data for some other parts of the program.
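A small sketch of the two suggestions above (the type and field names are mine, and the non-temporal hint is only a hint whose exact behavior is hardware-dependent): pack the fields a lookup actually touches into one 64-byte line, and when a line is expected to be used once, prefetch it with the NTA hint so it is less likely to displace data you still want cached.
```cpp
#include <immintrin.h>   // _mm_prefetch, _MM_HINT_NTA
#include <cstdint>

// Everything the hot lookup path touches fits in exactly one cache line.
struct alignas(64) CompactNode {
    uint64_t key;
    uint64_t payload;
    uint32_t child[12];   // 8 + 8 + 48 = 64 bytes
};
static_assert(sizeof(CompactNode) == 64, "node should span exactly one cache line");

// Hint that this line will be used once and need not stay resident in the cache.
inline void prefetch_single_use(const CompactNode* n) {
    _mm_prefetch(reinterpret_cast<const char*>(n), _MM_HINT_NTA);
}
```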
Travis Downs says:
Definitely. This is also one source of the sometimes big gap between microbenchmark and real world results.
Let’s say you are tuning some lookup structure using a “realistic” microbenchmark (i.e., it reproduces the access pattern of your application, perhaps even using a trace of real application accesses). You’ll end up with something that balances cache use optimally only with respect to the microbenchmark: i.e., kicking a line out only penalizes later iterations of the microbenchmark (if at all).
Drop that thing into the real application and you’re now kicking out lines used by the application code that will run after your lookup, which might have a very different cost than it did in your benchmark. As always, microbenchmarks are useful but have to be used very carefully, especially when comparing implementations with different memory impacts.
Wayne Scott says:
Something related and very interesting is the knock-on effect of kicking data out of the cache.
When I was building benchmarks like this, you would see this especially when testing array sizes that fit in the L2 cache. If you build a chain of pointers and then start walking the list immediately, all the data in L2 is modified, and when it gets pulled into the L1 cache it stays in the modified state. So every read needs to do a dirty writeback, evicting data from L1 and putting it back into L2. This is much slower.
So I moved to writing a large junk array of data on the side to flush the cache before starting my measurements. That works much better if my goal is simple cache-miss latencies.
before: https://www.dropbox.com/s/raqv6eubcti0ir3/mem_lat1.jpg?dl=0
after: https://www.dropbox.com/s/t7ebunoxpzzyc18/mem_lat3.jpg?dl=0
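One way to implement the junk-array flush Wayne describes (a sketch; the buffer size is a guess and should comfortably exceed the last-level cache of the machine under test):
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Write a buffer much larger than the LLC so earlier (possibly dirty) lines
// are evicted before the timed measurement starts.
void flush_caches_by_writing(std::size_t bytes = 64u << 20) {     // 64 MiB, assumed > LLC
    std::vector<uint8_t> junk(bytes);
    for (std::size_t i = 0; i < junk.size(); i += 64)             // one store per cache line
        junk[i] = static_cast<uint8_t>(i);
    volatile uint8_t sink = junk[junk.size() - 1];                // keep the writes observable
    (void)sink;
}
```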
That’s interesting: I wouldn’t necessarily expect that. It would seem that L1 could have its own dirty flag, separate from the L2 MESI state, so the line would only be written back if actually changed. It seems you are saying that the line in L1 will inherit the L2’s modified state, permanently (until the line is flushed out of L2). Hmm…
I will check out your code.
Seems to me this means memory locality matters a lot! Ignoring the SIMD version, my estimate of the actual “work” performed is:
Integers/line | Cycles/integer | Original cycles/line
1 | 38 | 38
4 | 15 | 60
8 | 8.75 | 70
16 | 6.875 | 110
That is a 5.5X speed-up, just by putting data closer together! I spent more time than I would like during my PhD tweaking an in-memory B-tree, which has similar performance characteristics.
memory locality matters a lot!
Undoubtedly.