In the first version, the compiler may have scheduled the first memory access to run in parallel with the second random calculation, and failed to do it in the second. Looking at the asm could shine some light on what’s going on.
Thanks. As I indicated (see Notes), there are run-to-run variations so you should expect to get different numbers.
Nathan Kurzsays:
Since no-one else has speculated on why the observed results are what they are, I’ll take a stab. My guess is that the different functions are being limited by different factors. For “compute_two” and “compute_three”, we’re being limited by MLP. But for “compute_two_plus”, we’re not able to achieve maximum MLP because the extra instruction cause us to be limited by the size of the Reorder Buffer (ROB).
Looking at the assembly (and noting that I don’t actually know how to read ARM assembly) it looks like (only) the “compute_two” function has been unrolled 2x, which slightly reduces the loop overhead. If we presume the sequential reads from random[] are approximately free, we are issuing 4 uncached reads in 15 instructions. If we assume the ROB holds about 200 instructions (Skylake holds 224), this means we can fit about 12 copies of this loop into the ROB, and thus the processor can look ahead to see 48 memory accesses. If we assume the M1 can do 30 parallel accesses, and 48 > 30, this means we max out on MLP and are memory bandwidth limited.
The function “compute_three” is not unrolled, and the inner loop does 3 memory reads in 12 instructions. Applying the same logic, we can fit 15 copies of the loop into the ROB, and the processor has 45 memory accesses to choose from. 45 > 30, MLP limited. Since we are doing 50% more memory reads per iteration, we expect 50% greater time, which is just about exactly what we see.
The function “computer_two_plus” is not unrolled, and does 2 full memory reads in each 15 instructions. We’ll assume for now that the adjacent reads to the same cache line are free. If 12 copies of the loop fit in the ROB, this means the processor sees only 24 full memory reads at a time. 24 < 30, and thus we are not able to take full advantage of the possible MLP! Naively, we might expect to see about 25% slowdown from this. In actuality it’s a bit more, but I’m willing to hand wave this difference away because our assumption that the adjacent accesses are free is probably not quite true.
I’m guessing at the exact numbers here, but my guess would that that something like this explains the numbers you are seeing. But an argument against this theory are claims that the M1 actually has a much deeper ROB: https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2. They suggest that it’s more than 600! I haven’t really looked at what they are measuring, though, and I feel like your example suggests otherwise. It might be interesting to try some other methods of increasing the per-iteration instruction count, and seeing whether they cause the same slowdown as the adjacent accesses.
@Nathan : Thanks for the analysis. I will happily test any code you’d like me to test…
My only claim here is that the naive memory-access model fails. I have not investigated further to see what the limiting factors are. I often run tests under Docker with perf with macOS… but, to my knowledge, Docker still hasn’t released a version of Docker for the M1. I could run the benchmark under Xcode and get performance counters there but it is not as convenient to me.
Not importantsays:
From what I saw of apple doing , it seem that if some don’t use apple stuff , exemple , Xcode swift etc ! Performance is lower , likely some optimisation apple do to their own ! So far most optimisation outside apple are mainly back end ! But as apple have showed , their way work , I mean we have a lot of exemple online how performing it is ! I believe. The neural engine is the big underestimated portion!
The problem is that the CPU can speculatively execute the memory access. That is, the branch predictor will help to unroll your loop and execute all the memory access at once so that the memory latency is
as low as 10ns.
The right way to do this benchmark is generating indexes by idx = random[i] ^ answer for preventing speculation.
@Chenyao : You are suggesting that we turn the benchmark into a pointer chasing benchmark, but if you follow the link for the memory-level parallelism, you will notice that this is how we do it to measure memory parallelism… we just create several independent lanes that work as you describe…
This being said, the benchmark described at the end of this blog post was deliberately designed, so it is not a mistake. That is, I specifically did want the processor to be able to start issuing new memory request ahead of time.
The pointer-chasing benchmark you suggest is also interesting and I might use such an approach in a follow-up post. Thanks!
@Daniel @Nathan. Sorry, I misunderstood the point. Thanks for point out that.
Then I would guess there is a limitation on #speculated memory accesses (cache line prefetch), so there is a difference between 2-wise and 3-wise. But due to some unknown reasons, the 2-wise+ performs similar to 3-wise. Kudo to Daniel for this finding!
Nathan Kurzsays:
You are right about the speculation, but I think you are missing that Daniel has intentionally designed his benchmark to allow this speculation. This is why he included the graph about Memory Level Parallelism and has lines like “Such a high degree of memory-level parallelism makes it less likely that our naive random-memory model applies.”
Daniel’s goal in this benchmark is to maximize the available MLP so as to highlight the difference between the M1 and Intel chips. One benchmark isn’t more “right” than they other, it’s just that they are measuring different things. Your numbers are interesting, though, since they show how just much benefit there is from the speculation. And note that there is still speculation in your example, which is why making two parallel random read requests per iteration takes almost exactly the same amount of as making three.
To be clear, this benchmark is not meant to be used to compare different processors on different systems. It certainly was never meant to be run under Windows. Among other limitations is the fact that the rand() function can return values that are limited to a small range on some systems.
It is fine if folks want to run it on different system, but they have to make it sufficiently robust.
Even on macOS, the tool has limited robustness. Just enough to make the point…
Neosays:
Hi, this is really interesting. I want to understand code and memory optimizations , any suggestions for a beginner?
Hmmm, the value “bogus” is different than on M1.
rand() does not give the same results on different platforms.
I wonder how much this matters when comparing.
It would be better to use a deterministic random number generator.
Igorsays:
Hi,
I’ve run your tests on my Windows laptop (i7-8665U) and results are much better:
The variance of first test is pretty big (up to 1.4 ns) while the other two are stable.
Nathan Kurzsays:
Interesting, although I think it’s extremely unlikely that your results are correct. I’d suspect a bug. Did you change anything in the code to allow it to run on Windows? Maybe reducing the sizes to everything fits in L1? If not, I’m guessing there is something wrong with the Daniel’s time measurement code on Windows. It probably hasn’t been tested nearly as much as on Mac or Linux. Alternatively, maybe the compiler on Windows has figured out a way to defeat the benchmark by optimizing out the actual memory accesses? If you have time, I think it would be useful to figure out what’s happening here.
Warehouse: unified memory
Workshop: CPU, GPU, and other cores
Product( material): information, data
there’s also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance.
The 2 vs 2+ results is indeed interesting. However, I’m a bit not sure about the point of the 3 test.
Should the 3 version be any different from the 2 version with a M that’s 50% larger? I feel like this should be the case (apart from the loop overhead) no matter what system this runs on. Since there’s no barriers or anything like that between loop iterations, shouldn’t the total number of memory references be the only thing that matters.
Hi,
This is interesting, I ran this on my Mac, with processor:2,2 GHz Quad-Core Intel Core i7
There are the results:
$ ./two_or_three
N = 1000000000, 953.7 MB
starting experiments.
two : 44.7 ns
two+ : 45.0 ns
three: 67.6 ns
bogus 137531640
Way too slow for my PC 🙂
Thanks for sharing.
Regards,
Jongi
You may want to upgrade to the Apple M1. It is a massively better processor.
Did you look at the compiled assembly code? That could be interesting too.
See https://gist.github.com/lemire/1c9e8827b45d057d7546e2743ad34496
In the first version, the compiler may have scheduled the first memory access to run in parallel with the second random calculation, and failed to do it in the second. Looking at the asm could shine some light on what’s going on.
See https://gist.github.com/lemire/1c9e8827b45d057d7546e2743ad34496
Hello,
Isn’t this also dependent of the memory’s speed?
The Apple M1 comes with builtin memory, so the memory speed is a constant.
Great read! Getting these results on my M1 Basemodel MBA (8GB/256)
two : 10.2 ns
two+ : 12.1 ns
three: 12.4 ns
Thanks. As I indicated (see Notes), there are run-to-run variations so you should expect to get different numbers.
Since no-one else has speculated on why the observed results are what they are, I’ll take a stab. My guess is that the different functions are being limited by different factors. For “compute_two” and “compute_three”, we’re being limited by MLP. But for “compute_two_plus”, we’re not able to achieve maximum MLP because the extra instruction cause us to be limited by the size of the Reorder Buffer (ROB).
Looking at the assembly (and noting that I don’t actually know how to read ARM assembly) it looks like (only) the “compute_two” function has been unrolled 2x, which slightly reduces the loop overhead. If we presume the sequential reads from random[] are approximately free, we are issuing 4 uncached reads in 15 instructions. If we assume the ROB holds about 200 instructions (Skylake holds 224), this means we can fit about 12 copies of this loop into the ROB, and thus the processor can look ahead to see 48 memory accesses. If we assume the M1 can do 30 parallel accesses, and 48 > 30, this means we max out on MLP and are memory bandwidth limited.
The function “compute_three” is not unrolled, and the inner loop does 3 memory reads in 12 instructions. Applying the same logic, we can fit 15 copies of the loop into the ROB, and the processor has 45 memory accesses to choose from. 45 > 30, MLP limited. Since we are doing 50% more memory reads per iteration, we expect 50% greater time, which is just about exactly what we see.
The function “computer_two_plus” is not unrolled, and does 2 full memory reads in each 15 instructions. We’ll assume for now that the adjacent reads to the same cache line are free. If 12 copies of the loop fit in the ROB, this means the processor sees only 24 full memory reads at a time. 24 < 30, and thus we are not able to take full advantage of the possible MLP! Naively, we might expect to see about 25% slowdown from this. In actuality it’s a bit more, but I’m willing to hand wave this difference away because our assumption that the adjacent accesses are free is probably not quite true.
I’m guessing at the exact numbers here, but my guess would that that something like this explains the numbers you are seeing. But an argument against this theory are claims that the M1 actually has a much deeper ROB: https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2. They suggest that it’s more than 600! I haven’t really looked at what they are measuring, though, and I feel like your example suggests otherwise. It might be interesting to try some other methods of increasing the per-iteration instruction count, and seeing whether they cause the same slowdown as the adjacent accesses.
@Nathan : Thanks for the analysis. I will happily test any code you’d like me to test…
My only claim here is that the naive memory-access model fails. I have not investigated further to see what the limiting factors are. I often run tests under Docker with perf with macOS… but, to my knowledge, Docker still hasn’t released a version of Docker for the M1. I could run the benchmark under Xcode and get performance counters there but it is not as convenient to me.
From what I saw of apple doing , it seem that if some don’t use apple stuff , exemple , Xcode swift etc ! Performance is lower , likely some optimisation apple do to their own ! So far most optimisation outside apple are mainly back end ! But as apple have showed , their way work , I mean we have a lot of exemple online how performing it is ! I believe. The neural engine is the big underestimated portion!
The problem is that the CPU can speculatively execute the memory access. That is, the branch predictor will help to unroll your loop and execute all the memory access at once so that the memory latency is
as low as 10ns.
The right way to do this benchmark is generating indexes by
idx = random[i] ^ answer
for preventing speculation.See https://gist.github.com/louchenyao/75c3a6a3eeb0d7d9b1e8af7e18aacb03
The fixed benchmark result on my Macbook Air with the M1 processor is the following, which is pretty reasonable.
two : 130.3 ns
two+ : 131.0 ns
three: 133.2 ns
@Chenyao : You are suggesting that we turn the benchmark into a pointer chasing benchmark, but if you follow the link for the memory-level parallelism, you will notice that this is how we do it to measure memory parallelism… we just create several independent lanes that work as you describe…
https://github.com/lemire/testingmlp
This being said, the benchmark described at the end of this blog post was deliberately designed, so it is not a mistake. That is, I specifically did want the processor to be able to start issuing new memory request ahead of time.
The pointer-chasing benchmark you suggest is also interesting and I might use such an approach in a follow-up post. Thanks!
@Daniel @Nathan. Sorry, I misunderstood the point. Thanks for point out that.
Then I would guess there is a limitation on #speculated memory accesses (cache line prefetch), so there is a difference between 2-wise and 3-wise. But due to some unknown reasons, the 2-wise+ performs similar to 3-wise. Kudo to Daniel for this finding!
You are right about the speculation, but I think you are missing that Daniel has intentionally designed his benchmark to allow this speculation. This is why he included the graph about Memory Level Parallelism and has lines like “Such a high degree of memory-level parallelism makes it less likely that our naive random-memory model applies.”
Daniel’s goal in this benchmark is to maximize the available MLP so as to highlight the difference between the M1 and Intel chips. One benchmark isn’t more “right” than they other, it’s just that they are measuring different things. Your numbers are interesting, though, since they show how just much benefit there is from the speculation. And note that there is still speculation in your example, which is why making two parallel random read requests per iteration takes almost exactly the same amount of as making three.
To be clear, this benchmark is not meant to be used to compare different processors on different systems. It certainly was never meant to be run under Windows. Among other limitations is the fact that the rand() function can return values that are limited to a small range on some systems.
It is fine if folks want to run it on different system, but they have to make it sufficiently robust.
Even on macOS, the tool has limited robustness. Just enough to make the point…
Hi, this is really interesting. I want to understand code and memory optimizations , any suggestions for a beginner?
Neo: keep reading my blog!!! 🙂
M1 is fast indeed.
For comparison, this is what I get with a i7-8700K CPU @ 3.7GHz, xubuntu-18.04.5:
$ ./two_or_three
N = 1000000000, 953.7 MB
starting experiments.
two : 16.5 ns
two+ : 18.0 ns
three: 24.6 ns
bogus 1422321000
With clang-11:
$ ./two_or_three
N = 1000000000, 953.7 MB
starting experiments.
two : 17.0 ns
two+ : 18.6 ns
three: 25.3 ns
bogus 1422321000
Hmmm, the value “bogus” is different than on M1.
rand() does not give the same results on different platforms.
I wonder how much this matters when comparing.
It would be better to use a deterministic random number generator.
Hi,
I’ve run your tests on my Windows laptop (i7-8665U) and results are much better:
c:\work\test>>two_or_three.exe
N = 1000000000, 953.7 MB
starting experiments.
two : 1.0 ns
two+ : 1.4 ns
three: 1.7 ns
bogus 1458643000
The variance of first test is pretty big (up to 1.4 ns) while the other two are stable.
Interesting, although I think it’s extremely unlikely that your results are correct. I’d suspect a bug. Did you change anything in the code to allow it to run on Windows? Maybe reducing the sizes to everything fits in L1? If not, I’m guessing there is something wrong with the Daniel’s time measurement code on Windows. It probably hasn’t been tested nearly as much as on Mac or Linux. Alternatively, maybe the compiler on Windows has figured out a way to defeat the benchmark by optimizing out the actual memory accesses? If you have time, I think it would be useful to figure out what’s happening here.
Apple M1 chip adopts “warehouse/workshop model”
Warehouse: unified memory
Workshop: CPU, GPU, and other cores
Product( material): information, data
there’s also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance.
reference: Developer Delves Into Reasons Why Apple’s M1 Chip is So Fast
from: The Grand Unified Programming Theory: The Pure Function Pipeline Data Flow with Warehouse/Workshop Model
The 2 vs 2+ results is indeed interesting. However, I’m a bit not sure about the point of the 3 test.
Should the 3 version be any different from the 2 version with a M that’s 50% larger? I feel like this should be the case (apart from the loop overhead) no matter what system this runs on. Since there’s no barriers or anything like that between loop iterations, shouldn’t the total number of memory references be the only thing that matters.
However, I’m a bit not sure about the point of the 3 test. (…) shouldn’t the total number of memory references be the only thing that matters
Naively, you would think so, wouldn’t you? But that’s not what happens in the M1.
That’s the point of the three tests, to show that what you think might happen does not.
Hello! Thanks for sharing.
I have done my test on Ryzen 4500u on Lenovo Flex 5
and Ubuntu OS.
Here my results.
N = 1000000000, 953.7 MB
starting experiments.
two : 40.4 ns
two+ : 49.7 ns
three: 59.9 ns
bogus 1422321000
I think M1 has much more parallelism than you might think:
https://breaking-the-system.blogspot.com/2022/10/m1-is-much-faster-than-what-people.html
I got different results than you, but I might miss something here.