Daniel Lemire's blog


Measuring the memory-level parallelism of a system using a small C++ program?

56 thoughts on “Measuring the memory-level parallelism of a system using a small C++ program?”

  1. Alun J. Carr says:

    With Ye Olde Core i7-3667U (Ivy Bridge) in my mid-2012 MacBook Air, I get a value of 7, for what it’s worth.

  2. Very interesting.
    I have seen this being exploited by cuckoo hashing, where two independent locations are accessed in parallel through speculative execution. Now the question is, if modern CPUs are able to access up to 10 locations at the same time, why are we using only two locations for cuckoo hashing? It might make sense to revisit this a bit… 😀
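
    To be concrete about why the two probes can proceed in parallel: both bucket addresses are computable before any memory is touched, so the two loads are independent. A minimal sketch (the hash functions, table layout and empty-slot convention here are illustrative choices, not any particular implementation):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct CuckooTable {
        std::vector<uint64_t> slots;   // power-of-two size; 0 marks an empty slot
        uint64_t mask;                 // slots.size() - 1

        static uint64_t h1(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }
        static uint64_t h2(uint64_t k) { return (k ^ (k >> 29)) * 0xC2B2AE3D27D4EB4Full; }

        bool contains(uint64_t key) const {
            size_t i1 = static_cast<size_t>(h1(key) & mask);
            size_t i2 = static_cast<size_t>(h2(key) & mask);
            // Neither load depends on the other, so an out-of-order core
            // can have both cache misses in flight at the same time.
            uint64_t a = slots[i1];
            uint64_t b = slots[i2];
            return a == key || b == key;
        }
    };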

    1. Travis Downs says:

      Despite chips supporting some MLP, it’s not free in any sense to check 10 locations (especially when they are not contiguous) rather than 2. You only need two locations to get the “power of choice” effect that dramatically reduces the collision rate, making this type of hashing feasible in the first place.

      Also, there is a temptation to analyze everything in isolation, e.g. the effect of a single hash lookup. Maybe you find that in a latency sense, for a single lookup, using a large number of parallel lookups is fastest since you fully exploit the available MLP. Then as soon as you put that implementation into some real code that does a bunch of back-to-back lookups, you find that it’s slower – because the “unused” MLP in a single lookup was actually being used effectively across different lookups. Ultimately there might not be a single best approach: one approach might be fastest when used in a place where adjacent operations can also exploit MLP, while another might be faster when that’s not the case (sometimes this division corresponds to the throughput vs latency distinction but not always).

      1. Going from 2 accesses to 3 accesses, though not “free”, is maybe far more practical than it appears.

        1. Travis Downs says:

          Yes, free, but in the context of this specific test using independent chains which is designed to measure MLP. If anything is going to show the third access being free, it is this type of code.

          Increasing the number of probed locations in a hashing algorithm is another thing entirely. There is a bunch of work beyond the access, and the accesses aren’t part of long independent chains, but start from a common point and converge after a single access. So it is unlikely to be free, although how much it actually costs depends on a lot of factors, including the surrounding code.

          1. So we are clear, I did not write that it would be free. I wrote that an additional memory access is maybe less expensive than you might think if you do not take parallelism into account.

            1. Travis Downs says:

              Ooops, I misread it: I didn’t see the “not” in front of free.

              So I still think though that just adding additional probes to cuckoo hashing is probably not very useful in general.

              My feeling about hashing is that the optimal approach would take advantage of SIMD to do many comparisons at once, and make the common found and not found approaches constant time and predictable (i.e., not unpredictable branches when you find the element even if it is not in its “nominal” slot). So something like a bucketized hash.

              Then you can still layer on a clever reprobing strategy like cuckoo or some variant, but mostly as a fall-back case when you have a lot of collisions.

              1. My feeling about hashing is that the optimal approach would take advantage of SIMD to do many comparisons at once, and make the common found and not found approaches constant time and predictable. So something like a bucketized hash.

                I would definitely prefer SIMD on a single cache line to two cache misses. I am certainly not arguing for more cache misses.
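
                To be concrete, this is roughly what I have in mind (a sketch only; the 16-slot bucket with one-byte tags is an illustrative layout, not a reference to any specific library): one SSE2 compare checks every slot of a bucket, and the full keys are only touched on a tag match.

                #include <cstdint>
                #include <immintrin.h>

                struct Bucket {
                    alignas(64) uint8_t tags[16];   // one-byte hash tags, one per slot
                    uint64_t keys[16];              // full keys, checked only on a tag match
                };

                bool bucket_contains(const Bucket& b, uint64_t key, uint8_t tag) {
                    __m128i probe  = _mm_set1_epi8(static_cast<char>(tag));
                    __m128i stored = _mm_load_si128(reinterpret_cast<const __m128i*>(b.tags));
                    // One compare + movemask covers all 16 slots at once.
                    uint32_t mask = static_cast<uint32_t>(_mm_movemask_epi8(_mm_cmpeq_epi8(stored, probe)));
                    while (mask) {                          // usually zero or one bit set
                        int slot = __builtin_ctz(mask);     // GCC/Clang builtin
                        if (b.keys[slot] == key) return true;
                        mask &= mask - 1;                   // clear the lowest set bit
                    }
                    return false;
                }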

              2. Nathan Kurz says:

                So I still think though that just adding additional probes to cuckoo
                hashing is probably not very useful in general.

                Huh, I’m with Antonio here, and think that cuckoo hashes are the perfect example of how MLP can be exploited for big performance gains. Use SIMD to generate parallel hashes, use MLP to gather the lookups, and then use SIMD again to check the results. Win, win, win! I’m surprised that you and Daniel don’t see the advantage.

                You are obviously correct that adding additional probes is not completely free and that multiple probes does not speed up access. But you seem to be missing the big advantages of having more places to put things, which are reduced memory size (better fill ratios) and reduced work on insertions. If you can get (almost) equal access times for read-only work, you get big wins everywhere else.

                I think the theory is really solid on this. A good search term is “d-ary cuckoo hash”, where “d” is the number of parallel slots. The original paper on it is Fotakis, Pagh, Sanders, and Spirakis 2005: http://www.cs.princeton.edu/courses/archive/fall09/cos521/Handouts/spaceefficient.pdf

                Their conclusion is that going from d=2 to d=4 is a big gain: “We also provide experimental evidence which indicates that d-ary Cuckoo Hashing is even better in practice than our theoretical performance guarantees. For example, at d=4, we can achieve 97% space utilization and at 90% space utilization, insertion requires only about 20 memory probes on the average, i.e., only about a factor two more than for d=∞.”

                Viewed alternatively, part of the reason that cuckoo hashing works so well is that even with d=2, we are able (if designed correctly) to take advantage of parallel memory accesses. But going even a little farther (from 2 to 4 or 8) costs us very little for read speeds, while giving lots of benefit for writes. Nicely, the theoretical sweet spot matches well with the current limits for MLP.

                1. Travis Downs says:

                  Huh, I’m with Antonio here, and think that cuckoo hashes are the
                  perfect example of how MLP can be exploited for big performance gains.
                  Use SIMD to generate parallel hashes, use MLP to gather the lookups,
                  and then use SIMD again to check the results. Win, win, win! I’m
                  surprised that you and Daniel don’t see the advantage.

                  You are obviously correct that adding additional probes is not
                  completely free and that multiple probes does not speed up access. But
                  you seem to be missing the big advantages of having more places to put
                  things, which are reduced memory size (better fill ratios) and reduced
                  work on insertions. If you can get (almost) equal access times for
                  read-only work, you get big wins everywhere else.

                  First, when considering hash tables, I should note that I am biased towards evaluating lookups: so I’m pretty much evaluating how fast the lookups are as the key criterion. That might not be right for every use: if you have insert- and delete-heavy hash tables, you’ll obviously care about the performance of those operations.

                  Maybe I got that way because in many simple hash tables, “lookup” and “insert” are pretty much the same thing: to insert you just do a lookup and if you don’t find the element you are already at the insertion point, so the performance is very similar. That is not necessarily the case for cuckoo, where inserts can be very asymmetric compared to lookups (especially if you are pushing the load factor).

                  So in that view, exploiting MLP to create a d-ary Cuckoo is far from a “perfect” example, because it possibly provides no speedup at all for hash lookups. The effect is only very indirect: you might get some speedup due to a higher load factor, but since even 2-ary cuckoo hashes support a fairly high load factor this benefit is limited, and will mostly occur only at certain thresholds where the 2-ary cuckoo slips into the next cache level, but the 3-ary or d-ary cuckoo fits in the higher level cache. It is not at all clear to me that benefit can overcome, on average, the penalty of more accesses.

                  I’ll admit to having a wrong idea of how high load factors get in 2-ary cuckoo hash tables though: I thought they approached 80% to 90% before rehashing, but a skim (and the paper you linked) seems to indicate it would be below 50%, so there is some room for better space usage.

                  To me, the “perfect” example of MLP gets approximately a factor of N improvement by doing N memory accesses in parallel, and examples like that exist (many like linear search work like that without the author consciously being aware of MLP though). So I’m not even saying that a 3-ary cuckoo wouldn’t be a win in some cases, but that is very far from an obvious example of how we can exploit MLP for a big win (to me).

                  The original paper on it is Fotakis, Pagh, Sanders, and Spirakis 2005

                  I had read it before, I think, but only skimmed it again now. The main point seems to be that you can get much higher load factors, approaching 100%, with d-ary cuckoo hashing, right? They seem to be primarily evaluating it on that basis, not on performance.

                  and at 90% space utilization, insertion requires only about 20 memory probes on the average

                  This sort of contradicts the notion that d-ary cuckoo hashing makes insertion fast, doesn’t it? If you are doing twenty probes on average to insert your element, you are not going to have fast inserts, even if you try to exploit MLP. You should look for less than 2 probes on average.

                  I’m not saying the paper is wrong: I’m saying it is targeting something different (optimal space usage) than what most people want from a hash table.

                  But going even a little farther (from 2 to 4 or 8) costs us very little for read speeds, while giving lots of benefit for writes.

                  Did you try it? I’m not convinced about either part. For one thing, inserts also need to do all the work of a lookup, so if you make lookups slower by doing more probes, you also make that part of inserts slower. The “cuckoo” part of inserts, where you push elements around until you find an empty space, might be faster, but is this really a large part of the insert cost for a properly sized table?

                  Also, the paper you linked makes it clear that if you want to use the d-ary nature to crank up the load factor, you are going to be probing a lot of locations on insert.

                  Viewed alternatively, part of the reason that cuckoo hashing works so well is that even with d=2, we are able (if designed correctly) to take advantage of parallel memory accesses.

                  Yes, although this is common to many fast hashing techniques as well. For example, people keep rediscovering that chaining is slow, compared to various open-addressing schemes – and that the slowness is worse than one would expect if you just considered the number of probes – and this is in part because chaining has zero MLP, but most (all?) open-addressing schemes naturally get approximately unlimited MLP.
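
                  To illustrate the difference (toy types, with 0 as an arbitrary empty-slot sentinel): in the chained lookup every load address comes from the previous load, while with linear probing the probe addresses are known up front, so the core can speculate past the termination check and keep several probes in flight.

                  #include <cstdint>

                  struct Node { uint64_t key; Node* next; };

                  // Chained lookup: a pure pointer chase, so essentially zero MLP per lookup.
                  bool chained_contains(Node* const* heads, uint64_t nbuckets, uint64_t h, uint64_t key) {
                      for (const Node* n = heads[h % nbuckets]; n != nullptr; n = n->next)
                          if (n->key == key) return true;    // the next address needs this load's result
                      return false;
                  }

                  // Linear probing: addresses are h, h+1, h+2, ... independent of loaded data.
                  bool probed_contains(const uint64_t* slots, uint64_t mask, uint64_t h, uint64_t key) {
                      for (uint64_t i = 0;; ++i) {
                          uint64_t v = slots[(h + i) & mask];
                          if (v == key) return true;
                          if (v == 0) return false;          // 0 used here as the empty sentinel
                      }
                  }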

                  I have a couple more things to say, but man this is already getting long … it’s a weakness of mine. Overwhelm the other party with a wall of text :).

                  1. Nathan Kurz says:

                    To me, the “perfect” example of MLP gets approximately a factor of N
                    improvement by doing N memory accesses in parallel

                    Sure, this is great when you can get it. But I’d call it a “perfect example” if it describes a situation where a better algorithm that naively looks much more expensive (N * x) turns out to come at an affordable price (1.1 * x).

                    The “cuckoo” part of inserts, where you push elements around until you
                    find an empty space, might be faster, but is this really a large part
                    of the insert cost for a properly sized table?

                    Yes, while it depends on what you mean by “properly sized”, I think this turns out to be a dominant cost. At anything approaching “full”, the costs of insertion are going to be dominated by the length of the eviction chains. And while it’s good to have all your lookups be 1 cycle faster, this might be unaffordable if it means that 1 insert of a million requires a full rehash requiring a few seconds to finish.

                    Arguing against myself (saying that 2-ary is sufficient), it’s possible that there are better ways to achieve resilience while still keeping high memory density. One is to have a “stash” table that can also be used as a last resort for anything that can’t otherwise fit. Another is to use larger buckets per hash that take advantage of cacheline-sized transfers. This 2006 Microsoft paper by Erlingsson, Manasse and McSherry argues that larger buckets are better than more hashes: https://www.ru.is/faculty/ulfar/CuckooHash.pdf

                    If you are doing twenty probes on average to insert your element, you
                    are not going to have fast inserts, even if you try to exploit MLP.
                    You should look for less than 2 probes on average.

                    I think the goal of “fast inserts” is really to make the worst cases very rare, and not quite as expensive. At a 1000:1 read to write ratio, 20 probes with low variance might be a fine solution, especially if it gives you the ability to “upgrade in place” rather than requiring a “stop the world” rehash event.

                    I have a couple more things to say, but man this is already getting
                    long … it’s a weakness of mine. Overwhelm the other party with a wall
                    of text :).

                    Keep them coming — I like your walls of text, especially when I’m searching for something a year later and I find you’ve already given a complete answer.

                    1. Travis Downs says:

                      Sure, this is great when you can get it. But I’d call it a “perfect example” if it describes a situation where a better algorithm that naively looks much more expensive (N * x) turns out to come at an affordable price (1.1 * x).

                      Yes, that’s a perfect example of MLP causing a surprising result, but it may not be an example of MLP making a good algorithm faster, or of a better-than-expected implementation being available due to MLP, which is what I thought we were after.

                      As an example, I could take any algorithm and add some redundant accesses that don’t really do anything, and it might not be as slow as you might think due to MLP (in the extreme, it might be the same speed as the original algorithm). So that would be a perfect example of MLP making something faster than expected but not actually any better because the new accesses are useless.

                      It’s not just a vapid example: it’s more or less how I saw the cuckoo suggestion: of course you can probe more locations, but why? It doesn’t obviously speed up probing (and I think it will slow it down). It may speed up inserts, and it may speed up probing indirectly due to better caching, but I would definitely classify that as a subtle argument versus a perfect example.

                      Yes, while it depends on what you mean by “properly sized”, I think this turns out to be a dominant cost. At anything approaching “full”, the costs of insertion are going to be dominated by the length of the eviction chains. And while it’s good to have all your lookups be 1 cycle faster, this might be unaffordable if it means that 1 insert of a million requires a full rehash requiring a few seconds to finish.

                      I admit to not knowing what is the dominant cost for inserts. By “properly sized” I mean that we assume the user isn’t stupid and so sets his rehash threshold appropriately for his workload. For example, if he cares about inserts mostly, then probably he will do the rehashing more often to keep the chains lower (both for 2-ary and other arities).

                      So given that, I think that properly sized 2-ary and N-ary will both have reasonable insert behavior, and about the same number of rehashes, but that the 2-ary table will just be bigger because it can sustain a lower load factor. Sustain is really the wrong word: what I mean is that a properly sized table will have a lower load factor for 2-ary (indeed, support for higher load factors is the primary advantage of higher arities).

                      I think the goal of “fast inserts” is really to make the worst cases very rare, and not quite as expensive. At a 1000:1 read to write ratio, 20 probes with low variance might be a fine solution, …

                      Well if you have a 1000:1 read to write ratio, then you really don’t want to add probes to your read case! I’d argue there you just optimize for reads.

                      that larger buckets are better than more hashes

                      Yes, that’s basically what I meant by “bucketized” hash above, although I don’t know what the best way to handle overflow is. Probably it depends on the use case: if you don’t need to support deletes (or are fine with slow deletes), for example, many strategies become simpler or more efficient.

                      … especially if it gives you the ability to “upgrade in place” rather than requiring a “stop the world” rehash event.

                      As above, I think this is the wrong way to look at it. Rehashes are going to occur in 2-ary and 3-ary tables. It doesn’t qualitatively change the basic tradeoffs and worst-case edges for cuckoo. If you compare a 2-ary table that needs to rehash all the time, destroying your SLAs due to stop-the-world events, vs a 3-ary one that never does that, then you are “doing it wrong” somehow with the 2-ary case.

                      Keep them coming

                      OK, I’ll add one per reply so things don’t get out of hand with a giant branching conversation :).

                      Here’s one that is not too convincing but it is worth mentioning.

                      I thought (I could be wrong) the original post sort of suggested the extension to a 3-ary (or more generally N-ary) search was “obvious”, i.e., in the same way that binary search is extended to 3-ary search and beyond.

                      It’s not so simple though! 2-ary is special in cuckoo because it means that the insertion search is linear: there is only one (other) place to put each element so your path is set, and you can detect cycles and so on (except for the first choice, where you can kick out either element)

                      3-ary is already much more complicated: when you are evicting an element, you have 2 possible alternate places to put it. So it’s a tree search, and detecting cycles is more complicated, etc.

                      So to be really concrete about everything, we should specify how the 3-ary cuckoo is actually implemented, because those choices matter. The paper you linked, for example, suggests a random walk that just stops after some threshold number of nodes have been visited – but that will certainly have an additional cost versus the simple search for the 2-ary case.
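
                      To make the “linear walk” point concrete, here is a sketch of a 2-ary insert (building on the illustrative CuckooTable sketched earlier in this thread; MAX_KICKS is an arbitrary cycle-breaking threshold, not taken from any paper):

                      #include <utility>   // std::swap

                      constexpr int MAX_KICKS = 64;   // arbitrary bound; a real table would grow or rehash here

                      bool insert(CuckooTable& t, uint64_t key) {
                          size_t i = static_cast<size_t>(CuckooTable::h1(key) & t.mask);
                          for (int kicks = 0; kicks < MAX_KICKS; ++kicks) {
                              if (t.slots[i] == 0) { t.slots[i] = key; return true; }
                              std::swap(key, t.slots[i]);   // evict the current occupant
                              size_t a = static_cast<size_t>(CuckooTable::h1(key) & t.mask);
                              size_t b = static_cast<size_t>(CuckooTable::h2(key) & t.mask);
                              i = (i == a) ? b : a;         // its only other possible home: the walk is a single chain
                          }
                          return false;                     // give up; the caller would rehash or grow
                      }

                      With a third hash there would be two possible destinations at every eviction, so the walk becomes a tree search, which is exactly the extra complication described above.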

  • it’s not free in any sense to check 10 locations (especially when they are not contiguous)

    If they run in parallel (since independent), latency-wise doesn’t it mean that checking 10 locations is the same as checking one?

    1. Travis Downs says:

      No, no and no.

      No in the first sense, from an MLP point of view, because even if the hardware offers support for “10 concurrent requests”, 10 requests almost certainly still take longer than 1 request. Not 10x longer, but maybe 2x or 1.5x or whatever longer. Not all parts of the path to memory are perfectly parallel (for example a DIMM is only servicing one request at once, so at least at that point the requests will have to wait in line).

      Also there is variance in the time to service requests to DRAM, so the time to service 10 requests, even if perfectly parallel, will in general be longer than to service one request, because the slowest of 10 requests will be slower 9 times out of 10 than the slowest of 1 request. Just like if you mail out 10 letters in “parallel” the time until all letters arrive will be longer than if you just mailed out 1, even if the average time per letter was the same, due to variance.

      No in the second sense because the time to do a lookup in a hash table isn’t just the memory access. It’s a bunch of other work too, like calculating the hashes, checking the results of the probes, and combining the results of the probes into an answer. In some cases (cache resident lookups), the memory access may be a relatively minor part of the total cost.

      No in the third sense because the “wasted” MLP you want to take advantage of (i.e., the 8 wasted slots out of the 10 available), often isn’t wasted at all: code before or after your lookup might have been using it (including other hash lookups). So doing 10 lookups is only free in an MLP sense (and ignoring the issues raised above) if you can guarantee that there are no other accesses that would have overlapped with those implied by the lookup.

      What it does mean, however, is that if you find that latency to memory is 100 ns, then adding 8 additional lookups probably won’t cost anywhere near 8 * 100 = 800 ns, but much less.

      You may be misunderstanding the advantages of Cuckoo hash though: one reason it is fast is that the Cuckoo strategy means you don’t have to probe many locations to find an item. If you want to probe a lot of locations for some reason, much simpler old-school probe functions will let you do just that.
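
      To put a toy number on the “slowest of 10” point above: even if every request has the same mean latency, the completion time of a batch is set by its slowest member. A small simulation (the latency distribution here is arbitrary, chosen only to show the effect):

      #include <algorithm>
      #include <cstdio>
      #include <random>

      int main() {
          std::mt19937_64 rng(42);
          std::exponential_distribution<double> latency(1.0 / 100.0);  // mean ~100 ns
          const int trials = 100000;
          double single = 0, slowest_of_10 = 0;
          for (int t = 0; t < trials; ++t) {
              single += latency(rng);
              double worst = 0;
              for (int i = 0; i < 10; ++i) worst = std::max(worst, latency(rng));
              slowest_of_10 += worst;
          }
          std::printf("mean single request: %.1f ns, mean slowest of 10: %.1f ns\n",
                      single / trials, slowest_of_10 / trials);
          return 0;
      }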

      1. No in the first sense, from an MLP point of view, because even if the hardware offers support for “10 concurrent requests”, 10 requests almost certainly still take longer than 1 request. Not 10x longer, but maybe 2x or 1.5x or whatever longer. Not all parts of the path to memory are perfectly parallel (for example a DIMM is only servicing one request at once, so at least at that point the requests will have to wait in line).

        Is this coming from an assumption or an actual measurement?

        No in the second sense because the time to do a lookup in a hash table isn’t just the memory access. It’s a bunch of other work too, like calculating the hashes, checking the results of the probes, and combining the results of the probes into an answer. In some cases (cache resident lookups), the memory access may be a relatively minor part of the total cost.

        My question is just relative to memory access at this point. BTW memory access is never a relatively minor part of the total cost, in my experience.

        No in the third sense because the “wasted” MLP you want to take advantage of (i.e., the 8 wasted slots out of the 10 available), often isn’t wasted at all: code before or after your lookup might have been using it (including other hash lookups). So doing 10 lookups is only free in an MLP sense (and ignoring the issues raised above) if you can guarantee that there are no other accesses that would have overlapped with those implied by the lookup.

        I am only interested from a serial execution perspective right now, if that’s what you are talking about. Anyway here we are talking about MLP per core – “each core has an upper limit on the number of outstanding memory requests”.

        You may be misunderstanding the advantages of Cuckoo hash though.

        I do not think I am.

        one reason it is fast is that the Cuckoo strategy means you don’t have to probe many locations to find an item

        “in separate chaining, following the linked list of items in each slot requires the CPU to wait to read each node in order to discover the next pointer, whereas in cuckoo hashing, the addresses of both cells that a key maps to are computable before doing any memory accesses, so the CPU may speculatively fetch both addresses. We definitely hope processor vendors don’t take away our speculative execution!”

        https://dawn.cs.stanford.edu/2018/01/11/index-baselines/

        1. I quoted the wrong sentence of the blog post.

          “This difference surprised us, because the separate chaining table should be making fewer memory accesses per key than the cuckoo table on average, but one reason it can happen on modern hardware is indirect accesses”

          https://dawn.cs.stanford.edu/2018/01/11/index-baselines/

        2. Travis Downs says:

          Is this coming from an assumption or an actual measurement?

          Both. It’s just the way the hardware works, so that’s the “assumption” part – but it’s easy to measure and I have measured it many times. You can measure it with Daniel’s program that he links above: it outputs the runtime for the test and you can see that despite having “hardware” parallelism of 10 (in particular, that there are 10 line fill buffers on modern chips), you never get to 10x the performance, and as you approach 10 parallel accesses, the performance increase is substantially less than the maximum.

          Here’s a tangentially related post that upends a similar idea: that accesses to the same cache line are essentially free once you’ve brought the line in from memory. Even though that code is missing on every cache line, and so would appear to be completely memory bound, the supposedly “free” addition of the other elements ends up costing a lot, despite the typical model of how this works indicating that they should cost essentially nothing.

          My question is just relative to memory access at this point. BTW memory access is never a relatively minor part of the total cost, in my experience.

          I don’t think such a question can be answered directly, since you can’t just isolate memory access from the rest of the work – but yes, in general, the more time your program spends waiting on long-latency misses without many such misses in flight at one time, the more opportunity there is to “slip” additional misses into that window.

          Usually you try to do this by rearranging your existing misses so more of them can happen in parallel, increasing MLP in a “useful” way, not by adding extra misses that weren’t needed in the first place to “use up” the “unused MLP”. That is, high MLP is a means to higher performance, not an end in itself.

          I am only interested from a serial execution perspective right now, if that’s what you are talking about. Anyway here we are talking about MLP per core – “each core has an upper limit on the number of outstanding memory requests”.

          I am only talking about a single core. If I talk about “serial” or “parallel” I am only talking about within a single core, and those words refer to the inherent serial or parallel nature of the algorithm (e.g., ILP and MLP).

          For example, looking up 100 given keys in a hash is a “parallel” type of work (at least with respect to the hash map). Even though you write serial code to look up the keys one by one, the CPU will probably be able to do some of the accesses in parallel, since there is no dependence from the result of one lookup to the input of the next. In this kind of scenario, trying to squeeze out a lot of MLP in a single lookup may not be productive, because you would have used much or all of that MLP anyway because the CPU will overlap the accesses for different lookups.

          A serial example is a bit more contrived (in this case), but would be something like starting with one key, looking it up, then using the result of the lookup to calculate a second key value, and so on, for 100 lookups. Here the CPU cannot overlap the accesses in separate lookups so any MLP you get has to come from accesses within a single lookup.
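
          In code, the two patterns look something like this (the container is incidental, a plain std::unordered_map; what matters is whether each lookup’s input depends on the previous lookup’s output):

          #include <cstddef>
          #include <cstdint>
          #include <unordered_map>
          #include <vector>

          using Table = std::unordered_map<uint64_t, uint64_t>;

          // "Parallel" pattern: the keys are known up front, so the misses of
          // different lookups can overlap even though the code is written serially.
          uint64_t lookup_all(const Table& t, const std::vector<uint64_t>& keys) {
              uint64_t sum = 0;
              for (uint64_t k : keys) {
                  auto it = t.find(k);
                  if (it != t.end()) sum += it->second;
              }
              return sum;
          }

          // "Serial" pattern: each key is computed from the previous result, so the
          // misses are forced to happen one after another.
          uint64_t lookup_chain(const Table& t, uint64_t key, size_t n) {
              for (size_t i = 0; i < n; ++i) {
                  auto it = t.find(key);
                  if (it == t.end()) break;
                  key = it->second;
              }
              return key;
          }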

          “in separate chaining, following the linked list of items in each slot requires the CPU to wait to read each node in order to discover the next pointer, whereas in cuckoo hashing, the addresses of both cells that a key maps to are computable before doing any memory accesses, so the CPU may speculatively fetch both addresses. We definitely hope processor vendors don’t take away our speculative execution!”

          Yes, I know. Chaining involves pointer chasing and is slow. Looking up all buckets in parallel is (often) good and fast hashes will do it, it’s not something unique to Cuckoo hashing.

          I’m still not following your suggestion though: you don’t want to start adding more lookups “just because” there is some unused MLP, unless it can speed things up, because they are not free. Cuckoo hashing is fast in part because an item appears in exactly one of two locations, and these locations can be checked in parallel. Changing that to three locations doesn’t obviously speed that up, and in fact would slow it down in general. I guess what it would let you do is increase the load factor slightly by delaying the rehash. That’s probably a minor effect – will it be able to pay off enough on average to overcome the cost of the third access? Maybe.

          There are plenty of cases though where additional parallel lookups are just an obvious big win. Any time you can change your accesses from serial to parallel (as above), you can get a giant leap, even if it looks like you are doing a lot more work. For example, in any kind of depth-first graph search algorithm, you want to try to mix in a bit of BFS by maintaining a “horizon” of active nodes that you expand in a round-robin fashion, rather than the default, clean approach of DFS. It complicates things, makes you execute more total instructions, and possibly more total loads, but ends up faster because you get more MLP.
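
          A rough sketch of the “horizon” idea, assuming a CSR-style graph layout (the batch size K is an arbitrary choice): instead of expanding one node at a time, pull a small batch off the stack so the visited/adjacency reads for the batch members are independent of each other and their misses can overlap.

          #include <cstdint>
          #include <vector>

          struct Graph {
              std::vector<uint32_t> offsets;   // offsets[v]..offsets[v+1] index into edges
              std::vector<uint32_t> edges;
          };

          void traverse(const Graph& g, uint32_t root, std::vector<uint8_t>& visited) {
              constexpr int K = 8;                   // "horizon" size
              uint32_t batch[K];
              std::vector<uint32_t> stack = { root };
              visited[root] = 1;
              while (!stack.empty()) {
                  int n = 0;                         // grab up to K nodes: their reads
                  while (n < K && !stack.empty()) {  // below do not depend on one another
                      batch[n++] = stack.back();
                      stack.pop_back();
                  }
                  for (int i = 0; i < n; ++i) {
                      uint32_t v = batch[i];
                      for (uint32_t e = g.offsets[v]; e < g.offsets[v + 1]; ++e) {
                          uint32_t w = g.edges[e];
                          if (!visited[w]) { visited[w] = 1; stack.push_back(w); }
                      }
                  }
              }
          }

          The order is no longer a strict DFS and there are more instructions, but several of the cache misses in each round can now be in flight together.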

        3. Jeff Plaisance says:

          you are not taking into account what is happening after the hash table lookup. by filling all of the available slots you are preventing the processor from executing further ahead in your program and finding more useful things to load. take for example doing these hash table lookups in a loop. if you have 10 load slots and do 10 loads per hash table lookup, you will wait for all 10 loads to return before starting the next iteration of the loop. if you only do 2 loads per hash table lookup, your processor may be able to continue executing instructions out of order into the next iteration of the loop, and you may be able to have the loads for 5 iterations of the loop executing simultaneously, which would be close to 5x faster.

          of course this requires the next hash table lookup to not depend on the result of the current hash table lookup. this is why the chained approach is so slow, pointer based data structures are a very poor match for modern processors since the address of the next load is dependent on the result of the current load.

          1. Absolutely true, but you are missing the advantages of having 10 possible slots instead of 2 only.
            Now, when inserting a key, I need to have 10 collisions instead of 2 before performing a cascading relocation which can end up in O(n) asymptotic complexity. Probabilistically speaking now it is quite rare to happen…

            Now, don’t get me wrong, I am not saying one solution is better than the other. The blanket is just too short, so it is all about finding the right trade-off for your data.

            1. Antonio Mallia says:

              Sorry for the multiple replies. The website had a 500 internal server error while I was sending my answer.

            2. I am aware of the numerous 500 errors. As far as I can tell, they have to do with my server being overloaded. I have issued a technical request to my service provider and I will find a solution with them.

  • Benoit says:

    This is awesome! There is so little info or optimization tools for memory access (other than from you I feel). Thank you!

    I ran your code on two iOS devices (Release build, outside of debugger). Here are the results:

    A9X (iPad Pro 12.9″ 1st gen): 5

    A11 (iPhone X): 3, maybe 4

    Here is the last run on the iPhone X:

    1 1.130405
    2 0.600105
    3 0.560135
    4 0.540118
    5 0.550119
    6 0.500108
    7 0.520111
    8 0.430093
    9 0.460116
    10 0.450103
    11 0.340088
    12 0.320091
    13 0.310085
    14 0.300080

  • It sure looks like the latest iPhone has plenty of memory-level parallelism.

    1. Benoit says:

      Why? Do you think there is something wrong with the iPhone X run? Or that the 2018 iPhone are much better than the X?

      1. Sorry, by latest I meant “iPhone X”. It is clearly not the latest on the market though it is the latest in your tests.

        My benchmark goes up to 14 and stops there because I did not expect anything much to happen after 14. Clearly, the numbers you have suggest that it would be worth going beyond 14 lanes.

        In this sense that A11 processor defeats my benchmark.

        Something happens around 4 or 5 lanes that makes a lane increase “bad”, but if you keep going, then more parallelism shows up.

        My benchmark is crude, so you should not believe what it says, you need to take a look at the numbers for yourself.

        1. Benoit says:

          Thanks. Yes, the story seems more complicated on Apple processors. Below are the numbers up to 29 lanes. It looks suspicious at 15 on the iPad, but it’s not a fluke: all runs are virtually the same.

          iPad Pro 1G iPhone X

          1 1.59747 1 1.140326
          2 0.845847 2 0.620127
          3 0.714404 3 0.560123
          4 0.609029 4 0.550142
          5 0.490035 5 0.550118
          6 0.493946 6 0.500108
          7 0.508048 7 0.520111
          8 0.469345 8 0.430103
          9 0.434727 9 0.460096
          10 0.43067 10 0.440094
          11 0.419054 11 0.340067
          12 0.498453 12 0.320072
          13 0.614335 13 0.310072
          14 0.581587 14 0.300075
          15 0.101332 15 0.300057
          16 0.104928 16 0.280063
          17 0.103828 17 0.270066
          18 0.104522 18 0.240057
          19 0.106745 19 0.200047
          20 0.104215 20 0.120028
          21 0.100903 21 0.080018
          22 0.100042 22 0.080017
          23 0.099389 23 0.080018
          24 0.105125 24 0.080018
          25 0.106232 25 0.080018
          26 0.104692 26 0.080016
          27 0.100769 27 0.080016
          28 0.103166 28 0.080018
          29 0.100798 29 0.080014

          1. What happens at 15 looks more like a limitation of my benchmark than a genuine measure.

            1. Benoit says:

              I reran on the iPad pro after adding a *15 factor to your howmanyhits formula. Same curve scaled by 15, same drop at 15 lanes. And there is no dumb cut-and-paste mistake in the code I added (in case you are, rightly, wondering ;-). Anyway, it occurs at 22 on the iPhone X…

              1. I reran on the iPad pro after adding a x15 factor to your howmanyhits formula.

                I doubt that it would change anything.

                And there is no dumb cut-and-paste mistake in the code I added

                I am not assuming that the mistake is yours.

                How do you run these tests on iOS? It has been some time since I ran shell programs on iOS. The last time I did, I was using a rooted iPad (first generation), with a ssh server…

                Presumably, you could wrap the code inside some “App” and upload that to the device, but it seems like a lot of work.

                1. Benoit says:

                  I am not assuming that the mistake is yours.

                  You are too kind. That would have been my first thought 😉

              2. harold says:

                Does it still happen if you add a volatile sink like this? I had some trouble without that using MSVC, getting some suspiciously low timings for some but not all lane counts. Perhaps something like that is happening here too, but only starting at some high lane count?
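
                For concreteness, a minimal volatile sink might look like the sketch below (names are illustrative): the final chain values are stored to a volatile object so the compiler cannot treat the chases as dead code and drop some of the loads.

                #include <cstddef>
                #include <cstdint>

                volatile uint64_t sink;   // every store to this must be kept by the compiler

                void chase2(const uint64_t* array, uint64_t i1, uint64_t i2, size_t iterations) {
                    for (size_t i = 0; i < iterations; ++i) {
                        i1 = array[i1];
                        i2 = array[i2];
                    }
                    sink = i1 + i2;       // observable side effect: the loads above cannot be elided
                }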

                1. Travis Downs says:

                  What’s weird is that the effect happens on the iPad but not the iPhone. If it were a result of a compile quirk (which is entirely possible), you’d think it would be the same on both.

                  That is at least the usual case with AOT compilation, but I guess Apple has this thing now where you upload IR to the app store and then it is “specialized” for each device, so potentially that could be making different decisions for the iPad vs the iPhone? Although again I guess the app store is probably not really involved here since this would have just been loaded on directly in “developer mode” or whatever that’s called.

                  Another thing that could cause it is “small” vs “big cores” – as I recall the A12 is a heterogeneous CPU with a mix of higher-speed latency oriented cores, and lower power cores for less demanding or “throughput” oriented tasks. Maybe the cutover is when the process transitions from one core type to another?

                  1. Maybe the cutover is when the process transitions from one core type to another?

                    Why would this happen systematically at the same number of lanes? Not saying that it is impossible… just that if it happens, it is a bit magical.

                    I am hoping to be able to reproduce Benoît’s results.

                    1. Travis Downs says:

                      I don’t know how it works exactly, but it wouldn’t be entirely surprising to me that the basic idea is that a process is profiled and after some period of time running a certain way is moved over to another core. If that time period is the same, I would expect it to move over at about the same time each time (and there is some room for error since we report the min time out of a few runs, so we don’t necessarily expect to see one “half way” run at one point).

                      I don’t think it is necessarily the cause here, but certainly this heterogeneous core thing causes bimodal benchmark results in other cases.

                      One easy check would be to increase/decrease the number of measurements per lane count and see if the transition point moves.

  • Jeff Plaisance says:

    I ran a similar test in 2012 and found that the answer was 5 with hyper threading on and 10 with it off on ivy bridge. Did you have hyper threading turned on or off for your experiment?

    1. I always try to disable hyperthreading. It would be interesting to document the effect of hyperthreading on current processors.

    2. Travis Downs says:

      When you say “5 with hyper threading on” do you mean in a test that ran two threads on the same core, you saw a value of 5 for each thread, for a total of 10? That’s the result I would expect.

      If you mean something else, like you merely turned HT on, but then run a single-threaded test and see a value of only 5, then it is both surprising and interesting.

      1. Jeff Plaisance says:

        responding almost a year late, but what i found was that merely by turning HT on a single threaded test showed a value of 5. this was on either ivy bridge or sandy bridge, so it may be different on modern hardware.

        1. Travis Downs says:

          I don’t think that’s true if you run the test with HT on but nothing else running on the sibling core. That is, all resources should be available to a single thread running on a core, when only a single thread is actually running, even if HT is enabled (the only exception being PMU counters, which do depend on whether HT is on in the BIOS).

          In particular for LFBs (fill buffers: the things that limit the per-core parallelism), I believe they are shared competitively when two threads are running: either thread can use approximately as many as it needs, but of course the limit is still 10. So one thread should be able to use all 10.

          Perhaps what happened is that something else ended up running at the same time, on the same core, as your test? Was it very reproducible?

          1. Jeff Plaisance says:

            It was highly reproducible. I never saw a result indicating the existence of more than 5 load fill buffers in a single threaded test with hyper threading on. I also tested pulling all the dimms except one to see if that had any effect and from what I remember the impact was minimal. I don’t remember how many ranks were on the dimm but I can’t check now since that computer went into the trash a long time ago.

            1. Travis Downs says:

              That is pretty interesting. Maybe those older uarches really did have boot-time-fixed partitioned LFBs.

              Daniel, do you have any machine older than HSW? Such machines are still available in the cloud, although probably not for that much longer.

              1. Travis: yes, I have older boxes still working and accessible but often the software is also old. Get in touch by email.

  • Michel Lemay says:

    I’m wondering.. could some of the inconsistencies seen above be related to the processor pipeline depth? It’s been a while since I studied that but, from what I recall, memory loads were issued in earlier stages. So it looks like you fill the pipeline with load instructions and hit a wall when you have more lanes than space in the pipeline. Does that still make sense?

    1. I think you are correct… but it does not explain the ios numbers where things just speed up insanely beyond 14 lanes (my own code stops at 14 lanes).

  • Luiceur says:

    I am amazed! You did this in 15min!

    1. The initial code was indeed written in maybe 15 minutes. I think it is fair to say that anyone who knows C well enough could have done so just as fast or faster.

      1. Luiceur says:

        Why C++ instead of C, if I may ask?

        1. There is no particular reason in this instance. It would not be much effort to convert the code to pure C. But there would also not be a lot of benefit to it.

  • anon says:

    Your MLP measurement code is likely limited by how big the scheduler window is, not by how many misses the hardware can track.

    For each load miss, your code (next address computation) has 7~8 dependent ops piled up behind it, which likely take up space in the scheduling window.

    Once we have 7~8 misses, we have filled up the scheduler window and the machine will not be able to observe the memory access from the next lane, even when the address for that lane is ready.

  • Travis Downs says:

    That’s why we introduced the “naked” version of the test, which uses N parallel pointer chases and generally has zero additional ops per load. For example, the loop for the naked_access_8 function for 8 parallel chains looks like:

    ff0: add rdx,0x1
    ff4: mov rcx,QWORD PTR [rdi+rcx*8]
    ff8: mov r8,QWORD PTR [rdi+r8*8]
    ffc: cmp rsi,rdx
    fff: mov r9,QWORD PTR [rdi+r9*8]
    1003: mov r10,QWORD PTR [rdi+r10*8]
    1007: mov r11,QWORD PTR [rdi+r11*8]
    100b: mov rax,QWORD PTR [rdi+rax*8]
    100f: mov rbp,QWORD PTR [rdi+rbp*8]
    1013: mov rbx,QWORD PTR [rdi+rbx*8]
    1017: jne ff0 <naked_access_8(unsigned long*, unsigned long)+0x40>

    It’s just the 8 load instructions, plus an increment, compare and jump (in principle, the compiler could have removed the cmp by counting down to zero, but my version of gcc isn’t smart enough).

    The later results Daniel has reported, such as in this post, are all using this version of the test, as far as I know. I’m not sure if the results in the post we are commenting on are using the “naked” or the original hash-based test, but we’ve done both and they are more or less consistent for those platforms. I think Daniel could confirm where the numbers in the table above come from.

    The pointer chasing test has the downside that it starts to produce ugly code once you run out of registers.
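
    For reference, the C++ for such a loop is roughly the sketch below (the real function takes (unsigned long*, unsigned long) as the symbol above shows and presumably sets up its starting indices internally; this only shows the shape of the loop body):

    #include <cstddef>
    #include <cstdint>

    uint64_t naked_access_8(const uint64_t* array, size_t iterations,
                            uint64_t s0, uint64_t s1, uint64_t s2, uint64_t s3,
                            uint64_t s4, uint64_t s5, uint64_t s6, uint64_t s7) {
        for (size_t i = 0; i < iterations; ++i) {
            // eight chains, each depending only on itself, so all eight loads can miss in parallel
            s0 = array[s0]; s1 = array[s1]; s2 = array[s2]; s3 = array[s3];
            s4 = array[s4]; s5 = array[s5]; s6 = array[s6]; s7 = array[s7];
        }
        return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;   // keep every chain live
    }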

  • Heechul Yun says:

    Intel core’s MLP is 10 (in Nehalem/Haswell and likely in newer architectures as well) and ARM’s MLP is 6 (at least in Cortex-A15/A17). Please check this paper (Appendix A; Figure 11) and the code.

    1. Travis Downs says:

      Thanks for the paper, and especially for making your code available. The test you used to determine MLP is the same in spirit as the one that we are using now, and we get similar results [1]. Interestingly, on Skylake, performance increases until an MLP factor of 12, rather than 10 – although the total increase in performance is still around 10x, so it is not clear to me if there are 10, 11 or 12 MSHRs on Skylake. An uncached (NT) write test indicates there are 12 write-combining buffers on Skylake, versus 10 on Haswell, and under the assumption that WC buffers and MSHRs are using the same underlying hardware it points towards 12 MSHRs on Skylake.

      One thing I learned from the paper that is surprising to me was the idea that MSHR exhaustion blocks further access to the cache, including hits:

      Moreover, if there are no remaining MSHRs, further accesses to
      the cache—both hits and misses—are blocked until free MSHRs become
      available [2], because whether a cache access is hit or miss
      is not known at the time of the access [33]. In other words, cache
      hit requests can be delayed if all MSHRs are used up.

      It is not entirely obvious to me why this should be the case, especially for the L1 MSHRs – why could they not service a hit even if all MSHRs are full? I checked the references [2] and [33], and both resources mention this behavior but not why this is the case. Do you have any idea?

      One reason could be that each level of the cache wants to “hand off” the request to the next level, but if there are no MSHRs available at the next level there is essentially nowhere to store the request if it misses, so it is easier just to block all incoming requests if there are no MSHRs. I’m not sure that applies to the L1 though: since you already have a replay mechanism at the scheduler level which will retry rejected requests.

      [1] One difference is that Daniel’s code uses generated functions for each MLP level tested, with local variables for each pointer chasing chain like:

      idx1 = array[idx1];
      idx2 = array[idx2];
      ...

      Rather than using array elements like idx[1], idx[2], and a fall-thru switch that covers all cases. This avoids any extra memory reads or writes under the very basic assumption that the compiler enregisters local variables. Using array elements could also be optimized out, but it requires a much more sophisticated compiler that can scalarize the array accesses through complex control flow like the switch statement.
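
      For contrast, the array-plus-fall-through-switch style might look roughly like this (a reconstruction for illustration, not the actual code from the paper); unless the compiler scalarizes idx[] into registers, each step also pays for the loads and stores of the idx array itself:

      for (size_t i = 0; i < iterations; i++) {
          switch (mlp) {                       // mlp = number of parallel chains in this run
          case 4: idx[3] = array[idx[3]];      // fall through
          case 3: idx[2] = array[idx[2]];      // fall through
          case 2: idx[1] = array[idx[1]];      // fall through
          case 1: idx[0] = array[idx[0]];
          }
      }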

      1. Heechul Yun says:

        It is not entirely obvious to me why this should be the case,
        especially for the L1 MSHRs – why could they not service a hit even if
        all MSHRs are full? I checked the references [2] and [33], and both
        resources mention this behavior but not why this is the case. Do you
        have any idea?

        My understanding is that probing the cache—determining whether an access is a cache hit or miss—takes time (i.e., generally cannot be done in a single cycle, especially for L2/L3).
        You may also find this additional lab material (page 2) from Prof. Onur Mutlu at CMU (now at ETH) interesting.

        1. Travis Downs says:

          Thanks. The lab material just repeats the idea that the cache cannot be accessed at all if its MSHRs are full – it doesn’t give any reasoning. It is also not clear if in this case it is just a simplification or actually intends to model real hardware (see the preceding item, for example, which is clearly just a simplification).

          It isn’t clear to me what the significance of “taking time” is. Certainly the process of probing the cache takes time, as seen by the core itself. Let’s assume it takes more than one cycle. Why does that mean that hits cannot be satisfied while the MSHRs are full?