Thanks Daniel for the post! Quite interesting to see this potential. An immediate thing to see is what happens on a non-random graph. The benefits might go down. The higher the randomness of connections, the more prefetching should probably help, because the cache locality of the regular (non-prefetching) traversal should be the worst when the connections are totally random. I should play around with your code myself.
There are several algorithmic and data-structure-related optimizations for shortest path queries to speed up the vanilla BFS-based solution you started with. Most of these optimizations are trying to address the same problem though: batch graph computations are often memory bound. For example, there are smart ways of assigning node IDs (e.g., according to a Hilbert curve), the compressed sparse row (CSR) format for storing the edges, or partitioning the neighbors of each node so that each partition fits in the lowest-level CPU caches. These optimizations do not change the total number of edges BFS will read but instead try to increase the CPU cache hit rate when reading the edges from memory. There are also several algorithmic optimizations for the single-pair shortest-path problem you took on, i.e., when the query has a source and a destination. One well-understood one is to do a bidirectional BFS, one from the source and one from the destination. This one, for example, directly decreases the number of edges BFS reads.
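To make the CSR idea concrete, here is a minimal sketch (the type and field names are illustrative, not taken from any particular system): the neighbors of node v sit contiguously in one big edge array, so scanning them is a single sequential pass.

```cpp
#include <cstdint>
#include <vector>

// Minimal compressed sparse row (CSR) graph, for illustration only.
struct CSRGraph {
  std::vector<std::uint64_t> offsets;  // offsets[v]..offsets[v+1] index into edges
  std::vector<std::uint32_t> edges;    // neighbor IDs, stored contiguously per node

  // Visit the neighbors of v with one contiguous, cache-friendly scan.
  template <typename F>
  void for_each_neighbor(std::uint32_t v, F f) const {
    for (std::uint64_t i = offsets[v]; i < offsets[v + 1]; ++i) {
      f(edges[i]);
    }
  }
};
```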
I see these optimizations in publications and integrated into many graph processing systems. However, I don’t think prefetching-like processor-level optimizations are as well studied (nor integrated into systems I study), so work here would be quite interesting. I’m curious which low-level optimizations are available that can enhance other existing optimizations.
An immediate thing to see is what happens on a non-random graph. The benefits might go down.
True. But I also expect that with larger graphs, larger gains are possible. Of course, the challenges also increase.
The higher the randomness of connections, the more prefetching should probably help, because the cache locality of the regular (non-prefetching) traversal should be the worst when the connections are totally random. I should play around with your code myself.
One definitely wants to use real graphs.
There are also several algorithmic optimizations for the single-pair shortest-path problem you took on, i.e., when the query has a source and a destination. One well-understood one is to do a bidirectional BFS, one from the source and one from the destination. This one, for example, directly decreases the number of edges BFS reads.
I remembered this approach from our chat, but I deliberately went for something naive.
However, I don’t think prefetching-like processor-level optimizations are as well studied (nor integrated into systems I study), so work here would be quite interesting. I’m curious which low-level optimizations are available that can enhance other existing optimizations.
There is probably quite a bit of optimization possible above and beyond purely algorithmic gains. But it is probably not as simple as spraying prefetch instructions in the code (though, if done right, it might be better than nothing).
Frank Tetzel says:
One has to be very careful when using the term “memory bound” in a graph context, as it covers two very different situations: bandwidth bound and latency bound. Graph traversals like BFS are latency bound, which is why prefetching helps. PageRank, on the other hand, is usually bandwidth bound.
I actually doubt that you can rewrite a graph traversal in such a way that current hardware prefetchers can help. They are optimized for sequential and strided accesses. The accesses of graph traversals are too irregular. As already mentioned, CSR and node reordering can improve data locality.
Other proposals add a graph prefetcher in hardware:
http://www-dyn.cl.cam.ac.uk/~tmj32/wordpress/hardware-graph-prefetchers/
This is an interesting point. Whether you’d be latency or bandwidth bound, even in BFS, will depend on the implementation, specifically on your parallelism. Say you have ten threads (say, on a single multicore machine) running a parallel BFS from a single source to traverse a large graph; I would expect you’ll be bandwidth bound. A single-threaded implementation might be less bandwidth bound, so prefetchers might help more there. So, yes, if we were to parallelize Daniel’s code, the benefits of prefetching will likely go down.
So, yes, if we were to parallelize Daniel’s code, the benefits of prefetching will likely go down.
… with the caveat stated in my blog post that memory access is a shared resource on multicore systems…
Travis Downs says:
Even the concepts of “latency bound” and “bandwidth bound” are a bit fuzzy when it comes to modern hardware. The DRAM configuration will have a certain maximum bandwidth which is a simple product of the data transfer size and rate (e.g., DDR4-2400) and number of memory channels. Maybe this is 50 GB/s on your system.
An algorithm that on the given hardware would otherwise achieve more than 50 GB/s could definitely be called DRAM bandwidth limited: you are exploiting the RAM bandwidth to its limits. In principle, a reduction in memory latency wouldn’t help at all (but a RAM bandwidth increase would).
At the other end of the spectrum, you have a classic “pointer chasing” memory latency bound algorithm such as iterating through a linked list (assume the nodes are spread around randomly in memory so prefetching doesn’t let you cheat the latency). This always has an MLP factor of 1 (exactly one outstanding request at any time). The performance of this algorithm is entirely dependent on the latency: if you cut the latency in half, the runtime is cut in half as well.
The latency bound algorithms as above are usually easy to identify statically by examining the data dependency graph: they are the ones where the address for any memory access depends on the result of the prior access (or as a relaxation some prior access). You aren’t really restricted to MLP 1 algorithms either: consider simultaneously iterating from both ends of a doubly-linked list towards the center: this has an MLP of 2, but is still in some sense entirely latency bound: if you halve the latency you again halve the runtime.
We might then intuitively define bandwidth bound algorithms as those where memory accesses aren’t serially dependent at all, i.e., “infinite MLP”. Essentially, the memory addresses are calculable without involving prior memory accesses (i.e., in the data dependence graph, the memory access nodes are all reachable without passing through any other memory access nodes). A simple example is summing all the elements in an array, or a vector dot-product, or whatever: all the addresses to load are calculable without any dependence on earlier loads.
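To make the two ends of that spectrum concrete, here is a toy contrast (a sketch, not benchmarking code): the list walk has MLP 1 because each load address comes out of the previous load, while the array sum’s addresses are all computable up front.

```cpp
#include <cstddef>
#include <vector>

struct Node { Node* next; long value; };

// MLP 1: every address depends on the result of the previous load (latency bound).
long sum_list(const Node* head) {
  long sum = 0;
  for (const Node* n = head; n != nullptr; n = n->next) {
    sum += n->value;  // cannot begin loading the next node before n->next arrives
  }
  return sum;
}

// "Infinite" MLP: all addresses are known without any prior load, so the core
// can keep many cache-line misses in flight at once.
long sum_array(const std::vector<long>& a) {
  long sum = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum += a[i];
  }
  return sum;
}
```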
Does the above definition of bandwidth limited algorithms line up with our earlier hardware based definition (an algorithm that is capable of saturating the DRAM interface)? Unfortunately not, at least for single-threaded algorithms on most modern x86 chips (and probably most other high performance chips, but I’m not familiar with the details)!
Modern x86 chips have a limited number of buffers between the core and the memory subsystem. On Intel these are called (Line) Fill Buffers; in the literature they are more generically called MSHRs (miss status holding registers). An MLP 1 algorithm will only ever use one of these at a time. An “infinite MLP” algorithm will probably fill all of them. There are a limited number of these buffers. The key observation is that on most chips, even if all of these buffers are filled, the DRAM bandwidth cannot be reached. Intel chips have 10 such buffers, so on a system with 90 ns memory latency, the maximum achievable bandwidth (ignoring prefetching) is 64 bytes/line * 10 LFBs / (90 ns/line) = ~7.1 GB/s. Yet modern CPUs have DRAM bandwidths of ~20-30 GB/s (dual channel) or 50-60 GB/s (quad channel). So it would take several copies of the above core working in parallel to hit the DRAM bandwidth limit.
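Spelling out that back-of-the-envelope number (using the example figures from the comment above; real values vary by CPU and DRAM configuration):

```cpp
#include <cstdio>

int main() {
  // Example figures only: ~10 line fill buffers per Intel core, 90 ns latency.
  const double bytes_per_line = 64.0;  // one cache line per fill buffer
  const double fill_buffers   = 10.0;  // outstanding misses per core
  const double latency_ns     = 90.0;  // assumed memory latency
  // bytes in flight divided by how long each line occupies a buffer:
  // 64 * 10 / 90 ns ≈ 7.1 bytes/ns ≈ 7.1 GB/s per core, well below DRAM bandwidth.
  std::printf("%.1f GB/s\n", bytes_per_line * fill_buffers / latency_ns);
  return 0;
}
```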
So I would propose something like “concurrency limited” for algorithms which are limited by the number of outstanding requests at the core level, rather than the memory bandwidth.
One might argue that “concurrency limited” versus “bandwidth limited” is a distinction without a difference, but I think it matters. In particular, it implies that the maximum per-core bandwidth is actually directly dependent on the latency: if you cut the latency in half, your runtime is cut in half even for the apparently “bandwidth limited” algorithms: since the occupancy time of each request in the fill buffers is cut in half and so you get twice as much work done. That’s very different than a truly DRAM bandwidth limited algorithm, where memory latency barely matters.
It also matters because it contradicts advice you’ll often see: that parallelizing “memory bandwidth bound” algorithms by running them on multiple cores doesn’t work since memory bandwidth is a shared resource. Well, the fill buffers are not shared resources, so concurrency-limited algorithms do scale when you add more cores, since you get more fill buffers and hence more parallel requests. Of course this scaling stops when you hit the bandwidth wall: then the parallel version of the algorithm becomes bandwidth limited (the MLP factor for each core stays the same at 10, but the observed latency, aka occupancy time, increases so that the DRAM bandwidth limit is respected).
So I think the most useful way to characterize an algorithm, independently of hardware, is to evaluate its MLP factor (maximum theoretical MLP).
Then, to apply this to specific hardware, you determine the HMLP (hardware MLP factor, essentially the number of fill buffers); the actual MLP will be the lower of the algorithm MLP and the HMLP. If the algorithm MLP is lower than the HMLP, we could call the algorithm “latency bound”. If the algorithm MLP is larger than the HMLP, we then also compare the achieved core bandwidth at maximum HMLP (e.g., the ~7.1 GB/s figure calculated above) to the DRAM bandwidth figure. If the HMLP-implied bandwidth is lower (as it is on most x86 chips), we could call the algorithm “concurrency limited”; if it is larger, we could call the algorithm “RAM bandwidth limited”. Note that this evaluation is hardware dependent: any algorithm with an MLP greater than 1 could fall into any of the three categories, depending on the hardware!
This gives the following intuitive “litmus tests” for the three categorizations, based on the effect of three hypothetical hardware changes (where “Helps!” means a direct proportional effect on runtime):
1. Decreasing memory latency.
2. Increasing fill buffer count.
3. Increasing RAM bandwidth.
Latency bound: (1) Helps! (2) Does nothing. (3) Does nothing.
Concurrency bound: (1) Helps! (2) Helps! (3) Does nothing.
RAM bandwidth bound: (1) Does nothing. (2) Does nothing. (3) Helps!
Of course, in practice you can never change (2) without a uarch change, and you can make only limited changes to (1) or (3) – but this is more a way to think about this stuff than a tuning guide.
Which brings me back to the point I originally wanted to make, but which took a long time to set up the prerequisites! It seems to me that BFS is not likely to be latency limited, unless the average out-degree of your graph is very small (close to 1). On most graphs, BFS should be concurrency limited: as long as the current horizon is at least as large as the HMLP, you should get full concurrency and hence be no more latency limited than any other single-core algorithm.
Something like DFS seems more likely to be latency limited, since it is essentially a series of pointer-chasing-like serially dependent loads (of course, a lot depends on the graph shape and especially on how the branch predictor ends up working on your graph).
I almost entirely ignored hardware prefetching in the above. I don’t want to cover it fully because this is long enough, but briefly: prefetching complicates the above but doesn’t change the core conclusions. Hardware prefetching can lead to an apparent increase in the number of fill buffers, but they are still limited. You basically then end up with two different types of “concurrency limited” algorithms: prefetch friendly and prefetch unfriendly: you can use the same basic framework to analyze them, but with different HMLP values.
Travis Downs says:
Those “Helps!” lists showed as numbered in the post preview, but not when I actually posted. Anyways, they line up 1,2,3 with the 3 litmus tests noted above (decreasing latency, increasing fill buffers, increasing RAM BW).
Ting Ye says:
How many buffers does the hardware prefetcher have?
Yongkee says:
I would agree that, in general, BFS is latency bound while PR may be bandwidth bound, particularly as compared to BFS. However, it’s often very hard to simply say so until the algorithm is mapped to real hardware and we see how locality comes into play.
For example [REF], PageRank on Intel SKX and Intel KNL exhibits very different behavior at a certain graph size. If you take a look at Fig. 3(a) and (b), would you still conclude that PR is memory bound even on KNL?
I think we should be careful even when we say something is memory bandwidth bound, as Travis commented.
[REF] http://heirman.net/papers/sc2018.pdf
They are optimized for sequential and strided accesses. The accesses of graph traversals are too irregular.
Based on my understanding, this is the opposite of what prefetch instructions exist for – CPUs are quite good at predicting sequential and strided accesses already, but aren’t (and in many cases, can’t be) good at predicting… unpredictable fetches. The reason CPUs include prefetch instructions is so that the software developer can give a hint to the CPU that something will be needed shortly, so it should opportunistically start loading it. BFS is exactly one of the data access patterns where CPUs will struggle to predict future loads, and it happens to be quite common in garbage collectors.
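For instance, a BFS inner loop can hint the visited flags of neighbors it will touch a few iterations from now using the GCC/Clang __builtin_prefetch builtin. This is only a sketch of the idea: the CSR-like layout, the function name, and PREFETCH_DISTANCE are illustrative, and whether it helps at all depends on the machine and the graph.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t PREFETCH_DISTANCE = 8;  // tuning knob, machine dependent

// Expand one node of the current frontier, prefetching ahead in its adjacency list.
void visit_neighbors(const std::vector<std::uint64_t>& offsets,
                     const std::vector<std::uint32_t>& edges,
                     std::vector<std::uint8_t>& visited,
                     std::vector<std::uint32_t>& next_frontier,
                     std::uint32_t v) {
  const std::uint64_t begin = offsets[v], end = offsets[v + 1];
  for (std::uint64_t i = begin; i < end; ++i) {
    if (i + PREFETCH_DISTANCE < end) {
      // Hint the visited flag of a neighbor we will examine shortly.
      __builtin_prefetch(&visited[edges[i + PREFETCH_DISTANCE]], 0, 1);
    }
    const std::uint32_t u = edges[i];
    if (!visited[u]) {
      visited[u] = 1;
      next_frontier.push_back(u);
    }
  }
}
```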
GC is a similar problem; see Effective Prefetch for Mark-Sweep Garbage Collection (http://users.cecs.anu.edu.au/~steveb/downloads/pdf/pf-ismm-2007.pdf).
Thanks for the reference.
There is prior work: Software Prefetching for Mark-Sweep Garbage Collection (2004).