Greater speed in memory-bound graph algorithms with just straight C code
17 thoughts on “Greater speed in memory-bound graph algorithms with just straight C code”
This is getting more interesting. Which optimization is kicking in here? Is it the processor sending multiple requests to memory before doing any computation, thereby capturing some kind of locality?
Should this also work in Java? I quickly wrote Java code that does a 7-step BFS traversal starting from a random source, either 1-by-1 or with your 8-by-8 trick. The 8-by-8 trick doesn’t seem to help on my macOS machine; I’m using Java 8. I didn’t check your code in detail; I just wrote mine based on the pseudocode.
The link got lost for my code: https://github.com/semihsalihoglu-uw/misc/blob/master/java-bfs-opt/ScratchPad.java
Exactly. The idea is that you can replace a software prefetch with a data access issued “ahead of time”. That access will get stuck, but the processor will work for you in the meantime, trying to fill it in.
I should warn you that the code above is not optimal. It is a hack to prove the concept.
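For concreteness, here is roughly what the batched-access idea looks like in C (a hypothetical sketch, not the post's actual code; it assumes a CSR-style `offsets` array and a `frontier` array of vertex IDs):

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the degrees of the frontier vertices, 8 at a time. The 8 loads
 * into `offsets` are independent of one another, so the processor can
 * keep several cache misses in flight instead of paying them serially. */
size_t sum_degrees(const uint32_t *offsets, const uint32_t *frontier,
                   size_t n) {
    size_t total = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint32_t d[8];
        for (int j = 0; j < 8; j++) {
            uint32_t v = frontier[i + j];
            /* independent accesses: no load depends on the previous one */
            d[j] = offsets[v + 1] - offsets[v];
        }
        for (int j = 0; j < 8; j++) total += d[j];
    }
    for (; i < n; i++) { /* leftover vertices, one at a time */
        uint32_t v = frontier[i];
        total += offsets[v + 1] - offsets[v];
    }
    return total;
}
```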
Yes.
I chose C because it will allow me to do things that are not possible in Java. (I am not nearly done!) But what I just did is language independent.
I’ll look at your Java code. (Next week?)
Your data access is not really “ahead of time”; rather, out-of-order execution kicks in because the accesses in the batch are independent.
There is prior work doing very similar things with prefetching, since out-of-order execution has its limits (the instruction-window size). But I’m not aware of anybody doing it for graph traversals.
All of them exploit memory-level parallelism to reduce or hide memory latency, so that the cache-miss penalty is less severe:
Group prefetching: https://dl.acm.org/citation.cfm?id=1272747 (a sketch of this pattern follows the list)
Asynchronous memory access chaining: http://www.vldb.org/pvldb/vol9/p252-kocberber.pdf
Interleaving with coroutines: http://www.vldb.org/pvldb/vol11/p230-psaropoulos.pdf
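Roughly, the group-prefetching pattern looks like this in C (a hypothetical hash-table probe with made-up names; `__builtin_prefetch` is the GCC/Clang builtin):

```c
#include <stddef.h>
#include <stdint.h>

#define GROUP 8

/* Stage 1 issues prefetches for a whole group of probes; stage 2 then
 * does the dependent reads. The misses overlap across the group instead
 * of being paid one after another. */
void probe_all(const uint32_t *table, size_t mask,
               const uint32_t *keys, uint32_t *out, size_t n) {
    for (size_t i = 0; i + GROUP <= n; i += GROUP) {
        size_t slot[GROUP];
        for (int j = 0; j < GROUP; j++) {     /* stage 1: compute + prefetch */
            slot[j] = (size_t)(keys[i + j] * 2654435761u) & mask;
            __builtin_prefetch(&table[slot[j]], 0, 1);
        }
        for (int j = 0; j < GROUP; j++)       /* stage 2: use the data */
            out[i + j] = table[slot[j]];
    }
    /* tail elements omitted for brevity */
}
```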
I would be surprised if techniques like this for exploiting MLP in graph algorithms weren’t widely used in industry. I have looked at very few graph traversal implementations, but one I did look at a long time ago, the mark phase of the JDK’s mark-and-sweep garbage collector, definitely used a strategy to ensure multiple requests were in flight at once: the nodes to visit were kept in a queue, and prefetch requests were issued for the N elements at the head of the queue, to try to keep about N requests in flight.
I remember reading it back then and thinking “that’s clever!”. If you were really sure you were almost always missing all the way to DRAM, you might even apply other tricks, such as sorting the nodes in the queue so that they have a more linear access pattern, or visiting them in an order that is otherwise cache- or memory-controller-friendly.
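Something like the following, perhaps (a hypothetical sketch of that strategy, not the JDK's actual code; the queue is a simple array here):

```c
#include <stddef.h>

#define AHEAD 8  /* number of queue entries to keep prefetched */

typedef struct node {
    int marked;
    size_t nchildren;
    struct node **children;
} node;

/* Nodes to visit sit in a FIFO; we prefetch the entry AHEAD slots past
 * the head, so roughly AHEAD memory requests stay in flight while we
 * mark. The queue is assumed large enough; a real collector would
 * handle overflow. */
void mark(node **queue, size_t head, size_t tail) {
    while (head < tail) {
        if (head + AHEAD < tail)
            __builtin_prefetch(queue[head + AHEAD], /*write*/1, /*locality*/1);
        node *n = queue[head++];
        if (n->marked) continue;   /* a node may be queued more than once */
        n->marked = 1;
        for (size_t i = 0; i < n->nchildren; i++)
            queue[tail++] = n->children[i];
    }
}
```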
It depends what you mean by “the industry”. There is related published work in the field of garbage collection, and I expect state-of-the-art garbage collectors to optimize accordingly.
By “industry” I just mean the collection of people writing software for use, whether commercial, open source, internal use, whatever. More or less as opposed to “in the literature” and/or academia.
Above there was some discussion of whether this idea had been explored, and a comment that its applicability to graphs perhaps hadn’t been. My claim is that it has almost certainly been explored repeatedly “in the industry” (i.e., in software about which no paper is written, or at least none exploring this issue), since one of the few graph algorithms I looked at long ago already used the trick of explicitly managing a queue of future memory accesses and prefetches to get MLP.
Perhaps that “trick” originated in a GC paper, but in my experience that is not the normal direction of flow for such O(1) (i.e., not changing the complexity) tricks.
I’m not surprised. Algorithms textbooks assume that the difference between a machine with 220 instructions in flight at a time and one with 1 instruction in flight is O(1), which of course it is. 🙂 However, there is a huge speedup to be had from these sorts of transformations if you can break dependencies.

I’m working on a Random Forest implementation off and on, and all the first-order issues I’m having with it are identical to what you discuss, although a binary tree is far easier to deal with than an arbitrary graph and more amenable to optimization. It is interesting that there is far more work on naive parallelism of this sort of thing (“let’s chuck 8 cores at it”) than on extracting the parallelism latent in one core.

Of course, if you get too good at extracting per-core parallelism, you may end up saturating uncore resources such as L3 or the memory bus. But I have plenty of algorithms whose naive versions really do nothing but sit around and wait for L2 all day, and maxing out L2 utilization is just fine as far as core scaling goes.
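For instance, on a binary tree the dependency breaking might look like this (a hypothetical sketch, assuming fixed-depth trees stored in a flat node array; all names are made up):

```c
#include <stddef.h>

#define WIDTH 8  /* number of independent queries descended in lockstep */

typedef struct { float threshold; int feature; int left, right; } tnode;

/* Descend the tree for WIDTH queries at once, one level per outer step.
 * The WIDTH node loads at each level are independent across queries, so
 * their cache misses can overlap. `stride` is features per input row. */
void predict_batch(const tnode *nodes, const float *rows, size_t stride,
                   int depth, int *leaf) {
    int cur[WIDTH] = {0};                 /* all queries start at the root */
    for (int level = 0; level < depth; level++) {
        for (int j = 0; j < WIDTH; j++) {
            const tnode *n = &nodes[cur[j]];
            float x = rows[j * stride + (size_t)n->feature];
            cur[j] = (x < n->threshold) ? n->left : n->right;
        }
    }
    for (int j = 0; j < WIDTH; j++) leaf[j] = cur[j];
}
```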
The nice thing about an explicit prefetch version is that I, the reader, look at it and immediately know “This is designed to target a specific memory hierarchy”. The groups-of-8 version certainly can have a comment which describes why it is that way, but in my experience the more likely thing you’ll find is a piece of code which feels like it’s more complicated than it needs to be, without any explanation of how or why.
I don’t disagree but the approach I have taken here is probably both suboptimal and overly complicated. I think you can get both better speed and simpler code with some effort.
I’m thinking that prefetch and gather are both potentially interesting optimizations to look at, even though I have never had a particularly fun time with gather. Both address, in radically different ways, the problem of running out of places to stash data items fetched in a wide pipeline.
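For what it’s worth, here is what a gather looks like in practice, a minimal sketch using the AVX2 intrinsic (requires a CPU with AVX2 and compilation with -mavx2):

```c
#include <immintrin.h>

/* One gather instruction issues 8 indexed loads, another way to get
 * several loads in flight at once. */
void gather8(const int *base, const int *idx, int *out) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4: indices are in units of 4-byte ints */
    __m256i v = _mm256_i32gather_epi32(base, vidx, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}
```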
I’d love to see results with less regular data, where the degree varies.
I’d like to move this to real data, but there is more that could be done on the implementation side first.
VPP has been doing this for 15 years. Well worth a look for some brilliant, performant patterns.
Wiki.fd.io
Can you point to some of these patterns?
Interesting!
Some years ago, we did tests that may be somewhat in line with this.
We were trying to speed up diameter algorithms (which involve a lot of breadth-first traversal) and found that just renumbering the graph in BFS order can have a huge impact. Our guess was that it increases locality. Unfortunately, we had other, more important projects …
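For what it’s worth, a minimal sketch of that renumbering (hypothetical code, assuming a CSR graph and a single connected component):

```c
#include <stdint.h>
#include <stdlib.h>

/* Assign new vertex IDs in BFS visit order, so vertices traversed
 * together end up close together in memory, improving locality for
 * later traversals. */
void bfs_renumber(const uint32_t *offsets, const uint32_t *edges,
                  uint32_t nvertices, uint32_t source, uint32_t *newid) {
    uint32_t *queue = malloc(nvertices * sizeof(uint32_t));
    uint8_t *seen = calloc(nvertices, 1);
    size_t head = 0, tail = 0;
    uint32_t next = 0;
    seen[source] = 1;
    queue[tail++] = source;
    while (head < tail) {
        uint32_t v = queue[head++];
        newid[v] = next++;                  /* label in visit order */
        for (uint32_t e = offsets[v]; e < offsets[v + 1]; e++) {
            uint32_t w = edges[e];
            if (!seen[w]) { seen[w] = 1; queue[tail++] = w; }
        }
    }
    free(queue);
    free(seen);
}
```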
Will this work when using OpenMP?