Daniel Lemire's blog

, 14 min read

Greater speed in memory-bound graph algorithms with just straight C code

17 thoughts on “Greater speed in memory-bound graph algorithms with just straight C code”

  1. Semih Salihoglu says:

    This is getting more interesting. Which optimization is kicking in here? Is it the processor sending multiple requests to the memory before doing any computation and that capturing some kind of locality?

    Also should this also work in Java? I wrote this Java code quickly that does a 7 step BFS traversal starting from a random source doing it 1-by-1, or by your 8-by-8 trick. The 8-by-8 trick doesn’t seem to help on my Mac OS. I’m using Java 8. I didn’t check your code in detail, I just wrote it based on the pseudocode.

    1. Semih Salihoglu says:
    2. Which optimization is kicking in here? Is it the processor sending multiple requests to the memory before doing any computation and that capturing some kind of locality?

      Exactly. The idea is that you can replace a software prefetch by a data access that is “ahead of time”. This data access will get stuck, but the processor will work for you, trying to fill it in.

      I should warn you that the code above is not optimal. It is a hack to prove the concept.

      Also should this also work in Java?


      I chose C because it will allow me to do things that are not possible in Java. (I am not nearly done!) But what I just did is language independent.

      I’ll look at your Java code. (Next week?)

      1. Frank Tetzel says:

        Your data access is not really “ahead-of-time”, but out-of-order execution kicks in as the accesses in the batch are independent.

        There is prior work doing very similar things with prefetching as out-of-order execution has its limits (window size). But I’m not aware of anybody doing it for graph traversals.
        All of them utilize memory parallelism to reduce or hide memory latency such that the cache miss penalty is less severe.

        group prefetching
        Asynchronous memory access chaining
        Interleaving with Coroutines

        1. Travis Downs says:

          I would be surprised if techniques like this to include MLP on graph algorithms aren’t widely used in the industry. I have looked at very few graph traversal algorithms, but one I did look at a long time ago: the mark part of the JDK’s mark & sweep garbage collector definitely used a strategy to ensure multiple requests were in flight at once: the nodes to visit were kept in a queue, and prefetch requests were issued to the N elements at the head of the queue, to try to keep about N requests in flight at once.

          I remember reading it way back then and thinking “that’s clever!”. If you were really sure you were almost always missing to DRAM, you might even apply other tricks such as sorting the nodes in the queue so that they have a more linear access pattern, or visiting them in an order that was otherwise cache or memory controller friendly.

          1. It depends what you mean by “the industry”. There is related published work in the field of garbage collection, and I expect state-of-the-art garbage collectors to optimize accordingly.

            1. Travis Downs says:

              By “industry” I just mean the collection of people writing software for use, whether commercial, open source, internal use, whatever. More or less as opposed to “in the literature” and/or academia.

              Above there was some discussion of whether this idea had been explored, and a comment that its applicability to graphs perhaps hadn’t been explored. My claim then is that this has almost certainly been repeatedly explored “in the industry” (i.e., software about which no paper is written, or at least no paper exploring this issue) since one of the few graph algorithms I looked at long ago already used the trick of explicitly managing a queue of future memory accesses and prefetches to get MLP.

              Perhaps that “trick” originated in a GC paper, but in my experience that is not the normal direction of flow for such O(1) (i.e., not changing the complexity) tricks.

  2. I’m not surprised. Algorithms textbooks assume that the difference between a machine with 220 instructions in flight at a time and 1 instruction in flight is O(1), which of course it is. 🙂 However, there is a huge speedup that can be had by these sort of transformations if you can break dependencies. I’m working on a Random Forest implementation off-and-on and all the first-order issues I’m having with it are identical to what you discuss, although a binary tree is way easier to deal with than an arbitrary graph and more amenable to optimizations. It is interesting that there is way more work on naive parallelism of this sort of thing (“let’s chuck 8 cores at it”) than there is extracting the parallelism latent in one core. Of course, if you get too good at extracting per-core parallelism you will wind up potentially saturating uncore resources like L3 or the memory bus. But I have plenty of algorithms whose naive versions really do nothing but sit around and wait for L2 all day, and max-ing out L2 utilization is just fine as far as core scaling goes.

  3. Scott Hess says:

    The nice thing about an explicit prefetch version is that I, the reader, look at it and immediately know “This is designed to target a specific memory hierarchy”. The groups-of-8 version certainly can have a comment which describes why it is that way, but in my experience the more likely thing you’ll find is a piece of code which feels like it’s more complicated than it needs to be, without any explanation of how or why.

    1. I don’t disagree but the approach I have taken here is probably both suboptimal and overly complicated. I think you can get both better speed and simpler code with some effort.

  4. I’m thinking prefetch and gather are both potentially interesting optimization areas to look at, even though I have never had a particularly fun time with gather. Both address the problem of running out of places to stash data items fetched in a wide pipeline (in radically different ways).

  5. wmu says:

    I use a random graph containing 10 million nodes and where each node has degree 16 (meaning that each node has 16 neighbours).

    I’d love to see results with less regular data, when the degree varies.

    1. I’d like to move this to real data, but there is more at the implementation side of things that could be done.

  6. Keith Burns says:

    VPP has been doing this for 15 years. Well worth a look for some brilliant performant patterns


    1. Well worth a look for some brilliant performant patterns

      Can you point to some of these patterns?

  7. Marwan Burelle says:


    Some years ago, we’ve done tests that somehow may be in line with this.

    We were trying to speed-up diameter algorithms (which involve a lot of breadth first traversal) and find out that just renumbering the graph with the order of a bfs can have huge impact. Our guess was that it increases locality. Unfortunately we had other projects more important …

  8. cwl says:

    Will this work when using openmp ?