Daniel Lemire's blog

Mispredicted branches can multiply your running times

28 thoughts on “Mispredicted branches can multiply your running times”

  1. Janos says:

    Good article!
    I think the second random in the second and third code examples should be replaced by val.

    1. Thanks.

  2. Phil says:

    But in the second one, the last value in the out[] array could be odd, no?

    1. LB says:

      You mean it could be even? If so, yes, but index won’t have been advanced so you will know to ignore it.

  3. Evan says:

    Removing branches can be pretty tricky. I’d be interested to hear your thoughts on __builtin_expect / __builtin_expect_with_probability / __builtin_unpredictable if you have any.

    I know MSVC has historically refused to implement anything similar, but with [[likely]] and [[unlikely]] coming in C++2a (which I’m guessing MSVC will support in VS 2021) it should be possible to have something portable to most modern compilers.

    1. If the branch is predictable and you want to tell the compiler so, this can help, because the compiler can reorder the instructions so that the likely path sits close to the stream of instructions. I have never seen huge gains from this approach, but I have measured some worthwhile further gains… after optimizing almost everything else.

      However, if the branch is unpredictable, and there is no way to remove the branching, then what can the compiler do? I do not know.
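
      For illustration, a minimal sketch (hypothetical function, not from the post) showing both spellings of the hint:

      // Minimal sketch: __builtin_expect is a GCC/Clang builtin; [[unlikely]] is C++20.
      // Both only influence code layout; neither trains the branch predictor.
      long process(long x) {
        if (__builtin_expect(x < 0, 0)) { // tell the compiler x < 0 is rare
          return -x;                      // cold path, moved off the hot layout
        }
        return x + 1;                     // hot fall-through path
      }

      long process_cpp20(long x) {
        if (x < 0) [[unlikely]] {         // same hint, standard C++20 syntax
          return -x;
        }
        return x + 1;
      }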

      1. Evan says:

        That pretty much matches up with my experience. IIRC I’ve only seen around 1-2% improvements on x86/x86_64 in real code.

        It’s just so easy, though, especially compared to figuring out how to make something branchless (if possible). It’s also pretty handy for error-checking macros, since the improvement gets replicated all over the place for free.
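
        For instance, a hypothetical sketch of such a macro (not Evan’s actual code):

        // Hypothetical error-checking macro: the hint moves the error-return
        // path out of the hot code layout at every use site, for free.
        #define CHECK(cond, err)                  \
          do {                                    \
            if (__builtin_expect(!(cond), 0)) {   \
              return (err);                       \
            }                                     \
          } while (0)

        int parse_header(const char *buf, int len) {
          CHECK(buf != 0, -1);   // rare failure, laid out off the hot path
          CHECK(len >= 4, -1);
          return buf[0];         // hot path falls straight through
        }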

        1. Travis Downs says:

          The likely/unlikely hints haven’t, in principle or in practice, helped much with branch prediction problems like the one described here, because they indicate branches that are likely to go a specific way, hence “predictable”. At best, they will organize the code so the “fall through” path is the likely-taken one, as Daniel mentions, which helps for a variety of reasons. This often helps a bit, and you could certainly see some small loops get rearranged for a 100% benefit (i.e., doubled in speed), but that would be due to front-end effects, not branch prediction per se.

          __builtin_unpredictable, on the other hand, could have a big effect here, since in practice it says “try harder to make this branch-free”. Compilers can make things branch-free far more often than they do; they are reluctant to, because most branches are predictable and a branch-free transformation often slows things down for predictable branches. So they are conservative, choosing branch-free code only for the simplest cases or where some (IMO often dubious) heuristic indicates it is worthwhile (e.g., PGO, although that’s not really in the dubious category). Empirically, I have not seen gcc insert more than 2 cmov instructions in order to remove one branch.

          Hopefully __builtin_unpredictable can serve as a strong enough hint to make something branch-free even when it costs more than that (more than a couple of cmov instructions).
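
          For reference, a minimal sketch of how the Clang-only builtin is used (the function and data are hypothetical; GCC does not implement __builtin_unpredictable):

          // __builtin_unpredictable wraps a condition to say it has no pattern,
          // nudging Clang toward a branch-free form (e.g., cmov) where possible.
          #include <stddef.h>
          #include <stdint.h>

          uint64_t count_odd(const uint64_t *vals, size_t n) {
            uint64_t count = 0;
            for (size_t i = 0; i < n; i++) {
              if (__builtin_unpredictable((vals[i] & 1) == 1)) {
                count++;
              }
            }
            return count;
          }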

    2. Yeah, likely/unlikely hints do not affect branch prediction (except perhaps as a side effect).

      You can see examples where they do help here: https://easyperf.net/blog/2019/03/27/Machine-code-layout-optimizatoins#basic-block-placement and here: https://easyperf.net/blog/2019/04/10/Performance-analysis-and-tuning-contest-2#improving-machine-block-placement-15

  4. Yes, compilers can’t do such tricks, because those two versions have different observable behavior. The version where the stores to memory are guarded by an if statement can’t touch the memory of even elements. I tend to think this also prevents vectorization of the loop. Daniel, you probably checked this, but still: is there an unexpected difference in the generated assembly?

    1. Can you elaborate on “unexpected difference in the generated assembly”?

      1. Like, one of the versions is vectorized, the other is not.

        1. I use GNU GCC with -O2, so there is no autovectorization.

  5. Wilco says:

    “For example, branches can sometimes be replaced by ‘conditional moves’ or other arithmetic tricks. However, there are tricks that compilers cannot use safely.”

    Actually many compilers have been doing exactly this routinely and automatically for many decades! GCC generates a branchless sequence for the example on Arm by default: https://www.godbolt.org/z/ZemW4j

    It can give major speedups in common cases, especially on simpler, in-order cores which don’t have advanced (read: complex and huge) branch predictors. On advanced CPUs the gain is more modest, a few percent on average.

    This particular example uses a conditional store, which is non-trivial if you don’t have full conditional execution like Arm does. Compilers can safely emit such a store using a conditional move:

    *(cond ? p : &dummy_var) = x;

    Loads can be done in the same way – so it’s always possible to remove branches from code if there is at least one conditional instruction (it’s not always faster though – that’s the hard part!).

    As for giving branch and inlining hints to compilers: only worry about that for highly performance-critical functions where it will help. It should be based on execution statistics of real code, not a guess, as guesses are often incorrect. I’ve used these hints in GLIBC in the fast paths of malloc/free and in highly optimized string and math functions, with significant gains. However, it can be counterproductive if you don’t know what you’re doing, or if you forget to disassemble your code to verify that the compiler optimized it the way you expected.
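
    Spelled out, the idiom applied to the post’s loop might look like this (a hedged sketch; collect_odd, rng, and the dummy variable are illustrative, not from the post):

    // Every iteration stores somewhere, so there is no branch to mispredict;
    // even values land in a dummy sink and index simply does not advance.
    #include <stddef.h>
    #include <stdint.h>

    size_t collect_odd(uint64_t *out, size_t howmany, uint64_t (*rng)(void)) {
      uint64_t dummy;                          // sink for the values we do not keep
      size_t index = 0;
      while (howmany != 0) {
        uint64_t val = rng();
        int keep = val & 1;                    // 1 if odd, 0 if even
        *(keep ? &out[index] : &dummy) = val;  // conditional store, no branch needed
        index += keep;                         // advance only past kept values
        howmany--;
      }
      return index;                            // number of odd values written
    }

    (Per the advice above, it’s worth disassembling to verify the ternary actually became a conditional move rather than a branch.)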

    1. foobar says:

      The method of storing to a dummy memory location is a new one to me! I think it would translate into 1-3 extra decoded instructions (executed uops) per store on Intel, which would be well worth it on heavily unpredictable code, unless it caused register spilling or the memory location became a dependency in a tight loop.

      Sadly, it sounds like you can’t really make the compiler do it. I mean, you can hardcode this functionality in your code, but you can’t make the compiler choose between it and a branchy variant depending on which is deemed more profitable on the basis of static analysis / profiling / hints.

    2. foobar says:

      Interestingly, I think the compiler could perform the dummy-pointer optimization without violating any rule in the C standard. Just have the runtime allocate the dummy location (either on the stack, in which case it could be a stack location already addressable through a register, or in a dedicated thread-local memory mapping), and choose either the real or the dummy address for the store. I don’t think this would violate any rules any more than something like register spilling or ABI-specific register handling does.

      Of course, another question that arises from all this is why x86 doesn’t have a (non-vector) conditional store instruction for this purpose…

      1. Travis Downs says:

        The compiler could definitely do it, but as you’ve mentioned I’ve never seen it. Usually the compiler prefers branches. It knows how to remove some types of branches for “mathematical” expressions using conditional moves, but I’ve never seen it use dummy stores (or dummy loads, a similar technique) – although there’s nothing preventing it in the standard. Using a stack location would be fine.

        You can use the idiom by hand, though (although maybe the optimizer will see through it and revert it).

        1. foobar says:

          I think one reason such an optimization is missing is that “unpredictable” hints are a rather recent thing, and I haven’t really heard of branch-prediction-profile-driven optimizations. Also, this solution doesn’t arise spontaneously as a combination of other optimizations.

          I wonder how hard it would be to implement this in a modern compiler. Compilers already emit conditional stores on architectures that support them, so it shouldn’t be awfully hard to implement the variant with a fixed dummy destination; somewhat harder to reuse pointers to valid but unused locations already sitting in registers.

          1. Travis Downs says:

            One of the main optimizations that PGO performs is branch related, i.e., measuring the taken percentage of each branch, which affects code generation. I believe it can, for example, result in a cmov where no combination of hints would have produced one without PGO.

            Agree with everything else.

  6. Orson Peters says:

    There are multiple bugs in the faster version. Firstly, it doesn’t output howmany elements; it can output far fewer, or even none. Secondly, the last element it outputs can be even, since it never gets overwritten.

    Finally, in this toy problem the real answer is simply to set out[index] = random() | 1; to get uniform odd numbers.
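
    Spelled out as a loop (an illustrative sketch; fill_odd and rng are hypothetical names):

    #include <stddef.h>
    #include <stdint.h>

    // One store per iteration, no data-dependent branch, exactly howmany outputs.
    void fill_odd(uint64_t *out, size_t howmany, uint64_t (*rng)(void)) {
      for (size_t i = 0; i < howmany; i++) {
        out[i] = rng() | 1;   // forcing bit 0 makes every output odd
      }
    }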

    1. These are not bugs. It is never specified that I output “howmany” values.

      1. Orson Peters says:

        I… eh… what? That is simply not true if we cross-reference your code with the numbers you report. You report “cycles per integer”, and you compute this (for all algorithms) as total_cycles / howmany, regardless of how many integers the algorithm actually output.

        Furthermore, we are looking at performance optimizations. If your optimized implementation isn’t functionally the same as the non-optimized one, then what is the point of it all?

        Worse, the bug in your optimized implementation causes it to simply do less work: of course it is faster, it just decides to stop halfway through! About half of the random values will be odd, so at the end you can expect your optimized, buggy implementation to have output about howmany/2 elements, roughly half as many as requested.

        This also makes your measurements bogus: the 3.8 cycles per integer you report for the optimized implementation is wrong. It is at least twice that, perhaps more.

        1. You misread the post. There is no bug.

          1. Orson Peters says:

            Alright, final attempt to communicate the issue, by making the example more extreme. Let’s modify your code to only generate random integers in [0, 5) instead of odd numbers.

            while (howmany != 0) {
              val = random();
              out[index] = val;
              index += val < 5; // MODIFIED
              howmany--;
            }

            Will you still report “3.8 cycles per integer” for this random number generation code, even though it only successfully outputs (assuming val is uint64_t like in your actual code) an integer 5/2^64 of the time? You’ll have to run this code for hundreds of hours with an incredibly large howmany before it’ll even output a single number.

            1. Yes. It is the number of iterations, that is, the number of branches, that matters for this experiment. In fact, you could modify this problem to compute the last odd integer generated, or something else. Actually, I pretty much assume that writing things out is free. If you read my blog post, at no point do I talk in terms of the size of the output.

              The point of this blog post is in the title “Mispredicted branches can multiply your running times”. It is not about generating random numbers quickly. That’s also an interesting problem but, as you point out, generating odd integers is certainly not interesting.

              If you look through the content of my blog, you will find work on random number generation and performance. This particular blog post is not about that.

  7. Pete DiMarco says:

    This will probably sound sacrilegious, but isn’t the real lesson here that branch prediction is a poor method of optimization? After all, branch prediction gave us the Spectre vulnerability. Wouldn’t it be better to either (1) explore both branches at the same time using two isolated cores/pipelines/compute units, and then discard the one that is invalid, or (2) avoid the problem entirely by using a faster (photonic?) memory bus?

    I realize that may sound a bit naive, but surely it’s better than doing things like adding neural networks to CPUs for branch prediction…

    1. Exploring multiple paths at once can quickly become prohibitive… the number of code paths grows exponentially… so it is not feasible over a long window.

      I don’t think that the problem is memory speed per se. The problem can occur even if you are just in L1 cache.

      1. Pete DiMarco says:

        Thanks for the reply. I’m a software engineer, so my knowledge of the subject is limited. I just read this article, which seems to be a nice introduction to branch prediction:

        https://danluu.com/branch-prediction/

        Maybe the next step is to get rid of pipelines altogether? What if we had pools of pipeline stages and dynamically chained them together during thread execution? I’m imagining something analogous to how microservices are used to build distributed applications. (Of course, ideas are easy. Implementation is the hard part.)