, 2 min read
Unrolling your loops can improve branch prediction
Modern processors predict branches (e.g., if-then clauses), often many cycles a ahead of time. When predictions are incorrect, the processor has to start again, an expensive process. Adding a hard-to-predict branch can multiply your running time.
Does it mean that you should only care about hard-to-predict branches? Unfortunately no.
In a prior blog post, I showed how adding a predictable branch can degrade the branch prediction of other branches.
Conversely, removing a predictable branch might improve branch predictions. Wilco Dijkstra suggested a way to test it out. A common form of predictable branch is a loop. Though a loop does not look like an if-then clause, it still compiles to a branch. A loop is predictable if it iterates long enough. You can “unroll” loops, that is, reduce the number of iterations by doing more work with each iterations. Unrolling does not just remove a predictable branch, but that it does that among other things.
Let us consider an example. I generate random numbers. I check whether they are odd, and when they are, I write the result to an array. It is a silly example meant to illustrate the effect of branch prediction. Because my random numbers can be generated by a pseudo-random number generator, I can run many trials with the same random number. The more often I repeat the loop, the better the branch prediction gets.
while (howmany != 0) {
uint64_t randomval = rng(howmany + seed);
if ((randomval & 1) == 1)
*out++ = randomval;
howmany--;
}
We can unroll this loop: instead of generating one random number per loop iteration, I generate two. You can also generate four or eight, and so forth. The unrolled loop has fewer branches because each iteration through the loop is an implicit branch.
uint64_t randomval;
while (howmany >= 2) {
randomval = rng(howmany + seed);
if ((randomval & 1) == 1)
*out++ = randomval;
randomval = rng(howmany - 1 + seed);
if ((randomval & 1) == 1)
*out++ = randomval;
howmany-=2;
}
while (howmany != 0) {
uint64_t randomval = rng(howmany + seed);
if ((randomval & 1) == 1)
*out++ = randomval;
howmany--;
}
I implemented this benchmark using 1000 random numbers per trial. I record the number of mispredicted branch per random number generated on different processors.
ARM 64-bit (Skylark):
trial | basic loop | unrolled twice | unrolled four times |
---|---|---|---|
1 | 58% | 57% | 56% |
2 | 48% | 33% | 26% |
3 | 48% | 28% | 21% |
15 | 45% | 18% | 8% |
Intel x64 (Cannon Lake):
trial | basic loop | unrolled twice | unrolled four times |
---|---|---|---|
1 | 53% | 51% | 49% |
2 | 50% | 35% | 30% |
3 | 46% | 20% | 17% |
15 | 40% | 0.5% | 0.4% |