12th November 2019, 7 min read

Unrolling your loops can improve branch prediction

10 thoughts on “Unrolling your loops can improve branch prediction”

Michael R. Schmidt says:

November 12, 2019 at 11:38 pm

Also, not branching (unrolling) is faster than branching.
Leonid Boytsov says:

November 14, 2019 at 2:10 am

Interestingly, with simpler stuff, loop unrolling stopped be helpful quite a while ago. Perhaps, a compiler has been doing a better job compared to manual unrolling. What difference does it make here? The fact that there’s a conditional?
1. Wilco says:
  
  November 14, 2019 at 10:41 pm
  
  Loop unrolling significantly improves performance indeed. LLVM unrolls loops even with -O2 nowadays. Several targets are enabling it for -O2 in GCC10, motivated by higher SPEC scores.
  
  Generally unrolling reduces instruction counts and branches, improves load/store addressing (eg. merging 2 loads or stores into a load-pair or store-pair instruction) and enables more instruction level parallellism.
  
  The gain in this case is that removing the loop branch almost doubles the number of bits of the the branch history. This is used like a hash into the branch predictor tables, so more useful bits means it can not only remember more branches but also predict them better – much better in this case!
  1. Leonid Boytsov says:
    
    November 14, 2019 at 11:56 pm
    
    It’s a great point about increased capacity. However, if the loop is unrolled automatically, this would be true too. Apparently, GCC is more conservative in unrolling such loops?
    1. Wilco says:
      
      November 15, 2019 at 10:59 am
      
      Well GCC does not unroll at all at any optimization level if you don’t add -funroll-loops. However this will change with GCC10 for targets like AArch64 and Power which will unroll small loops with -O2 and higher.
      1. Daniel Lemire says:
        
        November 15, 2019 at 1:04 pm
        
        Are you implying that it won’t unroll under x64?
        
        Travis Downs says:
        
        November 15, 2019 at 3:29 pm
        
        Basically yes. GCC does not unroll most loops even at -03. It will unroll if you ask it to with -funroll-loops, or if you use profile guided optimization.
        
        There are some exceptions:
        
        Small compile-time-known tripcount loops may be fully unrolled (no loop remains).
        When loops are vectorized, they are implicitly unrolled since you need 4 or 8 or whatever iterations of the scalar loop to make a single vector iteration (however, unlike clang, the vector body isn’t further unrolled).
        
        Leonid Boytsov says:
        
        November 18, 2019 at 5:37 am
        
        Travis, great observation. Ok, now I see why some loops are unrolled (I was recently interested primarily in those that CAN be vectorized). Clearly, it’s nearly impossible to vectorize if the loop has conditionals. But, if you unroll these, you get better branch prediction (great to know).
        
        Wilco says:
        
        November 15, 2019 at 9:11 pm
        
        Indeed, GCC’s x86 developers seem opposed to the idea of unrolling eventhough there is clear evidence it helps modern cores…
        
        Note eventhough LLVM unrolls loops by default, it’s not able to unroll this particular loop, not even with -funroll-all-loops, while GCC does it with -funroll-loops.
        
        Travis Downs says:
        
        November 20, 2019 at 12:27 am
        
        There is clear evidence that unrolling some loops helps on modern codes.
        
        The evidence is less clear that aggressively unrolling all loops helps.
        
        Almost every loop is helped, in isolation, by unrolling. Many loops hurt, when in a real application, by unrolling.
        
        I don’t think the compiler vendors can solve that compromise: there is no absolutely correct answer. clang errs one way, gcc another. There is no happy medium because the right answer for a single function can be different depending on the application it is ultimately compiled into.
        
        Only full LTO combined with full and accurate PGO can solve this, so we might as well give on up that dream for now. We have to rely on developers to annotate key loops.