Interestingly, with simpler loops, unrolling stopped being helpful quite a while ago. Perhaps the compiler has been doing a better job than manual unrolling. What difference does it make here? The fact that there’s a conditional?
Wilco says:
Loop unrolling does indeed significantly improve performance. LLVM unrolls loops even at -O2 nowadays. Several targets are enabling it at -O2 in GCC10, motivated by higher SPEC scores.
Generally unrolling reduces instruction counts and branches, improves load/store addressing (e.g. merging 2 loads or stores into a load-pair or store-pair instruction) and enables more instruction-level parallelism.
The gain in this case is that removing the loop branch almost doubles the number of bits of the branch history. This is used like a hash into the branch predictor tables, so more useful bits means it can not only remember more branches but also predict them better – much better in this case!
It’s a great point about increased capacity. However, if the loop is unrolled automatically, this would be true too. Apparently, GCC is more conservative in unrolling such loops?
Wilco says:
Well, GCC does not unroll at all at any optimization level if you don’t add -funroll-loops. However, this will change with GCC10 for targets like AArch64 and Power, which will unroll small loops at -O2 and higher.
Are you implying that it won’t unroll under x64?
Basically yes. GCC does not unroll most loops even at -O3. It will unroll if you ask it to with -funroll-loops, or if you use profile-guided optimization.
There are some exceptions:
Loops with small, compile-time-known trip counts may be fully unrolled (no loop remains).
When loops are vectorized, they are implicitly unrolled, since you need 4 or 8 or however many iterations of the scalar loop to make a single vector iteration (however, unlike clang, GCC doesn’t further unroll the vector body).
Travis, great observation. Ok, now I see why some loops are unrolled (I was recently interested primarily in those that CAN be vectorized). Clearly, it’s nearly impossible to vectorize if the loop has conditionals. But, if you unroll these, you get better branch prediction (great to know).
Wilco says:
Indeed, GCC’s x86 developers seem opposed to the idea of unrolling even though there is clear evidence it helps modern cores…
Note that even though LLVM unrolls loops by default, it’s not able to unroll this particular loop, not even with -funroll-all-loops, while GCC does it with -funroll-loops.
Travis Downs says:
There is clear evidence that unrolling some loops helps on modern codes.
The evidence is less clear that aggressively unrolling all loops helps.
Almost every loop is helped, in isolation, by unrolling. Many loops are hurt, in a real application, by unrolling.
I don’t think the compiler vendors can solve that compromise: there is no absolutely correct answer. clang errs one way, gcc another. There is no happy medium because the right answer for a single function can be different depending on the application it is ultimately compiled into.
Only full LTO combined with full and accurate PGO can solve this, so we might as well give up on that dream for now. We have to rely on developers to annotate key loops.
Also, not branching (unrolling) is faster than branching.