Though SIMD instructions and the corresponding vectorization of the code are a form of “loop unrolling”, I implicitly ignored them for the purpose of this blog post (if you check my code, you’ll notice that I deliberately disabled vectorization).
This being said, it would not change the story much: the reason vectorization speeds things up is that it drastically reduces the number of instructions.
Why doesn’t the processor have for loops built in? Set the number of instructions per loop body and the number of iterations, then start the loop. This would get you the best of both worlds: less overhead from running compare and jump instructions, and less overhead from the total size of the instructions.
Processors do have optimizations specific to loops, but there is always some overhead. No free lunch.
Robert says:
Also, branch predictions …
Pan Zhang says:
Well, I think the processor probably has a microarchitecture component called the Loop Stream Detector (LSD) that detects the loop and caches the decoded instructions for the next iteration. Therefore, the performance improvement of the unrolled loop does not come from having fewer instructions to execute; on the contrary, it is that having more instructions to execute in one iteration enables the CPU back-end to execute them in a more parallel manner.
My expectation goes in the other direction. Techniques like the LSD are meant to accelerate tight loops. If you unroll too much, you lose these processor optimizations.
How does this compare with unrolling using fmadd in SSE/AVX?