I was rather impressed with SVE and thought of it as a clearly superior vector instruction set. Having to do manual loop unrolling for performance would negate much of what makes SVE such a nice instruction set.
It is notable, however, that the problem you benchmarked is basically two instructions, plus a load/store pair, plus loop overhead. Most actual code I deal with is much larger, reducing the relative cost of the loop overhead. That also reduces the benefit of loop unrolling and I would typically not bother.
Even if some loops are as simple as your example, they typically don’t dominate runtime and one should concentrate more on other code. So after my initial shock, I don’t think this is such a big problem for SVE.
Still, it is good to know about such problems. Thank you for highlighting them!
My post is not meant to imply that there is a problem with SVE.
Travis Downs says:
I would disagree that small loops like this are uncommon.
Small loops like this are common, and form the basis for optimized versions of common library routines like memcpy, strlen, memchr, and routines in higher level languages, etc. They also form important primitives in applications like databases where you might wish to take the bitwise AND or OR of two bitmaps, etc.
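As a minimal sketch of the kind of bitmap primitive mentioned here (plain C, names hypothetical), the whole loop body is a load pair, one AND, and a store, so loop overhead is a large fraction of the work:

```c
#include <stddef.h>
#include <stdint.h>

// Bitwise AND of two bitmaps, word by word. This is the sort of tiny
// loop that shows up in database bitmap indexes: simple enough that a
// compiler can auto-vectorize it, with loop overhead dominating the
// per-iteration cost unless it is amortized.
void bitmap_and(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                size_t nwords) {
    for (size_t i = 0; i < nwords; i++) {
        dst[i] = a[i] & b[i];
    }
}
```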
Furthermore, small loops are the ones where you stand the best chance of getting a good auto-vectorization out of the compiler, further increasing their importance under vectorization.
In my experience there is a *huge* amount to gain from modest unrolls of 2-8 iterations of many real-world vectorized (and not vectorized) loops, even without SVE.
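To illustrate what such a modest unroll looks like (a hypothetical scalar sketch in plain C, not SVE code), a 4x unroll amortizes the counter update, compare, and branch over four elements, with a short tail loop for the remainder:

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical example: summing an array with a 4x unroll. The
// unrolled body executes one set of loop-overhead instructions per
// four elements instead of per element.
uint64_t sum_unrolled(const uint64_t *v, size_t n) {
    uint64_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    }
    for (; i < n; i++) {  // scalar tail: the remaining 0-3 elements
        s += v[i];
    }
    return s;
}
```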
Samuel Lee says:
Indeed! Being able to handle the small-buffer tail of an otherwise unrolled loop with a few predicated SVE instructions is much cleaner than having to fall back to scalar code.
For large inputs, unrolling can definitely be beneficial; not only do you reduce the proportion of instructions that are doing the loop handling, in many cases you can reduce the dependencies between instructions (e.g. if instructions in the body of the loop depend on the loop counter, you can end up serializing on updates to the loop counter, reducing the ability to take advantage of instruction-level parallelism).
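One common way to reduce that kind of serialization (a sketch in plain C rather than SVE; the same idea applies with vector registers) is to carry independent accumulators, so the work in one iteration does not depend on the previous iteration's result:

```c
#include <stddef.h>
#include <stdint.h>

// Sketch: two independent accumulators break the loop-carried
// dependency chain on a single running sum, letting the CPU overlap
// the additions (instruction-level parallelism).
uint64_t sum_two_accumulators(const uint64_t *v, size_t n) {
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += v[i];      // these two additions are independent...
        s1 += v[i + 1];  // ...and can execute in the same cycle
    }
    if (i < n) s0 += v[i];  // odd-length tail
    return s0 + s1;
}
```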
These benefits are independent of the instruction set, provided you have enough registers to play with.
In this case it is also a benefit because the throughput of predicate-handling instructions appears to be limited on the V1, and in the unrolled loop we can make assumptions that reduce the proportion of instructions that use this critical resource.
I think ideally compilers would be able to automatically do this sort of unrolling of SVE code in the future (whether autovectorized or intrinsics).
> since it is best on
I suspect you meant “based on”.
Correct. Thanks.
Thank you both for doing this work!