You can generate marginally better scalar ARM64 Clang code by switching the
pos2 = (v2 <= v1) ? pos2 + 1 : pos2;
to
pos2 = (v1 >= v2) ? pos2 + 1 : pos2;
which allows reuse of the initial cmp.
(Yes, it’s dumb that clang doesn’t catch this. Sigh…)
By my rough reckoning the M1’s critical path through the code is the use of the three integer units that handle flags/branches, and the above modification removes one instruction that requires these units.
(The extent to which that rough calculation represents the true gating depends on just how much time is spent in that loop, allowing all the possible instruction unrolling overlap to occur.)
The clang codegen still sucks, however. You can remove the increment of pos via the horrible but traditional
*output_buffer++
(Store the incoming value of output_buffer at the start, and subtract it away to get the return value. Compiler should do that!)
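For concreteness, here is an illustrative sketch of both tweaks (the function name union2by2_ptr_bump and the overall structure are mine, not the post's benchmark code): the flipped comparison so one cmp can feed both updates, and an output pointer bump with the starting pointer saved and subtracted at the end.

#include <stddef.h>
#include <stdint.h>

size_t union2by2_ptr_bump(const uint32_t *input1, size_t size1,
                          const uint32_t *input2, size_t size2,
                          uint32_t *output_buffer) {
    uint32_t *const start = output_buffer;            // remember where the output began
    size_t pos1 = 0, pos2 = 0;
    while ((pos1 < size1) && (pos2 < size2)) {
        uint32_t v1 = input1[pos1];
        uint32_t v2 = input2[pos2];
        *output_buffer++ = (v1 <= v2) ? v1 : v2;      // no separate pos variable
        pos1 = (v1 <= v2) ? pos1 + 1 : pos1;
        pos2 = (v1 >= v2) ? pos2 + 1 : pos2;          // same operand order as the line above
    }
    while (pos1 < size1) *output_buffer++ = input1[pos1++];  // copy the tails
    while (pos2 < size2) *output_buffer++ = input2[pos2++];
    return (size_t)(output_buffer - start);           // subtract to recover the count
}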
The other clang-ism that looks suboptimal is that it insists on dumb scheduling, splitting the cmp from the subsequent b.hs. No decent ARM64 core will care about the scheduling of one instruction between the cmp and the b.hs, but most will lose the opportunity to fuse the cmp and the b.hs into a single executable op.
Since this removes a second instruction from our critical path of “instructions that use the flags/branch units” it’s probably worth forcing in assembly if this were production code. (Of course then you’d go vector, I assume…)
[Also, just on style grounds, I think it’s better to have
while ((pos1 < size1) && (pos2 < size2)) {
ie replace the & with an && even though that doesn’t affect code gen.]
Thanks.
Yes. Before using assembly, I would definitely use SIMD instructions.
I use…
A & B
to indicate that both A and B can be evaluated, as opposed to
A && B
where I expect B to be skipped when A is false.
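A minimal illustration of the difference, using a hypothetical check helper that logs when it runs (not from the post):

#include <stdbool.h>
#include <stdio.h>

static bool check(const char *label, bool value) {
    printf("evaluated %s\n", label);   // show which operands actually run
    return value;
}

int main(void) {
    // With &, both sides are always evaluated (no short-circuit branch is required).
    bool a = check("lhs of &", false) & check("rhs of &", true);
    // With &&, the right side is skipped because the left side is false.
    bool b = check("lhs of &&", false) && check("rhs of &&", true);
    printf("%d %d\n", a, b);
    return 0;
}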
Veedrac says:
Sort the input arrays such that input1 will always be exhausted first. This removes a comparison from the loop and deletes one of the trailing loops.
Use pos1 += v1 <= v2; pos2 += v2 <= v1; to force branchless execution.
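A hedged sketch combining both suggestions, assuming the set-union setting of the post (the function name union2by2_swapped and the structure are illustrative, not the actual benchmark code). Swapping so that input1 always ends first lets the main loop test only pos1; the branchless increments handle ties by advancing both cursors while writing the value once.

#include <stddef.h>
#include <stdint.h>

size_t union2by2_swapped(const uint32_t *input1, size_t size1,
                         const uint32_t *input2, size_t size2,
                         uint32_t *output) {
    if (size1 == 0) { for (size_t i = 0; i < size2; i++) output[i] = input2[i]; return size2; }
    if (size2 == 0) { for (size_t i = 0; i < size1; i++) output[i] = input1[i]; return size1; }
    if (input1[size1 - 1] > input2[size2 - 1]) {   // make input1 the array that runs out first
        const uint32_t *tp = input1; input1 = input2; input2 = tp;
        size_t ts = size1; size1 = size2; size2 = ts;
    }
    size_t pos1 = 0, pos2 = 0, pos = 0;
    while (pos1 < size1) {                          // only input1 can be exhausted here
        uint32_t v1 = input1[pos1], v2 = input2[pos2];
        output[pos++] = (v1 <= v2) ? v1 : v2;
        pos1 += (v1 <= v2);                         // branchless increments
        pos2 += (v2 <= v1);
    }
    while (pos2 < size2) output[pos++] = input2[pos2++];  // single trailing loop
    return pos;
}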
Alexey M. says:
I wonder if
pos1 = (v1 <= v2) ? pos1 + 1 : pos1;
could be changed to
pos1 += !!(v1 <= v2);
?
George Spelvin says:
You don’t need the “!!”; <= already returns a boolean (exclusively 0 or 1) result.
I do like to wrap such conditional expressions in () just to be clear, i.e.
pos1 += (v1 <= v2);
But, yes, such an expression makes clear to readers that I expect branch-free code. When the increment amount isn’t 1, then it’s more of a toss-up between:
pos += (condition) ? 42 : 0;
pos += -(condition) & 42;
The former is more readable, but the latter is closer to what I hope most processors (other than ARM) will generate.
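For reference, the two spellings as standalone helpers (illustrative only; the names step_ternary and step_mask are mine). Whether either compiles to a conditional move, a csel, or an and-mask is up to the compiler and target.

#include <stddef.h>

size_t step_ternary(size_t pos, int condition) {
    return pos + (condition ? 42 : 0);                    // readable form
}

size_t step_mask(size_t pos, int condition) {
    size_t mask = (size_t)0 - (size_t)(condition != 0);   // 0 or all ones
    return pos + (mask & 42);                              // 0 or 42, with no branch in the source
}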
Storing the comparison result in a variable makes GCC use a conditional set/move: https://godbolt.org/z/zs3WzqjqP
Well, it looks like a size_t variable and a difference is better for GCC: https://godbolt.org/z/4n4963sKj
I have added a corrected version of your function to the benchmark on GitHub.
Sokolov Yura says:
There is a bug in both union2by2_branchless_variable and union2by2_branchless_ptr: !(v1 <= v2) != (v2 <= v1), i.e. the merge procedure drops one of two equal elements.
Sokolov Yura says:
and union2by2_branchless too.
Sokolov Yura says:
Ahh, I see: the task was to union sorted sets, i.e. arrays with all unique elements. I missed the “unique” part of the task.
I benchmarked it with LLVM: https://quick-bench.com/q/tNdjEe-55p6zPjDGF-jBIgDYJAQ
The bm_union2by2_branchless_ptr version is not as fast as bm_union2by2_branchless_variable and bm_union2by2. Why? Only because it uses pointers?
I do not know the answer in this particular instance, but it is not uncommon for code written directly in terms of pointers to be slower due to the overhead of doing arithmetic in address space. Compilers often do a better job when you use indexes. For speed, I recommend working with indexes by default unless you have a specific reason to prefer pointer arithmetic.
I included the pointer approach in my code because someone said it was faster.
Per Vognsen says:
There are two relatively simple ways to extract more instruction-level parallelism from a branchless merge routine since it’s so latency limited; I estimate the carried dependency latency as 6 cycles on Zen 3. The simplest is to merge from the front and back at the same time. The next level beyond that is to take the middle element (median) of the larger array and split the smaller array by it using one binary search. That splits the problem into two subproblems to which you can apply bidirectional merging. Splitting by the median like this is the idea behind parallel merge sort, but it also works well for extracting ILP.
Per Vognsen says:
Here’s a simple benchmark I just whipped up. For large arrays with millions of random uint32 elements, on my Zen 3 laptop the unidirectional merge1 function runs in 7 cycles/elem and the bidirectional merge2 function runs in 3.5 cycles/elem. The binary search median splitting is the same idea but with 4 merge frontiers instead of 2 (you can add more frontiers but register pressure becomes an issue).
https://gist.github.com/pervognsen/fa2a8cba0dc852094067348a57492ecb
And just after I posted that code, I realized this doesn’t apply directly to sorted set union/intersection/difference (at least not without intermediate buffers or a final compaction pass) since you don’t know the cardinality of the final result ahead of time. (I wonder if that is an argument in favor of representing sorted sets as a list or tree of array chunks rather than forcing a flat array.)
If nothing else, hopefully these ideas for extracting more parallelism from branchless merging (as used in merge sort) will be helpful to people who haven’t seen it before.
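For readers who want the shape of the bidirectional idea without reading the gist, here is a minimal, guarded sketch (not Per Vognsen's code, not branchless, and a plain merge that keeps duplicates rather than a set union): two frontiers run in the same loop, one emitting the smallest remaining element at the front, the other the largest remaining element at the back, so the two dependency chains can overlap.

#include <stddef.h>
#include <stdint.h>

void merge_bidirectional(const uint32_t *a, size_t n,
                         const uint32_t *b, size_t m,
                         uint32_t *out) {
    size_t la = 0, lb = 0, lo = 0;       // front frontier: next slot to fill from the left
    size_t ha = n, hb = m, hi = n + m;   // back frontier: one past the next slot from the right
    while (hi - lo >= 2) {               // each pass emits one element at each end
        // Front: emit the smallest remaining element (a wins ties).
        if (lb == hb || (la < ha && a[la] <= b[lb]))
            out[lo++] = a[la++];
        else
            out[lo++] = b[lb++];
        // Back: emit the largest remaining element (a wins ties).
        if (lb == hb || (la < ha && a[ha - 1] >= b[hb - 1]))
            out[--hi] = a[--ha];
        else
            out[--hi] = b[--hb];
    }
    if (lo < hi)                         // odd total length: exactly one element is left
        out[lo] = (la < ha) ? a[la] : b[lb];
}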
Nathan Kurz says:
Some quick thoughts having only explored this a little bit:
1) I think Veedrac has a good suggestion above to remove a comparison from the main loop. Instead of checking to see if you have reached the end of either list on each iteration, if you “peek” at the end of each list at the beginning you only have to check the list that you know will end first. That is, add a wrapper function that compares the last element of each list, and swaps input1/2 and size1/2 before starting the merge. This removes a comparison per iteration.
2) While assigning the result of the comparison to a variable does convince GCC to use a fully branchless approach, it still produces longer code than Clang. The difference (as you and others have probably already realized) is that Clang realizes that only a single comparison is necessary if it uses “sbb” (subtract with borrow). Since a comparison is internally just a subtraction, this might not save much, but it does simplify the inner loop.
3) In union2by2_branchless_variable, the use of “d” and “d2” to store the comparison results seems like a parody of bad computer science naming conventions. Given how much you value code clarity, I’m surprised by your use of such unhelpful variable names. Maybe “use_v1” and “use_v2”? I ended up with this: https://godbolt.org/z/53rhaoebc (with a wrapper function needed to possibly reorder the inputs, and a tail needed to copy the remaining elements).
Good points.
Note that the first optimization applies no matter how you compute the union.
I have actually implemented it (see GitHub).
Sadly, it is an optimization that will be made useless when bounds checks are present (I expect).
Thomas Müller says:
The galloping mode of Timsort comes to my mind. It can reduce the number of comparisons. The problem might be slightly different however (union vs union all).
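For context, galloping amounts to probing ahead at exponentially growing offsets on the side that keeps winning, then finishing with a binary search. A small illustrative helper (the name gallop_lower_bound is mine, not tied to the post's code): when one input wins many comparisons in a row, such a search lets you copy a whole run at once instead of advancing one element per comparison.

#include <stddef.h>
#include <stdint.h>

// Return the number of leading elements of in[0..size) that are strictly less than key:
// probe at offsets 1, 2, 4, 8, ... (the "gallop"), then binary-search the bracketed range.
size_t gallop_lower_bound(const uint32_t *in, size_t size, uint32_t key) {
    size_t hi = 1;
    while (hi < size && in[hi] < key) hi *= 2;   // exponential probing
    size_t lo = hi / 2;                          // last probe known to be < key (or 0)
    if (hi > size) hi = size;
    while (lo < hi) {                            // standard lower-bound binary search
        size_t mid = lo + (hi - lo) / 2;
        if (in[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}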
alecco says:
There are vectorized merges; see “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture” (Intel, 2010), section 4.2.3, Merging Two Sorted Lists.