Daniel Lemire's blog

, 16 min read

Faster sorted array unions by reducing branches

21 thoughts on “Faster sorted array unions by reducing branches”

  1. Maynard+Handley says:

    You can generate marginally better scalar ARM64 Clang code by switching the

    pos2 = (v2 <= v1) ? pos2 + 1 : pos2;


    pos2 = (v1 >= v2) ? pos2 + 1 : pos2;

    which allows reuse of the initial cmp.
    (Yes, it’s dumb that clang doesn’t catch this. Sigh…)

    By my rough reckoning the M1’s critical path through the code is the use of the three integer units that handle flags/branches, and the above modification removes one instruction that requires these units.
    (The extent to which that rough calculation represents the true gating depends on just how much time is spent in that loop, allowing all the possible instruction unrolling overlap to occur.)

    The clang codegen still sucks, however. You can remove the increment of pos, via the horrible but traditional

    (Store the incoming value of output_buffer at the start, and subtract it away to get the return value. Compiler should do that!)

    The other clang’ism that looks suboptimal is it insists on dumb scheduling, splitting the cmp from the subsequent b.hs. No decent ARM64 core will care about the scheduling of one instruction between the cmp and the b.hs; but most will lose the opportunity to fuse the cmp and the b.hs into a single executable op.

    Since this removes a second instruction from our critical path of “instructions that use the flags/branch units” it’s probably worth forcing in assembly if this were production code. (Of course then you’d go vector, I assume…)

    [Also, just on style grounds, I think it’s better to have

    while ((pos1 < size1) && (pos2 < size2)) {

    ie replace the & with an && even though that doesn’t affect code gen.]

    1. Thanks.

      Yes. Before using assembly, I would definitively use SIMD instructions.

      I use…

      A & B

      to indicate that both A and B can be evaluated as opposed to

      A && B

      where I expect only A to be evaluated if it is false.

  2. Veedrac says:

    Sort the input arrays such that input1 will always be exhausted first. This removes a comparison from the loop and deletes one of the trailing loops.

    Use pos1 += v1 <= v2; pos2 += v2 <= v1; to force branchless execution.

  3. Alexey M. says:

    I wonder if

    pos1 = (v1 <= v2) ? pos1 + 1 : pos1;

    could be changed to

    pos1 += !!(v1 <= v2);


    1. George Spelvin says:

      You don’t need the “!!”; <= already returns a boolean (exclusively 0 or 1) result.

      I do like to wrap such conditional expressions in () just to be clear, i.e.
      pos1 += (v1 <= v2);

      But, yes, such an expression makes clear to readers that I expect branch-free code. When the increment amount isn’t 1, then it’s more of a toss-up between:
      pos += (condition) ? 42 : 0;
      pos += -(condition) & 42;
      The former is more readable, but the latter is closer to what I hope most processors (other than ARM) will generate.

  4. Sokolov Yura says:

    Storing comparison result in variable makes GCC to use conditional set/move: https://godbolt.org/z/zs3WzqjqP

    1. Sokolov Yura says:

      Well, looks like size_t variable and difference is better for GCC https://godbolt.org/z/4n4963sKj

      1. I have added a corrected version of your function to the benchmark on github.

        1. Sokolov Yura says:

          There is a bug in both union2by2_branchless_variable and union2by2_branchless_ptr: !(v1 <= v2) != (v2 <= v1) , ie merge procedure drops one of two equal elements.

        2. Sokolov Yura says:

          and union2by2_branchless too.

        3. Sokolov Yura says:

          Ahh, I see: task were to union sorted sets ie arrays with all unique elements. I missed “unique” part of a task.

  5. Ted says:

    in llvm , I bench it, https://quick-bench.com/q/tNdjEe-55p6zPjDGF-jBIgDYJAQ
    bm_union2by2_branchless_ptr version is not fast as bm_union2by2_branchless_variable and bm_union2by2, why? only because it’s pointer?

    1. I do not know the answer in this particular instance, but it is not uncommon for code written directly in terms of pointers to be slower due to the overhead of doing arithmetic in address space. Compilers often do a better job when you use indexes. For speed, I recommend working with indexes by default unless you have a specific reason to prefer pointer arithmetic.

      I included the pointer approach in my code because someone said it was faster.

  6. Per Vognsen says:

    There are two relatively simple ways to extract more instruction-level parallelism from a branchless merge routine since it’s so latency limited; I estimate the carried dependency latency as 6 cycles on Zen 3. The simplest is to merge from the front and back at the same time. The next level beyond that is to take the middle element (median) of the larger array and split the smaller array by it using one binary search. That splits the problem into two subproblems to which you can apply bidirectional merging. Splitting by the median like this is the idea behind parallel merge sort, but it also works well for extracting ILP.

    1. Per Vognsen says:

      Here’s a simple benchmark I just whipped up. For large arrays with millions of random uint32 elements, on my Zen 3 laptop the unidirectional merge1 function runs in 7 cycles/elem and the bidirectional merge2 function runs in 3.5 cycles/elem. The binary search median splitting is the same idea but with 4 merge frontiers instead of 2 (you can add more frontiers but register pressure becomes an issue).


      1. Per Vognsen says:

        And just as I say posted that code, I realized this doesn’t apply directly to sorted set union/intersection/difference (at least not without intermediate buffers or a final compaction pass) since you don’t know the cardinality of the final result ahead of time. (I wonder if that is an argument in favor of representing sorted sets as a list or tree of array chunks rather than forcing a flat array.)

        If nothing else, hopefully these ideas for extracting more parallelism from branchless merging (as used merge in sort) will be helpful to people who haven’t seen it before.

  7. Nathan Kurz says:

    Some quick thoughts having only explored this a little bit:

    1) I think Veedrac has a good suggestion above to remove a comparison from the main loop. Instead of checking to see if you have reached the end of either list on each iteration, if you “peek” at the end of each list at the beginning you only have to check the list that you know will end first. That is, add a wrapper function that compares the last element of each list, and swaps input1/2 and size1/2 before starting the merge. This removes a comparison per iteration.

    2) While assigning the result of the comparison to a variable does convince GCC to use a fully branchless approach, it still produces longer code than Clang. The difference (as you and others have probably already realized) is that Clang realizes that only a single comparison is necessary if it uses “sbb” (subtract with borrow). Since a comparison is internally just a subtraction, this might not save much, but it does simplify the inner loop.

    3) In union2by2_branchless_variable, the use of “d” and “d2” to store the comparison results seems like a parody of bad computer science naming conventions. Given how much you value code clarity, I’m surprised by your use of such unhelpful variable names. Maybe “use_v1” and “use_v2”? I ended up with this: https://godbolt.org/z/53rhaoebc (with a wrapper function needed to possibly reorder the inputs, and a tail needed to copy the remaining elements).

    1. Good points.

      Note that the first optimization applies no matter how you compute the union.

      1. I have actually implemented it (see GitHub).

        Sadly it is an optimization that will be made useless when bound checks are present (I expect).

  8. Thomas Müller says:

    The galloping mode of Timsort comes to my mind. It can reduce the number of comparisons. The problem might be slightly different however (union vs union all).

  9. alecco says:

    There are vectorized merges, see “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture” Intel (2010), 4.2.3 Merging Two Sorted Lists