Daniel Lemire's blog


Removing duplicates from lists quickly

13 thoughts on “Removing duplicates from lists quickly”

  1. qznc says:

    As a compiler writer, I can nitpick a little about your “branchless” version. It does not matter for optimizing compilers, if you use a conditional “newv != oldv” as an if condition or not. The significant change is to make the store instruction unconditional (moving it out of the if branch). So `if (newv != oldv) { pos++; }` should be as fast as hope_unique.
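    The two variants can be sketched as follows (hypothetical code, not the exact functions benchmarked in the post):

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // Remove adjacent duplicates from a sorted array, writing to out.
    // The store is unconditional; only the increment depends on the
    // comparison, which optimizing compilers lower to branchless code.
    size_t dedup_unconditional_store(uint32_t *out, const uint32_t *in, size_t n) {
        size_t pos = 0;
        uint32_t oldv = n ? in[0] + 1 : 0; // any value differing from in[0]
        for (size_t i = 0; i < n; i++) {
            uint32_t newv = in[i];
            out[pos] = newv;             // unconditional store
            if (newv != oldv) { pos++; } // compiles to cmov/setne, no jump
            oldv = newv;
        }
        return pos;
    }
    ```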

  2. Great point, but does the optimizing compiler completely remove the branching operation in this case (and also in the case of a ternary operator)? Could it be (perhaps I am just fantasizing) that CPUs can somehow optimize the pipeline in the case of short jumps (without further branches)? For example, if you need to skip a few instructions, you don’t need to purge the pipeline completely; it has likely already started processing the instructions after the short jump. So, if the jump happens when it’s not expected, we can simply add a few micro-ops and continue (and discard the instructions from the failed branch). This should be much faster, right?

    1. does the optimizing compiler completely remove the branching operation

      Yes, as far as the generated assembly goes. However, different compilers implement it differently: the Intel compiler seems to rely on cmov, whereas GCC favors setne followed by a move.

      For example, if you need to skip a few instructions, you don’t need to purge the pipeline completely; it has likely already started processing the instructions after the short jump. So, if the jump happens when it’s not expected, we can simply add a few micro-ops and continue (and discard the instructions from the failed branch). This should be much faster, right?

      In this case, there is a data dependency between the steps, plus we have memory accesses.

  3. Nice post, great that you publish the AVX code. Would you care to elaborate what it does at a high level? I think this would be quite helpful.

    1. I really should explain it better, shouldn’t I? More later…

      1. This would be helpful; please also see my other question above. I didn’t check this, but does the CPU really have a branchless increment that can be used instead of the explicit branch that would otherwise be needed to implement a ternary operator?
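        There is indeed an explicitly branchless way to write the increment in C++, since a comparison yields 0 or 1 (a sketch, not the post’s exact code):

        ```cpp
        #include <cstddef>
        #include <cstdint>

        // pos += (newv != oldv) adds the 0/1 result of the comparison directly;
        // on x86, compilers typically emit setne followed by an add, with no jump.
        size_t advance_if_new(size_t pos, uint32_t newv, uint32_t oldv) {
            return pos + (newv != oldv);
        }
        ```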

  4. Ian Henderson says:

    Could you rearrange uniqshuf to avoid the call to _mm256_permutevar8x32_epi32(recon,mbom)?

    1. I don’t see how, but I’d be interested in any alternative design you might have.

  5. Matt says:

    If you remove duplicates by sorting the vector first, then isn’t the O(n log n) sort going to be the slow part, rather than the subsequent O(n) removal of adjacent duplicates?

    For sufficiently large n, I’d have expected an O(n) hash set to be faster because you don’t need to sort the input first. However, I’d also expect a hash set to be hard to vectorize. I wonder how big n has to be before a hash set is faster in practice.
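    The hash-set alternative might be sketched like this (expected O(n), preserving first-occurrence order, but with worse memory locality than the sorted approaches):

    ```cpp
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // Deduplicate without sorting: keep each value the first time it is seen.
    std::vector<uint32_t> dedup_hash(const std::vector<uint32_t> &in) {
        std::unordered_set<uint32_t> seen;
        std::vector<uint32_t> out;
        out.reserve(in.size());
        for (uint32_t v : in) {
            if (seen.insert(v).second) // .second is true when v was not yet present
                out.push_back(v);
        }
        return out;
    }
    ```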

  6. Nice post. Did you submit the “branchless” improvements to libstdc++/libc++?

    1. It is possible that this type of optimization violates the STL specification.

      The STL is not designed to be as fast as possible, but rather as generic as possible.
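      Indeed, std::unique is only specified to remove *consecutive* duplicates, for any forward iterator and any equality predicate, which is why the portable idiom sorts first (a sketch):

      ```cpp
      #include <algorithm>
      #include <cstdint>
      #include <vector>

      // The standard sort + unique + erase idiom for full deduplication.
      std::vector<uint32_t> dedup_stl(std::vector<uint32_t> v) {
          std::sort(v.begin(), v.end()); // make duplicates adjacent
          v.erase(std::unique(v.begin(), v.end()), v.end());
          return v;
      }
      ```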

  7. Jerry Coffin says:

    > In C++, we have an STL function for this very purpose: std::unique. On a recent Intel processor, it takes over 11 cycles per value in the array.

    You really need to qualify this much more carefully. In particular, there is no “it” to discuss here. There are several different implementations of the standard library. You may be using one that happens to run at this speed, but I might be using one that’s half as fast, or twice as fast (or one that uses AVX code to implement it, so it’s as fast as your fastest version).

    The standard specifies an interface. It does include complexity guarantees, but producing the best implementation for a particular target (architecture, OS, whatever) is left to…the implementation.

    1. You are correct. I am taking liberties, but I justify it by providing convenient access to my source code.