29th November 2018, 5 min read

Quickly sampling from two arrays (C++ edition)

6 thoughts on “Quickly sampling from two arrays (C++ edition)”

A Subscriber says:

December 1, 2018 at 3:11 pm

I don’t think the reservoir sampling is suitable here: the technique was explicitly designed to handle situations where the exact population size, N, is not known upfront. In particular it requires N steps in every N-choose-K scenario, even when only K steps are necessary. I believe Floyd’s sampling algorithm should work better, since it is explicitly O(K).
1. Daniel Lemire says:
  
  December 1, 2018 at 3:24 pm
  
  You are correct that reservoir sampling is applicable even when the size of the array is unknown. I don’t think that it makes it unadvisable in other instances.
  
  In my context, I am interested in sampling a fixed fraction of the input array (e.g., 50%). In such cases, the best possible complexity is linear in the size of the input.
  
  My choice of algorithms and my implementation is surely not optimal, and certainly not in all cases. I’d be very interested in a new, better implementation that can beat the 12 CPU cycles per input element I get. Can you please provide a code sample that I can benchmark?
  1. A Subscriber says:
    
    December 1, 2018 at 5:35 pm
    
    I apologize for jumping to conclusions way too quickly and for shouting too loudly. 🙁
    
    You are certainly correct that changing the sampling algorithm would buy absolutely nothing in the 50% case. Specifically, reservoir sampling requires N-K RNG calls while Floyd’s requires K. Obviously, these are exactly same numbers.
    
    I am very sorry for not paying attention.
    1. Daniel Lemire says:
      
      December 1, 2018 at 6:10 pm
      
      Your comment was interesting and forced me to elaborate on the context.
      
      When sampling small numbers of values, you are quite correct that my approach would be inefficient.
      
      Thanks for commenting.
Daniel Watson says:

December 2, 2018 at 5:38 am

This seems like a good opportunity for the Merge algorithm from the merge shuffle paper. It merges two shuffled arrays into a single shuffled array. (It can also be vectorized with AVX and likely quite well with AVX512).

Unfortunately it requires N-permutations of the two source arrays. Vectorized shuffling would help but with large arrays random memory access might slow things down.

A quick impl of Merge + sse41_xorshift128plus_partialshuffle32 shows ~3.8 speedup, using your code.
1. Daniel Watson says:
  
  December 6, 2018 at 1:56 am
  
  Turns out a vectorized reservoir sampling (with _mm_maskstore_ps) is even faster, by about 2.5x.