I don’t think the reservoir sampling is suitable here: the technique was explicitly designed to handle situations where the exact population size, N, is not known upfront. In particular it requires N steps in every N-choose-K scenario, even when only K steps are necessary. I believe Floyd’s sampling algorithm should work better, since it is explicitly O(K).
You are correct that reservoir sampling is applicable even when the size of the array is unknown. I don’t think that it makes it unadvisable in other instances.
In my context, I am interested in sampling a fixed fraction of the input array (e.g., 50%). In such cases, the best possible complexity is linear in the size of the input.
My choice of algorithms and my implementation is surely not optimal, and certainly not in all cases. I’d be very interested in a new, better implementation that can beat the 12 CPU cycles per input element I get. Can you please provide a code sample that I can benchmark?
A Subscribersays:
I apologize for jumping to conclusions way too quickly and for shouting too loudly. 🙁
You are certainly correct that changing the sampling algorithm would buy absolutely nothing in the 50% case. Specifically, reservoir sampling requires N-K RNG calls while Floyd’s requires K. Obviously, these are exactly same numbers.
Your comment was interesting and forced me to elaborate on the context.
When sampling small numbers of values, you are quite correct that my approach would be inefficient.
Thanks for commenting.
Daniel Watsonsays:
This seems like a good opportunity for the Merge algorithm from the merge shuffle paper. It merges two shuffled arrays into a single shuffled array. (It can also be vectorized with AVX and likely quite well with AVX512).
Unfortunately it requires N-permutations of the two source arrays. Vectorized shuffling would help but with large arrays random memory access might slow things down.
A quick impl of Merge + sse41_xorshift128plus_partialshuffle32 shows ~3.8 speedup, using your code.
Daniel Watsonsays:
Turns out a vectorized reservoir sampling (with _mm_maskstore_ps) is even faster, by about 2.5x.
I don’t think the reservoir sampling is suitable here: the technique was explicitly designed to handle situations where the exact population size, N, is not known upfront. In particular it requires N steps in every N-choose-K scenario, even when only K steps are necessary. I believe Floyd’s sampling algorithm should work better, since it is explicitly O(K).
You are correct that reservoir sampling is applicable even when the size of the array is unknown. I don’t think that it makes it unadvisable in other instances.
In my context, I am interested in sampling a fixed fraction of the input array (e.g., 50%). In such cases, the best possible complexity is linear in the size of the input.
My choice of algorithms and my implementation is surely not optimal, and certainly not in all cases. I’d be very interested in a new, better implementation that can beat the 12 CPU cycles per input element I get. Can you please provide a code sample that I can benchmark?
I apologize for jumping to conclusions way too quickly and for shouting too loudly. 🙁
You are certainly correct that changing the sampling algorithm would buy absolutely nothing in the 50% case. Specifically, reservoir sampling requires N-K RNG calls while Floyd’s requires K. Obviously, these are exactly same numbers.
I am very sorry for not paying attention.
Your comment was interesting and forced me to elaborate on the context.
When sampling small numbers of values, you are quite correct that my approach would be inefficient.
Thanks for commenting.
This seems like a good opportunity for the
Merge
algorithm from the merge shuffle paper. It merges two shuffled arrays into a single shuffled array. (It can also be vectorized with AVX and likely quite well with AVX512).Unfortunately it requires N-permutations of the two source arrays. Vectorized shuffling would help but with large arrays random memory access might slow things down.
A quick impl of
Merge
+sse41_xorshift128plus_partialshuffle32
shows ~3.8 speedup, using your code.Turns out a vectorized reservoir sampling (with
_mm_maskstore_ps
) is even faster, by about 2.5x.