16th August 2013, 12 min read

Picking N distinct numbers at random: how to do it fast?

15 thoughts on “Picking N distinct numbers at random: how to do it fast?”

Peter Turney says:

August 24, 2013 at 1:53 pm

I wonder how this compares?

(1) fill an array with 0 to Max random floating point numbers

(2) apply an index sort to the array (sort the array and return an index of elements in ascending order)

(3) output the first N values in the index
Peter Turney says:

August 24, 2013 at 5:52 pm

In Perl Data Language, this algorithm takes two lines:

$result = qsorti(random($max));
p $result(0:($n-1));
Daniel Lemire says:

August 26, 2013 at 6:32 am

@Peter Turney

I don’t think your numbers are going to be distinct. In theory, it is possible that your approach would pick just one distinct value (repeated many times).

Update: I misread Peter’s description.
Peter Turney says:

August 26, 2013 at 6:37 am

You’re wrong. I’ve been using exactly this code for years.
Daniel Lemire says:

August 26, 2013 at 6:49 am

@Peter Turney

I misread your algorithm. Sorry.

Still, there is a probability (very small indeed) that all your floating point numbers are going to identical. In this sense, your algorithm is probabilistic… with good probability, it will solve the problem.

I’ll add it to my benchmark later. Thanks.
Peter Turney says:

August 26, 2013 at 6:50 am

An example might help:

pdl> $ran=random(4)

pdl> p $ran

[0.9474 0.8675 0.7389 0.4402]

pdl> $sort=qsorti($ran)

pdl> p $sort

[3 2 1 0]

(p = print; floats are truncated for display purposes; sort from smallest to largest; qsorti = quick sort and return index)

When you’re using a high-level language (Perl Data Language, Matlab, etc.), you learn to avoid explicit loops by calling built-in functions that implicitly loop over vectors and matrices. In a high-level language, an algorithm without explicit loops is almost always much faster than an algorithm with explicit loops.
Peter Turney says:

August 26, 2013 at 6:53 am

“Still, there is a probability (very small indeed) that all your floating point numbers are going to identical. In this sense, your algorithm is probabilisticâ€¦ with good probability, it will solve the problem.”

If all the floating point numbers are identical and the sort preserves the original order when there is a tie, then the output will be [0 1 2 3 …]. It is not a problem if this output happens from time to time. It is a valid output, as long as it doesn’t happen too often.
1. Mandeep says:
  
  October 31, 2018 at 5:48 am
  
  Quick sort isn’t stable, so the relative ordering of elements might change (I’m assuming qsort.. of perl is using quick sort). OTOH if its using merge sort, then you’re right, the order will be preserved.
Daniel Lemire says:

August 26, 2013 at 7:15 am

@Peter Turney

That’s right, but even a tie between two floating point numbers, if ties are always resolved in the same deterministic manner, will introduce a bias.

I assume that Perl will use 64-bit floating point numbers… in such a case, your algorithm to generate distinct 32-bit integers has a negligible bias.

(I should note that even my algorithms have biases in practice. They are only free of biases if I assume that I have a perfect random number generators.)

Of course, the interesting question is speed. The way you describe your algorithm, it runs in time O(Max), so that we might expect that when N is much smaller than Max, then your algorithm is slow compared to the hash set approach. My instinct is that even when N is close to Max, your algorithm is slower than the bitmap approach. Of course, I’ll need to verify this more seriously.
Peter Turney says:

August 26, 2013 at 7:37 am

“The way you describe your algorithm, it runs in time O(Max)”

Actually O(Max log(Max)) due to sorting. But if you implement all the algorithms in a high-level language, I’m guessing mine will run the fastest, due to the lack of explicit loops. For me, the time and effort I save by writing programs in a high-level language is more important than the speed I can get from the computer by working in a lower-level language (in most cases, with some rare exceptions).
Peter Turney says:

August 26, 2013 at 7:48 am

Also, what if you don’t want the final list in sorted order? What if the task is to shuffle the numbers in the list [0,Max]? (This is a typical task for my work.) Then your N/2 select/drop inversion trick is not applicable.
Daniel Lemire says:

August 26, 2013 at 10:25 am

@Peter Turney

As for focusing on the performance one gets using Perl… see my post “The language interpreters are the new machines”: http://lemire.me/blog/archives/2011/06/14/the-language-interpreters-are-the-new-machines/

Much of the traditional computer science is focused on designing algorithms from basic operations (and this is what I do here with C++ and Java), but this is increasingly less relevant.

But not everything is black and white. For some problems, it is definitively worth it to get a 10x speed-up. Google’s backend could not be written in pure Perl. 😉

Back to the problem at hand…

If you want to shuffle a list, then a Fisherâ€“Yates shuffle is probably best. Such shuffling is part of Java, C++(STL) and Python. I don’t know about Perl but I have read online that you can find a shuffle function in List::Util. So I would argue that in many instances, you shouldn’t code a list shuffling algorithm by hand.
Demofox says:

June 30, 2016 at 1:44 pm

You should check out format preserving encryption.

The basic idea is this: hashing has collisions, but encryption does not, so you can encrypt an int as you increment it to get distinct random numbers. A (solvable) challenge is finding an encryption algorithm that tightly fits the range of numbers you are generating.

http://blog.demofox.org/2013/07/06/fast-lightweight-random-shuffle-functionality-fixed/
zv says:

May 24, 2019 at 7:13 am

A trick I picked up awhile ago was using block ciphers to generate N distinct random numbers.

For example, here’s an example of an (unbalanced) Feistel network which does just this:

uint16_t feistel(uint16_t counter, unsigned period) { uint16_t d = sqrt(period); uint16_t s = period - d*d; uint16_t q = counter / d; uint16_t r = counter % s;
for (int i = 0; i < 3; i++) { uint16_t nr = r; uint16_t F = (i * r + q); r = F % d; q = nr; } return q*d + r; }
int main() { for(int i = 1; i <= 16; i++) { printf("%d %d\n", i, feistel(i, 15)); } }

You can achieve a perfect permutation for any N (period here) by adjusting how nr ae
1. Gé Weijers says:
  
  May 31, 2019 at 11:42 pm
  
  These methods are producing permutations, but it’s a vanishingly small set of the possible ones.
  
  A 100 element set has 100! possible permutations, and your algorithm can select no more than 65536 of them.