I don’t think your numbers are going to be distinct. In theory, it is possible that your approach would pick just one distinct value (repeated many times).
Still, there is a probability (very small indeed) that all your floating point numbers are going to identical. In this sense, your algorithm is probabilistic… with good probability, it will solve the problem.
(p = print; floats are truncated for display purposes; sort from smallest to largest; qsorti = quick sort and return index)
When you’re using a high-level language (Perl Data Language, Matlab, etc.), you learn to avoid explicit loops by calling built-in functions that implicitly loop over vectors and matrices. In a high-level language, an algorithm without explicit loops is almost always much faster than an algorithm with explicit loops.
“Still, there is a probability (very small indeed) that all your floating point numbers are going to identical. In this sense, your algorithm is probabilistic… with good probability, it will solve the problem.”
If all the floating point numbers are identical and the sort preserves the original order when there is a tie, then the output will be [0 1 2 3 …]. It is not a problem if this output happens from time to time. It is a valid output, as long as it doesn’t happen too often.
Mandeepsays:
Quick sort isn’t stable, so the relative ordering of elements might change (I’m assuming qsort.. of perl is using quick sort). OTOH if its using merge sort, then you’re right, the order will be preserved.
That’s right, but even a tie between two floating point numbers, if ties are always resolved in the same deterministic manner, will introduce a bias.
I assume that Perl will use 64-bit floating point numbers… in such a case, your algorithm to generate distinct 32-bit integers has a negligible bias.
(I should note that even my algorithms have biases in practice. They are only free of biases if I assume that I have a perfect random number generators.)
Of course, the interesting question is speed. The way you describe your algorithm, it runs in time O(Max), so that we might expect that when N is much smaller than Max, then your algorithm is slow compared to the hash set approach. My instinct is that even when N is close to Max, your algorithm is slower than the bitmap approach. Of course, I’ll need to verify this more seriously.
“The way you describe your algorithm, it runs in time O(Max)”
Actually O(Max log(Max)) due to sorting. But if you implement all the algorithms in a high-level language, I’m guessing mine will run the fastest, due to the lack of explicit loops. For me, the time and effort I save by writing programs in a high-level language is more important than the speed I can get from the computer by working in a lower-level language (in most cases, with some rare exceptions).
Also, what if you don’t want the final list in sorted order? What if the task is to shuffle the numbers in the list [0,Max]? (This is a typical task for my work.) Then your N/2 select/drop inversion trick is not applicable.
Much of the traditional computer science is focused on designing algorithms from basic operations (and this is what I do here with C++ and Java), but this is increasingly less relevant.
But not everything is black and white. For some problems, it is definitively worth it to get a 10x speed-up. Google’s backend could not be written in pure Perl. 😉
Back to the problem at hand…
If you want to shuffle a list, then a Fisher–Yates shuffle is probably best. Such shuffling is part of Java, C++(STL) and Python. I don’t know about Perl but I have read online that you can find a shuffle function in List::Util. So I would argue that in many instances, you shouldn’t code a list shuffling algorithm by hand.
You should check out format preserving encryption.
The basic idea is this: hashing has collisions, but encryption does not, so you can encrypt an int as you increment it to get distinct random numbers. A (solvable) challenge is finding an encryption algorithm that tightly fits the range of numbers you are generating.
I wonder how this compares?
(1) fill an array with 0 to Max random floating point numbers
(2) apply an index sort to the array (sort the array and return an index of elements in ascending order)
(3) output the first N values in the index
In Perl Data Language, this algorithm takes two lines:
$result = qsorti(random($max));
p $result(0:($n-1));
@Peter Turney
I don’t think your numbers are going to be distinct. In theory, it is possible that your approach would pick just one distinct value (repeated many times).
Update: I misread Peter’s description.
You’re wrong. I’ve been using exactly this code for years.
@Peter Turney
I misread your algorithm. Sorry.
Still, there is a probability (very small indeed) that all your floating point numbers are going to identical. In this sense, your algorithm is probabilistic… with good probability, it will solve the problem.
I’ll add it to my benchmark later. Thanks.
An example might help:
pdl> $ran=random(4)
pdl> p $ran
[0.9474 0.8675 0.7389 0.4402]
pdl> $sort=qsorti($ran)
pdl> p $sort
[3 2 1 0]
(p = print; floats are truncated for display purposes; sort from smallest to largest; qsorti = quick sort and return index)
When you’re using a high-level language (Perl Data Language, Matlab, etc.), you learn to avoid explicit loops by calling built-in functions that implicitly loop over vectors and matrices. In a high-level language, an algorithm without explicit loops is almost always much faster than an algorithm with explicit loops.
“Still, there is a probability (very small indeed) that all your floating point numbers are going to identical. In this sense, your algorithm is probabilistic… with good probability, it will solve the problem.”
If all the floating point numbers are identical and the sort preserves the original order when there is a tie, then the output will be [0 1 2 3 …]. It is not a problem if this output happens from time to time. It is a valid output, as long as it doesn’t happen too often.
Quick sort isn’t stable, so the relative ordering of elements might change (I’m assuming qsort.. of perl is using quick sort). OTOH if its using merge sort, then you’re right, the order will be preserved.
@Peter Turney
That’s right, but even a tie between two floating point numbers, if ties are always resolved in the same deterministic manner, will introduce a bias.
I assume that Perl will use 64-bit floating point numbers… in such a case, your algorithm to generate distinct 32-bit integers has a negligible bias.
(I should note that even my algorithms have biases in practice. They are only free of biases if I assume that I have a perfect random number generators.)
Of course, the interesting question is speed. The way you describe your algorithm, it runs in time O(Max), so that we might expect that when N is much smaller than Max, then your algorithm is slow compared to the hash set approach. My instinct is that even when N is close to Max, your algorithm is slower than the bitmap approach. Of course, I’ll need to verify this more seriously.
“The way you describe your algorithm, it runs in time O(Max)”
Actually O(Max log(Max)) due to sorting. But if you implement all the algorithms in a high-level language, I’m guessing mine will run the fastest, due to the lack of explicit loops. For me, the time and effort I save by writing programs in a high-level language is more important than the speed I can get from the computer by working in a lower-level language (in most cases, with some rare exceptions).
Also, what if you don’t want the final list in sorted order? What if the task is to shuffle the numbers in the list [0,Max]? (This is a typical task for my work.) Then your N/2 select/drop inversion trick is not applicable.
@Peter Turney
As for focusing on the performance one gets using Perl… see my post “The language interpreters are the new machines”: http://lemire.me/blog/archives/2011/06/14/the-language-interpreters-are-the-new-machines/
Much of the traditional computer science is focused on designing algorithms from basic operations (and this is what I do here with C++ and Java), but this is increasingly less relevant.
But not everything is black and white. For some problems, it is definitively worth it to get a 10x speed-up. Google’s backend could not be written in pure Perl. 😉
Back to the problem at hand…
If you want to shuffle a list, then a Fisher–Yates shuffle is probably best. Such shuffling is part of Java, C++(STL) and Python. I don’t know about Perl but I have read online that you can find a shuffle function in List::Util. So I would argue that in many instances, you shouldn’t code a list shuffling algorithm by hand.
You should check out format preserving encryption.
The basic idea is this: hashing has collisions, but encryption does not, so you can encrypt an int as you increment it to get distinct random numbers. A (solvable) challenge is finding an encryption algorithm that tightly fits the range of numbers you are generating.
http://blog.demofox.org/2013/07/06/fast-lightweight-random-shuffle-functionality-fixed/
A trick I picked up awhile ago was using block ciphers to generate N distinct random numbers.
For example, here’s an example of an (unbalanced) Feistel network which does just this:
uint16_t feistel(uint16_t counter, unsigned period)
{
uint16_t d = sqrt(period);
uint16_t s = period - d*d;
uint16_t q = counter / d;
uint16_t r = counter % s;
for (int i = 0; i < 3; i++)
{
uint16_t nr = r;
uint16_t F = (i * r + q);
r = F % d;
q = nr;
}
return q*d + r;
}
int main() {
for(int i = 1; i <= 16; i++) {
printf("%d %d\n", i, feistel(i, 15));
}
}
You can achieve a perfect permutation for any N (
period
here) by adjusting hownr
aeThese methods are producing permutations, but it’s a vanishingly small set of the possible ones.
A 100 element set has 100! possible permutations, and your algorithm can select no more than 65536 of them.