Vose’s Alias method (see https://www.keithschwarz.com/darts-dice-coins/) is another approach. It answers each query in constant time after linear-time preprocessing, using only simple data structures.
As you may notice in my blog post, I remove the students from the classrooms (without replacement). Do you know how to apply the Alias method to this scenario?
My expectation is that the Alias method is not applicable to the scenario I describe.
Melvin says:
Apologies for the confusion, I did indeed miss the fact that you are sampling without replacement. The dynamic case is an interesting one, thanks for introducing it.
I was aware of the Alias method: I should have included it in the post to avoid confusion.
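For readers who have not seen it, here is a minimal sketch of Vose’s construction and the O(1) draw, in Java; the field and method names are illustrative, not from the post. Note that each draw leaves the weights untouched, which is exactly the with-replacement limitation under discussion:

static int[] aliasOf;      // alias table: the "other" room stored in each slot
static double[] keepProb;  // probability that a slot keeps its own room

static void buildAlias(int[] histo) {
    int K = histo.length;
    aliasOf = new int[K];
    keepProb = new double[K];
    double sum = 0;
    for (int w : histo) sum += w;
    double[] scaled = new double[K];            // weights rescaled so the average is 1
    int[] small = new int[K], large = new int[K];
    int ns = 0, nl = 0;
    for (int i = 0; i < K; i++) {
        scaled[i] = histo[i] * (double) K / sum;
        if (scaled[i] < 1) small[ns++] = i; else large[nl++] = i;
    }
    while (ns > 0 && nl > 0) {
        int s = small[--ns], l = large[--nl];
        keepProb[s] = scaled[s];                // slot s keeps room s with this probability...
        aliasOf[s] = l;                         // ...and falls through to room l otherwise
        scaled[l] -= 1 - scaled[s];             // room l has paid for s's deficit
        if (scaled[l] < 1) small[ns++] = l; else large[nl++] = l;
    }
    while (nl > 0) keepProb[large[--nl]] = 1;
    while (ns > 0) keepProb[small[--ns]] = 1;   // numerical leftovers
}

static int drawOnce(java.util.Random r) {       // O(1) per draw, with replacement
    int i = r.nextInt(keepProb.length);
    return r.nextDouble() < keepProb[i] ? i : aliasOf[i];
}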
Thanks as usual for the informative blog post! Very minor point, but I believe you can remove these two lines (34 and 35) from your sample Java program here:
int l = 0;
while ((1 << l) < histo.length) { l++; }  // computes the smallest l with 2^l >= histo.length
I fixed it.
This problem sounds related to adaptive arithmetic coding, and the newer asymmetric numeral systems (ANS).
Perhaps I am missing something fundamental in the problem description, but it seems to me that (a) repeatedly selecting students at random from classrooms, each of whose probability of being chosen is proportional to the number of students remaining in it, is identical to (b) simply making a list of all students, regardless of classroom, and shuffling that list randomly.
That is, it looks like the probabilities have been carefully defined so that each student always has a 1/k chance of being chosen, where k is the total number of unchosen students at each iteration. How is that different from a straight shuffle or permutation operation?
“simply making a list of all students, regardless of classroom, and shuffling that list randomly” => This requires N units of storage.
“How is that different from a straight shuffle or permutation operation?” => You are not allowed to do memory allocation beyond the maintenance of the histogram.
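To make the constraint concrete, here is a minimal sketch (mine, not from the post) of what “maintenance of the histogram” permits: draw a uniform integer among the remaining students, walk the histogram to locate the room it falls in, and decrement that room. It needs no storage beyond the histogram itself, but each draw costs O(K) time, which is the slow baseline being improved upon:

static int drawRoom(int[] histo, int remaining, java.util.Random r) {
    int y = r.nextInt(remaining);   // pick one of the remaining students uniformly
    int i = 0;
    while (y >= histo[i]) {         // walk the rooms until y falls inside one
        y -= histo[i];
        i++;
    }
    histo[i]--;                     // that student leaves the room
    return i;
}

The caller decrements its own count of remaining students after each draw.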
Todd Lehman says:
Aha! I see. So the classrooms must be treated as black boxes. That is, we don’t know the names or identities of the students in the rooms until after they are chosen; at the outset, we only know the number of students in each room, or the cardinality of each set we’re drawing from. It is then up to the teacher (or some agent of the room) to choose a student at random when we choose a room.
So the goal is to use at most O(K) space, where K is the number of classrooms?
And the solution presented in the post runs in O(N log K) time, and we seek to know whether an alternative solution runs in less time or if this is already optimal?
We know that if you are given a bit more storage, you can do better than O(N log K) time. It seems that O(N) is entirely within reach, but with extra storage. I’d be interested to know whether one can do better using only the K units of storage (please don’t turn it into O(K) units).
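For contrast, here is a sketch of the trade in the other direction, spending N units of extra storage to get O(N) total time; this is just the “make a list of all students and shuffle” idea from above, written out:

static int[] expandAndShuffle(int[] histo, java.util.Random r) {
    int n = 0;
    for (int c : histo) n += c;
    int[] labels = new int[n];                     // one slot per student: its room index
    for (int room = 0, pos = 0; room < histo.length; room++)
        for (int c = 0; c < histo[room]; c++) labels[pos++] = room;
    for (int i = n - 1; i > 0; i--) {              // Fisher-Yates shuffle
        int j = r.nextInt(i + 1);
        int t = labels[i]; labels[i] = labels[j]; labels[j] = t;
    }
    return labels;                                 // the draws, without replacement, in order
}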
Todd Lehman says:
Here is a thought: If we let S[k] be the number of students in classroom k (1 ≤ k ≤ K), then depending on the shape of the curve traced by the tuples (k, S[k]), it may be possible to model the curve in such a way that S[k] can be roughly predicted from k, and vice versa.
That is, given a random number n (1 ≤ n ≤ N), we can compute two boundaries k₁ and k₂ between which to bracket a binary search for the exact k at which the cumulative sum of S[k] terms reaches the value n.
Ultimately, however, this is still a binary search for the ordinal number of the classroom implied by n (and thus still O(N log K) overall), and is not likely to be faster in practice, unless the shape of the curve of classroom sizes is ideally modelable.
Actually, Todd, I think that this may well work in practice.
https://en.wikipedia.org/wiki/Interpolation_search
I am just not quite sure how to adapt it.
Do you have, by any chance, workable pseudocode?
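Not workable pseudocode for the dynamic case, but here is a sketch of the static lookup step Todd describes, with illustrative names: given a draw y, guess a bracket by linear interpolation on the cumulative sums and narrow it until the room is pinned down. On near-uniform room sizes interpolation search averages O(log log K) probes, but it can degenerate on skewed data:

// runningHisto is nondecreasing, runningHisto[0] = 0, runningHisto[K] = total.
// Returns i such that runningHisto[i] <= y < runningHisto[i+1].
static int interpolationFind(int[] runningHisto, int y) {
    int lo = 0, hi = runningHisto.length - 1;  // invariant: runningHisto[lo] <= y < runningHisto[hi]
    while (hi - lo > 1) {
        long span = (long) runningHisto[hi] - runningHisto[lo];
        int mid = lo + (int) ((long) (y - runningHisto[lo]) * (hi - lo) / span);
        if (mid <= lo) mid = lo + 1;           // force progress
        if (mid >= hi) mid = hi - 1;
        if (runningHisto[mid] <= y) lo = mid; else hi = mid;
    }
    return lo;
}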
Your benchmark is a bit misleading: if you want all the students, just put all of them in the output and shuffle.
If you want only a fraction of the students (say, 3 out of 4), and the distribution is not too skewed, a simple linear search with rejection sampling would work well in practice.
public static int rejectRandomSample(int[] histo, int[] output, int outputSize) {
    int[] currentHisto = Arrays.copyOf(histo, histo.length);  // students remaining in each room
    int[] runningHisto = new int[histo.length + 1];           // cumulative sums of the original histogram
    System.arraycopy(histo, 0, runningHisto, 1, histo.length);
    for (int i = 2; i <= histo.length; i++)
        runningHisto[i] += runningHisto[i - 1];
    int sum = runningHisto[histo.length];
    int averageSize = sum / histo.length + 1;                 // +1 keeps this nonzero and the guess in range
    for (int pos = 0; pos < outputSize; ++pos) {
        int i;
        while (true) {
            int y = r.nextInt(sum);                           // r is a java.util.Random instance
            i = y / averageSize;                              // hope the data is not too skewed
            while (runningHisto[i] > y) i -= 1;
            while (runningHisto[i + 1] <= y) i += 1;
            if (currentHisto[i] > y - runningHisto[i]) {      // accept: this student is still present
                currentHisto[i] -= 1;
                break;
            }                                                 // otherwise reject and redraw
        }
        output[pos] = i;
    }
    return outputSize;
}
I obtain the following benchmarks on my machine:
N=1024, M=100000
outputSize = s / 4
naive 260.04551556738215
smarter 74.68069439333325
reject 40.423464682798844
outputSize = 3 * s / 4
naive 258.509821210682
smarter 76.47785267880008
reject 65.39425675586646
You may want all the students but be unable, at any one time, to allocate that much memory. Thanks for your code.
randomPoster says:
To avoid the penalty when currSum is low, you can also recompute runningHisto regularly. Something like:
public static int rejectRandomSample(int[] histo, int[] output, int outputSize) {
    int[] currentHisto = Arrays.copyOf(histo, histo.length);  // students remaining in each room
    int[] runningHisto = new int[histo.length + 1];
    int sum = 1;                                              // forces a rebuild on the first iteration
    int currSum = 0;
    int averageSize = 0;
    for (int pos = 0; pos < outputSize; ++pos) {
        if (sum * .95 > currSum) {                            // enough students gone: rebuild the sums
            System.arraycopy(currentHisto, 0, runningHisto, 1, histo.length);
            for (int i = 2; i <= histo.length; i++)
                runningHisto[i] += runningHisto[i - 1];
            sum = runningHisto[histo.length];
            currSum = sum;
            averageSize = (sum / histo.length) + 1;           // +1 keeps this nonzero and the guess in range
        }
        int i;
        while (true) {
            int y = r.nextInt(sum);                           // r is a java.util.Random instance
            i = y / averageSize;                              // hope the data is not too skewed
            while (runningHisto[i] > y) i -= 1;
            while (runningHisto[i + 1] <= y) i += 1;
            if (currentHisto[i] > y - runningHisto[i]) {      // accept: this student is still present
                currentHisto[i] -= 1;
                break;
            }                                                 // otherwise reject and redraw
        }
        output[pos] = i;
        currSum -= 1;
    }
    return outputSize;
}
This gives me:
naive 250.25992953739302
smarter 74.97430178092274
reject 42.24443463267574
Just for the record, the last timings are for outputSize = s, which means getting all the students.