Daniel Lemire's blog

Sampling efficiently from groups

17 thoughts on “Sampling efficiently from groups”

  1. Melvin says:

    Vose’s Alias method (see https://www.keithschwarz.com/darts-dice-coins/) is another approach. It works in constant time with linear time preprocessing using only simple data structures.
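    For reference, a minimal sketch of the alias method in the static, with-replacement setting (class and method names here are illustrative, not from the linked page):

```java
import java.util.Random;

// Sketch of Vose's alias method: O(K) preprocessing, O(1) per sample,
// sampling WITH replacement. Names are illustrative.
final class AliasSampler {
    private final double[] prob;  // chance of keeping column i
    private final int[] alias;    // fallback outcome for column i
    private final Random rng = new Random();

    AliasSampler(int[] weights) {
        int k = weights.length;
        prob = new double[k];
        alias = new int[k];
        long total = 0;
        for (int w : weights) total += w;
        // scale each weight so the average column has height 1
        double[] scaled = new double[k];
        int[] small = new int[k], large = new int[k];
        int ns = 0, nl = 0;
        for (int i = 0; i < k; i++) {
            scaled[i] = (double) weights[i] * k / total;
            if (scaled[i] < 1.0) small[ns++] = i; else large[nl++] = i;
        }
        // pair each underfull column with an overfull one
        while (ns > 0 && nl > 0) {
            int s = small[--ns], l = large[--nl];
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] += scaled[s] - 1.0; // l donates 1 - scaled[s] to s
            if (scaled[l] < 1.0) small[ns++] = l; else large[nl++] = l;
        }
        while (nl > 0) prob[large[--nl]] = 1.0;
        while (ns > 0) prob[small[--ns]] = 1.0;
    }

    int sample() {
        int i = rng.nextInt(prob.length); // pick a column uniformly
        return rng.nextDouble() < prob[i] ? i : alias[i];
    }
}
```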

    1. As you may notice in my blog post, I remove the students from the classrooms (without replacement). Do you know how to apply the Alias method to this scenario?

      My expectation is that the Alias method is not applicable to the scenario I describe.

      1. Melvin says:

        Apologies for the confusion, I did indeed miss out the fact that you are sampling without replacement. The dynamic case is an interesting one, thanks for introducing it.

        1. I was aware of the Alias method: I should have included it in the post to avoid confusion.

  2. hernan says:

    Thanks as usual for the informative blog post! A very minor point, but I believe you can remove lines 34 and 35 from your sample Java program:

    int l = 0;
    while((1<<l) < histo.length) { l++; }

    1. I fixed it.

  3. This problem sounds related to adaptive arithmetic coding, and the newer asymmetric numeral systems (ANS).

  4. Todd Lehman says:

    Perhaps I am missing something fundamental in the problem description, but it seems to me that (a) repeatedly selecting students at random from classrooms, each of whose probability of being chosen is proportional to the number of students remaining in it, is identical to (b) simply making a list of all students, regardless of classroom, and shuffling that list randomly.

    That is, it looks like the probabilities have been carefully defined so that each student always has a 1/k chance of being chosen, where k is the total number of unchosen students at each iteration. How is that different from a straight shuffle or permutation operation?

    1. “simply making a list of all students, regardless of classroom, and shuffling that list randomly” => This requires N units of storage.

      “How is that different from a straight shuffle or permutation operation?” => You are not allowed to do memory allocation beyond the maintenance of the histogram.
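      To make the constraint concrete, drawing one student using only the K-entry histogram looks like this naive O(K)-per-draw sketch (names are illustrative):

```java
import java.util.Random;

// Naive baseline sketch: sample one student without replacement using
// only the K-entry histogram, no per-student storage. O(K) per draw,
// hence O(N K) time to empty all the rooms. Names are illustrative.
final class HistogramSampler {
    static int drawOne(int[] histo, int remaining, Random rng) {
        int y = rng.nextInt(remaining); // uniform over unchosen students
        int room = 0;
        while (y >= histo[room]) {      // linear scan of the histogram
            y -= histo[room];
            room++;
        }
        histo[room]--;                  // remove the chosen student
        return room;
    }
}
```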

      1. Todd Lehman says:

        Aha! I see. So the classrooms must be treated as black boxes. That is, we don’t know the names or identities of the students in the rooms until after they are chosen; at the outset, we only know the number of students in each room, or the cardinality of each set we’re drawing from. It is then up to the teacher (or some agent of the room) to choose a student at random when we choose a room.

        So the goal is to use at most O(K) space, where K is the number of classrooms?

        And the solution presented in the post runs in O(N log K) time, and we seek to know whether an alternative solution runs in less time or if this is already optimal?

  5. “And the solution presented in the post runs in O(N log K) time, and we seek to know whether an alternative solution runs in less time or if this is already optimal?”

    We know that if you are given a bit more storage, you can do better than O(N log K) time. It seems that O(N) is entirely within reach, but with extra storage. I’d be interested to know whether one can do better using only the K units of storage (please don’t turn it into O(K) units).
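    For reference, a sampler that spends O(log K) per draw while using only K + 1 units of storage can be built from a Fenwick (binary indexed) tree over the histogram. This is an illustrative sketch, not the code from the post (names are mine):

```java
// Sketch: sampling without replacement in O(log K) per draw, using a
// Fenwick (binary indexed) tree of K + 1 entries that replaces the
// histogram. Names are illustrative.
final class FenwickSampler {
    // Build the tree in O(K); tree[i] holds the sum of a block of
    // histogram entries, with 1-based indexing (tree[0] is unused).
    static int[] build(int[] histo) {
        int[] tree = new int[histo.length + 1];
        for (int i = 0; i < histo.length; i++) {
            tree[i + 1] += histo[i];
            int parent = (i + 1) + ((i + 1) & -(i + 1));
            if (parent <= histo.length) tree[parent] += tree[i + 1];
        }
        return tree;
    }

    // Find the room containing the y-th remaining student (0-based,
    // y must be less than the current total) and remove that student.
    static int drawOne(int[] tree, int y) {
        int pos = 0;
        int mask = Integer.highestOneBit(tree.length - 1);
        while (mask != 0) {             // descend the implicit tree
            int next = pos + mask;
            if (next < tree.length && tree[next] <= y) {
                y -= tree[next];
                pos = next;
            }
            mask >>= 1;
        }
        // pos is the 0-based room index; decrement the sums on its path
        for (int i = pos + 1; i < tree.length; i += i & -i) tree[i]--;
        return pos;
    }
}
```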

    1. Todd Lehman says:

      Here is a thought: If we let S[k] be the number of students in classroom k (1 ≤ k ≤ K), then depending on the shape of the curve traced by the tuples (k,S[k]), it may be possible to model the curve in such a way that S[k] can be roughly predicted from k, and vice-versa.

      That is, given a random number n (1 ≤ n ≤ N), we can compute two boundaries k₁ and k₂ within which to bracket a binary search for the exact k at which the cumulative sum of S[k] terms reaches the value n.

      Ultimately, however, this is still a binary search for the ordinal number of classroom implied by n—and thus still O(N log K)—and is probably not likely to be faster in practice, unless the shape of the curve of classroom sizes is ideally modelable.

      1. Actually, Todd, I think that this may well work in practice.

        https://en.wikipedia.org/wiki/Interpolation_search

        I am just not quite sure how to adapt it.

        Do you have, by any chance, workable pseudocode?
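        For the static lookup alone, interpolation search over a prefix-sum array might look like the following sketch (names are mine; how to maintain the prefix sums cheaply under removals remains the open question):

```java
// Sketch: interpolation search for the classroom containing the y-th
// student, over a prefix-sum array run where run[k] = S[0] + ... + S[k-1]
// and run[K] = N. Static version only; names are illustrative.
final class InterpolationSearch {
    // Returns k such that run[k] <= y < run[k + 1], for 0 <= y < run[K].
    static int findRoom(int[] run, int y) {
        int lo = 0, hi = run.length - 2; // candidate rooms
        while (lo < hi) {
            // guess the room by assuming sums grow linearly with k
            long span = (long) run[hi + 1] - run[lo];
            int guess = lo + (int) (((long) (y - run[lo]) * (hi - lo)) / span);
            if (run[guess + 1] <= y) lo = guess + 1;      // overshot left
            else if (run[guess] > y) hi = guess - 1;      // overshot right
            else return guess;
        }
        return lo;
    }
}
```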

  6. randomPoster says:

    Your benchmark is a bit misleading: if you want all the students, just put all of them in the output and shuffle.

    If you want less than a fraction of the students (let’s say 3 out of 4), and the distribution is not too skewed, a simple linear search with rejection sampling would work well in practice.

    public static int rejectRandomSample(int[] histo, int[] output, int outputSize) {
      int[] currentHisto = Arrays.copyOf(histo, histo.length);
      int[] runningHisto = new int[histo.length + 1];
      System.arraycopy(histo, 0, runningHisto, 1, histo.length);
      for (int i = 2; i <= histo.length; i++)
        runningHisto[i] += runningHisto[i - 1];

      int sum = runningHisto[histo.length];
      // the +1 keeps the initial guess in bounds and avoids a division by zero
      int averageSize = sum / histo.length + 1;
      for (int pos = 0; pos < outputSize; ++pos) {
        int i = -1;
        while (true) {
          int y = r.nextInt(sum); // r is a java.util.Random instance
          i = y / averageSize; // hope the data is not too skewed
          while (runningHisto[i] > y) i -= 1;
          while (runningHisto[i + 1] <= y) i += 1;
          if (currentHisto[i] > y - runningHisto[i]) {
            currentHisto[i] -= 1;
            break;
          }
        }
        output[pos] = i;
      }
      return outputSize;
    }

    I obtain the following benchmarks on my machine:

    N=1024, M=100000
    outputSize = s / 4
    naive 260.04551556738215
    smarter 74.68069439333325
    reject 40.423464682798844
    outputSize = 3 * s / 4
    naive 258.509821210682
    smarter 76.47785267880008
    reject 65.39425675586646

    1. You may want all students but may not be able, at any one time, to allocate memory. Thanks for your code.

      1. randomPoster says:

        To avoid the penalty when currentSum is low, you can also recompute the runningHisto regularly. Something like:

        public static int rejectRandomSample(int[] histo, int[] output, int outputSize) {
          int[] currentHisto = Arrays.copyOf(histo, histo.length);
          int[] runningHisto = new int[histo.length + 1];
          int sum = 1;
          int currSum = 0;
          int averageSize = 0;
          for (int pos = 0; pos < outputSize; ++pos) {
            if (sum * .95 > currSum) { // rebuild the running sums once 5% is gone
              System.arraycopy(currentHisto, 0, runningHisto, 1, histo.length);
              for (int i = 2; i <= histo.length; i++)
                runningHisto[i] += runningHisto[i - 1];
              sum = runningHisto[histo.length];
              currSum = sum;
              averageSize = (sum / histo.length) + 1;
            }
            int i = -1;
            while (true) {
              int y = r.nextInt(sum); // r is a java.util.Random instance
              i = y / averageSize; // hope the data is not too skewed
              while (runningHisto[i] > y) i -= 1;
              while (runningHisto[i + 1] <= y) i += 1;
              if (currentHisto[i] > y - runningHisto[i]) {
                currentHisto[i] -= 1;
                break;
              }
            }
            output[pos] = i;
            currSum -= 1;
          }
          return outputSize;
        }

        gives me

        naive 250.25992953739302
        smarter 74.97430178092274
        reject 42.24443463267574

        1. randomPoster says:

          Just for the record, the last timings are for “outputSize = s”, i.e., sampling all the students.