Daniel Lemire's blog

Sampling efficiently from groups

17 thoughts on “Sampling efficiently from groups”

  1. Melvin says:

    Vose’s Alias method (see https://www.keithschwarz.com/darts-dice-coins/) is another approach. It works in constant time with linear time preprocessing using only simple data structures.
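    For reference, a minimal sketch of the alias method in the static, with-replacement setting (class and method names here are illustrative, not from the linked page):

```java
import java.util.Random;

// Sketch of Vose's alias method: O(K) preprocessing, O(1) per sample,
// sampling WITH replacement. Names are illustrative.
final class AliasSampler {
    private final double[] prob;  // chance of keeping column i
    private final int[] alias;    // fallback outcome for column i
    private final Random rng = new Random();

    AliasSampler(int[] weights) {
        int k = weights.length;
        prob = new double[k];
        alias = new int[k];
        long total = 0;
        for (int w : weights) total += w;
        // scale each weight so the average column has height 1
        double[] scaled = new double[k];
        int[] small = new int[k], large = new int[k];
        int ns = 0, nl = 0;
        for (int i = 0; i < k; i++) {
            scaled[i] = (double) weights[i] * k / total;
            if (scaled[i] < 1.0) small[ns++] = i; else large[nl++] = i;
        }
        // pair each underfull column with an overfull one
        while (ns > 0 && nl > 0) {
            int s = small[--ns], l = large[--nl];
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] += scaled[s] - 1.0; // l donates 1 - scaled[s] to s
            if (scaled[l] < 1.0) small[ns++] = l; else large[nl++] = l;
        }
        while (nl > 0) prob[large[--nl]] = 1.0;
        while (ns > 0) prob[small[--ns]] = 1.0;
    }

    int sample() {
        int i = rng.nextInt(prob.length); // pick a column uniformly
        return rng.nextDouble() < prob[i] ? i : alias[i];
    }
}
```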

    1. As you may notice in my blog post, I remove the students from the classrooms (without replacement). Do you know how to apply the Alias method to this scenario?

      My expectation is that the Alias method is not applicable to the scenario I describe.

      1. Melvin says:

        Apologies for the confusion, I did indeed miss out the fact that you are sampling without replacement. The dynamic case is an interesting one, thanks for introducing it.

        1. I was aware of the Alias method: I should have included it in the post to avoid confusion.

  2. hernan says:

    Thanks as usual for the informative blog post! A very minor point, but I believe you can remove lines 34 and 35 from your sample Java program:

    int l = 0;
    while((1<<l) < histo.length) { l++; }

    1. I fixed it.

  3. This problem sounds related to adaptive arithmetic coding, and the newer asymmetric numeral systems (ANS).

  4. Todd Lehman says:

    Perhaps I am missing something fundamental in the problem description, but it seems to me that (a) repeatedly selecting students at random from classrooms, each of whose probability of being chosen is proportional to the number of students remaining in it, is identical to (b) simply making a list of all students, regardless of classroom, and shuffling that list randomly.

    That is, it looks like the probabilities have been carefully defined so that each student always has a 1/k chance of being chosen, where k is the total number of unchosen students at each iteration. How is that different from a straight shuffle or permutation operation?

    1. “simply making a list of all students, regardless of classroom, and shuffling that list randomly” => This requires N units of storage.

      “How is that different from a straight shuffle or permutation operation?” => You are not allowed to do memory allocation beyond the maintenance of the histogram.
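      To make the constraint concrete, drawing one student using only the K-entry histogram looks like this naive O(K)-per-draw sketch (names are illustrative):

```java
import java.util.Random;

// Naive baseline sketch: sample one student without replacement using
// only the K-entry histogram, no per-student storage. O(K) per draw,
// hence O(N K) time to empty all the rooms. Names are illustrative.
final class HistogramSampler {
    static int drawOne(int[] histo, int remaining, Random rng) {
        int y = rng.nextInt(remaining); // uniform over unchosen students
        int room = 0;
        while (y >= histo[room]) {      // linear scan of the histogram
            y -= histo[room];
            room++;
        }
        histo[room]--;                  // remove the chosen student
        return room;
    }
}
```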

      1. Todd Lehman says:

        Aha! I see. So the classrooms must be treated as black boxes. That is, we don’t know the names or identities of the students in the rooms until after they are chosen; at the outset, we only know the number of students in each room, or the cardinality of each set we’re drawing from. It is then up to the teacher (or some agent of the room) to choose a student at random when we choose a room.

        So the goal is to use at most O(K) space, where K is the number of classrooms?

        And the solution presented in the post runs in O(N log K) time, and we seek to know whether an alternative solution runs in less time or if this is already optimal?

  5. “And the solution presented in the post runs in O(N log K) time, and we seek to know whether an alternative solution runs in less time or if this is already optimal?”

    We know that if you are given a bit more storage, you can do better than O(N log K) time. It seems that O(N) is entirely within reach, but with extra storage. I’d be interested to know whether one can do better using only the K units of storage (please don’t turn it into O(K) units).
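    For reference, a sampler that spends O(log K) per draw while using only K + 1 units of storage can be built from a Fenwick (binary indexed) tree over the histogram. This is an illustrative sketch, not the code from the post (names are mine):

```java
// Sketch: sampling without replacement in O(log K) per draw, using a
// Fenwick (binary indexed) tree of K + 1 entries that replaces the
// histogram. Names are illustrative.
final class FenwickSampler {
    // Build the tree in O(K); tree[i] holds the sum of a block of
    // histogram entries, with 1-based indexing (tree[0] is unused).
    static int[] build(int[] histo) {
        int[] tree = new int[histo.length + 1];
        for (int i = 0; i < histo.length; i++) {
            tree[i + 1] += histo[i];
            int parent = (i + 1) + ((i + 1) & -(i + 1));
            if (parent <= histo.length) tree[parent] += tree[i + 1];
        }
        return tree;
    }

    // Find the room containing the y-th remaining student (0-based,
    // y must be less than the current total) and remove that student.
    static int drawOne(int[] tree, int y) {
        int pos = 0;
        int mask = Integer.highestOneBit(tree.length - 1);
        while (mask != 0) {             // descend the implicit tree
            int next = pos + mask;
            if (next < tree.length && tree[next] <= y) {
                y -= tree[next];
                pos = next;
            }
            mask >>= 1;
        }
        // pos is the 0-based room index; decrement the sums on its path
        for (int i = pos + 1; i < tree.length; i += i & -i) tree[i]--;
        return pos;
    }
}
```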

    1. Todd Lehman says:

      Here is a thought: If we let S[k] be the number of students in classroom k (1 ≤ k ≤ K), then depending on the shape of the curve traced by the tuples (k,S[k]), it may be possible to model the curve in such a way that S[k] can be roughly predicted from k, and vice-versa.

      That is, given a random number n (1 ≤ n ≤ N), we can compute two boundaries k₁ and k₂ within which to bracket a binary search for the exact k at which the cumulative sum of S[k] terms reaches the value n.

      Ultimately, however, this is still a binary search for the ordinal number of classroom implied by n—and thus still O(N log K)—and is probably not likely to be faster in practice, unless the shape of the curve of classroom sizes is ideally modelable.

      1. Actually, Todd, I think that this may well work in practice.

        https://en.wikipedia.org/wiki/Interpolation_search

        I am just not quite sure how to adapt it.

        Do you have, by any chance, workable pseudocode?
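        For the static lookup alone, interpolation search over a prefix-sum array might look like the following sketch (names are mine; how to maintain the prefix sums cheaply under removals remains the open question):

```java
// Sketch: interpolation search for the classroom containing the y-th
// student, over a prefix-sum array run where run[k] = S[0] + ... + S[k-1]
// and run[K] = N. Static version only; names are illustrative.
final class InterpolationSearch {
    // Returns k such that run[k] <= y < run[k + 1], for 0 <= y < run[K].
    static int findRoom(int[] run, int y) {
        int lo = 0, hi = run.length - 2; // candidate rooms
        while (lo < hi) {
            // guess the room by assuming sums grow linearly with k
            long span = (long) run[hi + 1] - run[lo];
            int guess = lo + (int) (((long) (y - run[lo]) * (hi - lo)) / span);
            if (run[guess + 1] <= y) lo = guess + 1;      // overshot left
            else if (run[guess] > y) hi = guess - 1;      // overshot right
            else return guess;
        }
        return lo;
    }
}
```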

  6. randomPoster says:

    Your benchmark is a bit misleading: if you want all the students, just put all of them in the output and shuffle.

    If you want less than a fraction of the students (let’s say 3 out of 4), and the distribution is not too skewed, a simple linear search with rejection sampling would work well in practice.

    public static int rejectRandomSample(int[] histo, int[] output, int outputSize) {
      int[] currentHisto = Arrays.copyOf(histo, histo.length);
      int[] runningHisto = new int[histo.length + 1];
      System.arraycopy(histo, 0, runningHisto, 1, histo.length);
      for (int i = 2; i <= histo.length; i++)
        runningHisto[i] += runningHisto[i - 1];

      int sum = runningHisto[histo.length];
      // the +1 keeps the initial guess in bounds and avoids a division by zero
      int averageSize = sum / histo.length + 1;
      for (int pos = 0; pos < outputSize; ++pos) {
        int i = -1;
        while (true) {
          int y = r.nextInt(sum); // r is a java.util.Random instance
          i = y / averageSize; // hope the data is not too skewed
          while (runningHisto[i] > y) i -= 1;
          while (runningHisto[i + 1] <= y) i += 1;
          if (currentHisto[i] > y - runningHisto[i]) {
            currentHisto[i] -= 1;
            break;
          }
        }
        output[pos] = i;
      }
      return outputSize;
    }

    I obtain the following benchmarks on my machine:

    N=1024, M=100000
    outputSize = s / 4
    naive 260.04551556738215
    smarter 74.68069439333325
    reject 40.423464682798844
    outputSize = 3 * s / 4
    naive 258.509821210682
    smarter 76.47785267880008
    reject 65.39425675586646

    1. You may want all students but may not be able, at any one time, to allocate memory. Thanks for your code.

      1. randomPoster says:

        To avoid the penalty when currentSum is low, you can also recompute the runningHisto regularly. Something like:

        public static int rejectRandomSample(int[] histo, int[] output, int outputSize) {
          int[] currentHisto = Arrays.copyOf(histo, histo.length);
          int[] runningHisto = new int[histo.length + 1];
          int sum = 1;
          int currSum = 0;
          int averageSize = 0;
          for (int pos = 0; pos < outputSize; ++pos) {
            if (sum * .95 > currSum) { // rebuild the running sums once 5% is gone
              System.arraycopy(currentHisto, 0, runningHisto, 1, histo.length);
              for (int i = 2; i <= histo.length; i++)
                runningHisto[i] += runningHisto[i - 1];
              sum = runningHisto[histo.length];
              currSum = sum;
              averageSize = (sum / histo.length) + 1;
            }
            int i = -1;
            while (true) {
              int y = r.nextInt(sum); // r is a java.util.Random instance
              i = y / averageSize; // hope the data is not too skewed
              while (runningHisto[i] > y) i -= 1;
              while (runningHisto[i + 1] <= y) i += 1;
              if (currentHisto[i] > y - runningHisto[i]) {
                currentHisto[i] -= 1;
                break;
              }
            }
            output[pos] = i;
            currSum -= 1;
          }
          return outputSize;
        }

        gives me

        naive 250.25992953739302
        smarter 74.97430178092274
        reject 42.24443463267574

        1. randomPoster says:

          Just for the record, the last timings are for “outputSize = s”, i.e., sampling all the students.