Daniel Lemire's blog


Almost picking N distinct numbers at random

24 thoughts on “Almost picking N distinct numbers at random”

  1. Travis Downs says:

    I think it’s close, but it’s not the geometric distribution exactly. Consider the case of picking 1 value out of 2 for example.

    You are right that each value is chosen with probability N/R (with R the size of the range), but the probabilities are not independent when considering which value will be picked next.

    1. I think it’s close, but it’s not the geometric distribution exactly. Consider the case of picking 1 value out of 2 for example.

      I think we agree. From my blog post: “My model is not quite correct. For one thing, we do not quite have a geometric distribution”.
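
      To make the difference concrete in Travis’s example (N = 1, R = 2): an exact sampler returns 0 or 1, each with probability 1/2, while a geometric gap with p = 1/2 returns 0 with probability 1/2, returns 1 with probability 1/4, and with probability 1/4 overflows the range and returns nothing at all.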

  2. Travis Downs says:

    Also there is a concern with the failure mode: if you end up picking 999,999 values you stop early with fewer values, but these values are at least somehow “well distributed”. What about the other case, where the fates direct you to pick 1,000,001 values? The algorithm cuts off early and values near the end of the range won’t be chosen. So I think there is a significant bias against values near the end of the range.

    1. So I think there is a significant bias against values near the end of the range.

      Yes. There is going to be a bias. Of course, we could stop well short of the end and finish off the generation using some other means.

      1. Note that the reason there is a bias at the end follows directly from the fact that the geometric distribution is an imperfect approximation.

        1. Travis Downs says:

          Yes and no. Drawing from the wrong distribution in this case leads to a subtle bias throughout the range, and it also leads to not getting exactly the “expected” size most of the time.

          However, just clipping off the array when you get to N is another thing and leads to a different, quite strong type of bias for the end elements.

          I think you should choose the same strategy you do for fewer elements: you can return a 999,999-element array, why not a 1,000,001-element one? There is a symmetry there, and it eliminates one big source of bias. Instead of stopping when i <= N || range_size <= (N - i), just stop when out[i] > range_max (of course, don’t actually assign that last element); see the sketch below.

          Yes, from a C perspective, returning more than N elements is more problematic :).

          Finally once you get your result, you can stretch or shrink it to exactly N by choosing random locations to insert elements and choosing them randomly from the range implied by the insertion position (yes another source of bias, I think, but small and evenly distributed in the sense that every number should have equal likelihood to appear).
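
          A minimal sketch of that symmetric stopping rule (the function name is mine, and it reuses the random_geometric helper from the post’s code; a C caller must still cap the output at its buffer size):

          size_t pick_until_range_end(uint64_t range_min, uint64_t range_max,
                                      double p, uint64_t *out, size_t capacity) {
              uint64_t range_size = range_max - range_min;
              uint64_t previous = random_geometric(p); // offset of the first value
              size_t i = 0;
              while (previous < range_size && i < capacity) {
                  out[i++] = previous + range_min;
                  previous += random_geometric(p) + 1; // gaps keep values distinct
              }
              return i; // may come out below or above the expected N
          }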

          1. I agree with your concerns regarding the tail.

  3. Travis Downs says:

    Part of the problem is that the C qsort starts at a significant disadvantage: it takes its comparator by function pointer and has to make an indirect function call to compare each pair of elements. For fast comparisons like between uint64_t values, that’s a lot of overhead for what boils down to one assembly instruction.
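
    For concreteness, a typical uint64_t comparator looks something like this (an illustrative sketch, not the benchmark’s exact code); qsort calls it through a function pointer for every single comparison:

    static int cmp_uint64(const void *a, const void *b) {
        uint64_t x = *(const uint64_t *)a;
        uint64_t y = *(const uint64_t *)b;
        return (x > y) - (x < y); // plain subtraction could overflow int
    }

    // usage: qsort(values, count, sizeof(uint64_t), cmp_uint64);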

    I sent a PR to add C++’s sort, everything else unchanged, and I find it to be almost 2x as fast as C’s sort (it’s the second reading here):

    Generating 1000000 values in [0, 40000000000), density: 0.002500 %
    timings: 0.171996 s 0.094119 s 0.063858 s
    timings per value: 171.996000 ns 94.119000 ns 63.858000 ns
    actual counts: 999993 999985 999998

    Generating 10000000 values in [0, 40000000000), density: 0.025000 %
    timings: 2.039629 s 1.103942 s 0.666784 s
    timings per value: 203.962900 ns 110.394200 ns 66.678400 ns
    actual counts: 9998779 9998723 9999999

    Generating 100000000 values in [0, 40000000000), density: 0.250000 %
    timings: 22.494848 s 12.375101 s 6.343137 s
    timings per value: 224.948480 ns 123.751010 ns 63.431370 ns
    actual counts: 99875477 99875035 100000000

    On this machine the geometric approach is 3x to 4x as fast as the C sort, which is quite a bit faster (relatively) than your results. Maybe a different compiler…

    1. Thanks Travis.

      (As you may suspect, I was aware that qsort was slower than the C++ sort, but I was writing a pure C benchmark.)

      1. Travis Downs says:

        You can write fast sorts in C too, it’s just that qsort isn’t one of them (unless the comparator time is large). Otherwise the benchmark just shows “why writing a generic function-pointer-taking sort in C is relatively slow”.

        1. My belief is that I used the “textbook” or standard way to sort an array in C. If my belief is wrong, I’d love to be corrected.

          1. Travis Downs says:

            Yes, as far as I know qsort is the “textbook” way to sort in C and a good example of why higher level languages can be faster than “textbook C”.

            1. I am not sure I understand why we can’t easily fix the C qsort. I realize that qsort is compiled, but, surely, it is not impossibly hard engineering to fix this? Can’t the compiler recognize the qsort call and do some inlining? Of course, it would only solve qsort, but so what?

              The compiler already handles something like memcpy to optimize away the function call when needed. How hard could it be to do the same with qsort?

              1. Travis Downs says:

                I don’t think it’s that hard, but it just doesn’t matter a lot.

                If libc was shipped with some LTO payload (i.e., intermediate representation code), then you could do it with LTO: the comparison function could be inlined in the final compile step.

                That’s probably never going to happen because compiler vendors want to keep their LTO info version-specific and be allowed to change it at any point.

                The real point is that it just doesn’t matter: you can easily write your own sort method, or copy/paste the source of qsort or any other good sort you find on the internet, and have your comparator inlined.

                So in that sense it is not the same as memcpy: that’s about compiling memcpy into the calling code with information only the compiler has about alignment and copy length; in many cases you have no practical alternative to the compiler doing a good job there.

                You have a practical alternative to qsort: just don’t use the copy of the sort that has been compiled ahead of time and stuck in libc. C++ doesn’t have this problem because sort is in a header, and JIT’d languages don’t have this problem because inlining happens at runtime.

                1. Travis Downs says:

                  Another solution for the primitive-valued-array case, implementable in the library only, would be if the standard library writers provided standard “compare int” type methods, and then qsort could check at runtime if a standard compare method was being used and then dispatch to the specialized version for that primitive.

                  Or it could inspect the machine code of the provided compare method and decide whether it was equivalent to the usual comparison – but now we’re just getting crazy.
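
                  A hypothetical sketch of the first idea (every name here is invented; no real libc exposes such comparators):

                  // hypothetical canonical comparator exported by the library
                  int libc_cmp_int(const void *a, const void *b);

                  void qsort_checked(void *base, size_t n, size_t size,
                                     int (*cmp)(const void *, const void *)) {
                      if (cmp == libc_cmp_int && size == sizeof(int)) {
                          sort_int_fast(base, n); // hypothetical specialized sort, compare inlined
                      } else {
                          qsort(base, n, size, cmp); // generic fallback
                      }
                  }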

    2. Travis Downs says:

      As it turns out, you can do much better than std::sort also. I used a radix sort which was several times faster than std::sort and led to the sort-based method for generating unique numbers being significantly faster than the geometric distribution method (although of course there may be a lot of optimization to be done on that one).

      I wrote more about radix sorts specifically over here.
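
      For readers who have not seen one, a minimal LSD radix sort for uint64_t looks roughly like this (a sketch of the general technique, 8 bits per pass, not the tuned implementation discussed above):

      #include <stdint.h>
      #include <stdlib.h> // malloc, free
      #include <string.h> // memcpy

      void radix_sort_u64(uint64_t *a, size_t n) {
          uint64_t *buf = malloc(n * sizeof *a);
          if (buf == NULL) return;
          for (int shift = 0; shift < 64; shift += 8) {
              size_t count[257] = {0}; // histogram of the current byte
              for (size_t i = 0; i < n; i++)
                  count[((a[i] >> shift) & 0xff) + 1]++;
              for (int b = 0; b < 256; b++) // prefix sums give start offsets
                  count[b + 1] += count[b];
              for (size_t i = 0; i < n; i++) // stable scatter by current byte
                  buf[count[(a[i] >> shift) & 0xff]++] = a[i];
              memcpy(a, buf, n * sizeof *a);
          }
          free(buf);
      }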

      1. Thanks for sharing the link!

  4. Thomas Müller Graf says:

    I had a similar problem and implemented the following in my minperf project, RandomSetGenerator.java. It only needs about 4-5 ns/entry.

    Instead of a random number generator, I used a seeded hashing / mixing algorithm, similar to mix64, for a counter. It is guaranteed to return unique numbers without having to check for uniqueness.

    If no sorting at all is needed, then you can just use the hashing algorithm. This is also used in fastfilter_cpp, random.h, GenerateRandom64Fast, for 64 bit numbers. But mix functions can be written for any other number of bits (I wrote hash48, hash44, and hash32). If the range is not a power of 2, one might want to discard numbers outside of the range, and/or recursively split the range.

    In my case, “some” sorting is needed (blocks with ascending ranges). For this, I recursively split the range. That will guarantee you get the right number of entries in total. Within each range, the “no sorting” approach from above is used.

    Splitting a range: I have used the normal distribution to calculate the number of expected entries below the middle (RandomSetGenerator.randomHalf). It is somewhat slow, but if the ranges are large this isn’t a problem.
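
    For flavor, here is the well-known Murmur3-style 64-bit finalizer, one way to realize the counter-mixing idea above (the seeding is illustrative): every step is invertible, so the function is a bijection on 64-bit values and distinct counters are guaranteed to produce distinct outputs.

    #include <stdint.h>

    // Murmur3 fmix64: xor-shifts and odd multiplies are invertible,
    // so this is a bijection on uint64_t.
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 33;
        x *= UINT64_C(0xff51afd7ed558ccd);
        x ^= x >> 33;
        x *= UINT64_C(0xc4ceb9fe1a85ec53);
        x ^= x >> 33;
        return x;
    }

    // usage: out[i] = mix64(seed + i); // unique for distinct i under a fixed seed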

    1. This is a great comment Thomas.

  5. seebs says:

    I wanted to do a thing similar to this at one point, stumbled across algorithms for generating permutations without having to generate the whole set, and realized that the first N values from a random permutation of M values are very much like a set of N distinct values from 0 to M… And if M is a power of 2, there are established methods for using a block cipher like AES-128 to build a block cipher for fewer than 128 bits. Then you just encode values 0, 1, 2, … and get out distinct values in that range. There’s some overhead, but the advantage is that you don’t have to store them or sort them or anything to avoid duplicates.
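
    A minimal sketch of that idea with a toy Feistel network in place of AES (all constants and names are illustrative, and four weak rounds are nowhere near cryptographic strength):

    #include <stdint.h>

    #define HALF_BITS 16 /* permutation over 2^(2*16) = 2^32 values */
    #define HALF_MASK ((1u << HALF_BITS) - 1)

    static uint32_t round_fn(uint32_t r, uint32_t key) {
        uint32_t x = r * 0x9e3779b1u + key; // any decent mixer works here
        x ^= x >> 15;
        x *= 0x85ebca6bu;
        x ^= x >> 13;
        return x;
    }

    // Four Feistel rounds form a bijection on [0, 2^(2*HALF_BITS)).
    static uint32_t feistel_permute(uint32_t v, const uint32_t key[4]) {
        uint32_t l = v >> HALF_BITS, r = v & HALF_MASK;
        for (int i = 0; i < 4; i++) {
            uint32_t t = r;
            r = l ^ (round_fn(r, key[i]) & HALF_MASK);
            l = t;
        }
        return (l << HALF_BITS) | r;
    }

    // Cycle-walking shrinks the domain to [0, range) while remaining a
    // bijection: keep permuting until the value lands back inside the range.
    static uint32_t permute_in_range(uint32_t v, uint32_t range,
                                     const uint32_t key[4]) {
        do { v = feistel_permute(v, key); } while (v >= range);
        return v;
    }

    Feeding 0, 1, 2, …, N-1 through permute_in_range then yields N distinct values in [0, range) without storing or sorting anything.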

  6. sh1 says:

    Others have already noted the format-preserving encryption method. I built a table of mix function parameters for bits 8..128, so that it’s possible to create a lightweight mix function with a range less than double the required range; the rejection overhead then isn’t so bad. However, the real challenge is getting a function with decent coverage of the possible set of output permutations.

    I recently used the sort method in a component that did need monotonic output; but then I added an optional post-shuffle for completeness. In that case I supported an arbitrary minimum distance, and to overcome the overflow problem I reduced the random range to R - (N - 1) * d, then sorted, then added i * d to each result[i]. I believe this is correct because any acceptable result cannot have more than one entry of value R - 1, etc., so the nth-to-last entry cannot be higher than R - n.
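
    Concretely, the stretch step is just the following (a sketch, assuming out[0..N-1] already holds a sorted distinct sample from [0, R - (N - 1) * d)):

    for (size_t i = 0; i < N; i++) {
        out[i] += i * d; // consecutive results now differ by at least d
    }

    The last element stays below R because out[N-1] < R - (N - 1) * d before the shift.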

    I would have to think harder to be confident, but you could probably use a rejection loop when your geometric generator went out of bounds on a per-element basis (rather than rejecting a complete set) for correct results, and you could apply the same arbitrary-minimum-distance tweak at that point.

    Also, “99,999” doesn’t really approximate “100 million” as implied in the original post.

    1. sh1 says:

      Sorry, I should clarify. That table I linked lists parameters that would plug in to @Thomas’ hash44() function, above, to generalise it to the fewest number of bits for whatever range you need so the rejection loop doesn’t spend too much time rejecting out-of-range values.

      That hash construction relates back to Murmur3 and appears all over the place, these days. Somebody described the process of searching for parameters here: http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html

  7. Hi,
    Very interesting and insightful. Thanks for your blog entries!

    There is a related blog post here:
    http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
    I have added his method to your code and it is even faster:

    size_t veryfast_pick_N(uint64_t range_min, uint64_t range_max, size_t N, uint64_t *out) {
        uint64_t range_size = range_max - range_min;
        if (N == 0) return 0;
        if (range_size < N) return 0; // not enough room for N distinct values
        double p = (double)N / (double)range_size; // expected density
        uint64_t previous = random_geometric(p); // offset of the first value
        if (previous >= range_size) return 0;
        out[0] = previous + range_min;
        size_t i = 1;
        for (; i < N; i++) {
            uint64_t newoffset = random_geometric(p);
            previous += newoffset + 1; // strictly increasing, so values stay distinct
            if (previous >= range_size) {
                break;
            }
            out[i] = previous + range_min;
        }
        return i;
    }

    With that, I have an extra question/request: it would be great to be able to get a random sample of bits from a compressed bitmap. It seems to me that this code would be somewhat easy to use in Roaring Bitmap, am I right?
    Something like sample = bm.random_sample(N) that returns a bitmap whose bits are all in bm (bm & sample == sample) and with a cardinality of N.
    Would you consider adding it?

    1. Would you consider adding it?

      Would you be willing to help out?