Daniel Lemire's blog


Nearly Divisionless Random Integer Generation On Various Systems

38 thoughts on “Nearly Divisionless Random Integer Generation On Various Systems”

  1. Jeff Epler says:

    It looks like you may have incorrectly transcribed a value for the ARM table “Floating point approach (biased)”.

    1. Correct. Thanks for pointing it out.

  2. Alex says:

    Hi Daniel,

    I don’t understand,
    uint64_t t = -s % s;

    I interpret that as the remainder when -s is divided by s.

    Tnx.

    1. Think: (2^64-s) mod s
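      To see why this works, here is a minimal sketch (an editorial illustration, not part of the original reply): unsigned 64-bit negation wraps around modulo 2^64, so -s is the value 2^64 - s, and since 2^64 - s ≡ 2^64 (mod s), the expression computes 2^64 mod s without needing a 128-bit type.

      #include <inttypes.h>
      #include <stdint.h>
      #include <stdio.h>

      int main(void) {
          uint64_t s = 6;
          uint64_t t = -s % s; // unsigned negation wraps: -s == 2^64 - s
          // Cross-check against an explicit 128-bit computation of 2^64 mod s
          // (__uint128_t is the GCC/Clang extension used elsewhere in the thread).
          uint64_t check = (uint64_t)(((__uint128_t)1 << 64) % s);
          printf("t = %" PRIu64 ", 2^64 mod s = %" PRIu64 "\n", t, check); // both print 4
          return 0;
      }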

      1. degski says:

        For -s % s to compile in VC one has to write (0-s) % s.

        1. The code sample I have provided in my blog post is unlikely to compile under Visual Studio for other reasons as well.

          1. This compiles with MSVC:

            #include <intrin.h> // for _umul128()

            uint64_t nearlydivisionless(uint64_t s) {
                uint64_t x = random64();
                uint64_t m;
                uint64_t l = _umul128(x, s, &m);
                if (l < s) {
                    for (uint64_t t = (0ULL - s) % s; l < t; l = _umul128(x, s, &m))
                        x = random64();
                }
                return m;
            }

            The _umul128() intrinsic is available for AMD64 only; it can be implemented trivially for I386: see https://skanthak.homepage.t-online.de/nomsvcrt.html
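            For targets with neither _umul128 nor __uint128_t, a portable 64×64→128-bit multiply can be written with 32-bit halves. This is an illustrative sketch (the name umul128_portable is made up), mirroring _umul128’s convention of returning the low 64 bits and writing the high 64 bits through a pointer:

            #include <stdint.h>

            static uint64_t umul128_portable(uint64_t a, uint64_t b, uint64_t *hi) {
                uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
                uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
                uint64_t p0 = a_lo * b_lo; // partial products, each fits in 64 bits
                uint64_t p1 = a_lo * b_hi;
                uint64_t p2 = a_hi * b_lo;
                uint64_t p3 = a_hi * b_hi;
                uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2; // cannot overflow
                *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
                return (mid << 32) | (uint32_t)p0; // low 64 bits of the product
            }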

  3. Jörn Engel says:

    I have a question on the array-shuffling algorithm. Your approach seems to be

    for (int i = len - 1; i > 0; i--)
        swap(a+i, a+random_range(i));

    A more naïve approach would be

    for (int i = len - 1; i >= 0; i--)
        swap(a+i, a+random_range(len));

    Is there a reason why the second algorithm shouldn’t be used? Or is it simply slower because (for large arrays) it spills from L1 to L2 or L3 or RAM more often than the first?

    1. @Jörn

      Do you have a proof that your proposed algorithm is correct? That is, assuming that random_range is a perfect generator, you generate all permutations with equal probability?

      The algorithm I use is from Knuth. It is correct. There might be other, faster algorithms, but to be considered, they must be known to be correct.
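      For reference, here is a minimal sketch of that shuffle (an editorial illustration, not code copied from the post), where random_bounded(n) is assumed to return an unbiased value in [0, n):

      #include <stddef.h>
      #include <stdint.h>

      extern uint64_t random_bounded(uint64_t n); // assumed unbiased in [0, n)

      // Knuth / Fisher-Yates: position i-1 is swapped with a uniformly chosen
      // position in [0, i), which may be i-1 itself, so all n! permutations
      // are produced with equal probability.
      void shuffle(uint64_t *a, size_t n) {
          for (size_t i = n; i > 1; i--) {
              size_t j = (size_t)random_bounded(i);
              uint64_t tmp = a[i - 1];
              a[i - 1] = a[j];
              a[j] = tmp;
          }
      }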

    2. Travis Downs says:

      The method is biased, although it is not entirely obvious.

      If you work through the math by hand for the case of 3 elements, you’ll see that element 0 is more likely to end up in position 1 than in position 2. You can work it out by hand, perhaps by drawing a tree of all 27 possible combinations of swaps: you’ll find that 10 outcomes leave element 0 in position 1, only 8 leave it in position 2, and the remaining 9 leave it in position 0.

      Or just simulate it.
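      Here is a small brute-force sketch (an editorial illustration, not from the original comment) that enumerates all 3^3 = 27 equally likely swap sequences of the naive shuffle on three elements and tallies where the element that started in position 0 ends up; it reproduces the 9/10/8 split.

      #include <stdio.h>

      int main(void) {
          int count[3] = {0, 0, 0};
          for (int r2 = 0; r2 < 3; r2++)
              for (int r1 = 0; r1 < 3; r1++)
                  for (int r0 = 0; r0 < 3; r0++) {
                      int a[3] = {0, 1, 2};
                      int r[3] = {r0, r1, r2};
                      // naive shuffle: for i = 2, 1, 0 swap a[i] with a[r[i]],
                      // where r[i] is drawn uniformly from {0, 1, 2}
                      for (int i = 2; i >= 0; i--) {
                          int tmp = a[i];
                          a[i] = a[r[i]];
                          a[r[i]] = tmp;
                      }
                      for (int p = 0; p < 3; p++)
                          if (a[p] == 0) count[p]++;
                  }
          // prints "9 10 8": element 0 is biased toward position 1
          printf("%d %d %d\n", count[0], count[1], count[2]);
          return 0;
      }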

      1. Thank you Travis.

        1. Travis Downs says:

          It’s actually interesting to look at the pattern for arrays of say 10 elements:

          naive:
          0 1 2 3 4 5 6 7 8 9
          a[0] 10.0 10.1 9.9 10.0 9.9 10.0 10.0 10.0 10.0 10.1
          a[1] 12.8 9.4 9.3 9.5 9.6 9.7 9.8 9.8 9.9 10.1
          a[2] 12.0 12.4 8.9 9.2 9.3 9.3 9.5 9.8 9.8 9.9
          a[3] 11.2 11.5 12.1 8.8 8.9 9.2 9.4 9.5 9.7 9.8
          a[4] 10.5 10.9 11.4 11.9 8.6 8.8 9.1 9.3 9.7 9.9
          a[5] 9.8 10.3 10.8 11.3 11.7 8.6 8.9 9.1 9.6 10.0
          a[6] 9.0 9.7 10.1 10.5 11.3 12.0 8.6 9.2 9.6 10.0
          a[7] 8.6 9.2 9.7 10.1 10.8 11.3 12.0 9.0 9.4 9.9
          a[8] 8.2 8.5 9.0 9.7 10.1 10.8 11.7 12.3 9.6 10.1
          a[9] 7.8 8.1 8.8 9.1 9.7 10.4 11.2 11.9 12.7 10.2

          Notice the diagonal stripe of higher probabilities running from the top left to the bottom right. It can be described as “position N is much more likely to end up with a value that started in a position < N than one that started in a position >= N”. So, for example, a[1] is much more likely to get the value that started in a[0] than any other value.

          The intuition is this. Consider the second-to-last step in the loop, where we are swapping a[1] with a randomly selected element. Since a[1] swaps with any element with equal probability, after this step a[1] holds any specific value with probability 0.1, and in particular it holds 0 with p = 0.1.

          Now, we do the last swap, for a[0]. At this point a[0] has not necessarily been swapped yet, unless it was selected randomly in an earlier step. You can show that this means it still holds zero ~38% of the time ((1-1/n)^(n-1) = 0.9^9 ≈ 0.387 for n = 10, approaching 1/e ≈ 0.37 for large n). Now you swap that value with another position, and position a[1] will be selected with p = 0.1. So the final probability that a[1] ends up with 0 is the 0.1 chance it held zero after the second-to-last step times the 0.9 chance the last swap didn’t touch it, plus the chance that a[0] still held zero and a[1] was selected in the last step: 0.1×0.9 + 0.38×0.1 = 0.128 (the 12.8 shown as a % in the table).

          Once that is clear the rest of the table makes sense: each position has a higher probability than uniform of ending up with values in earlier positions, and then the opposite must be true for later values for everything to add up.

          This bias doesn’t go away as you increase the size of the array (I had originally thought it would). In fact, it gets more extreme if you compare the most biased elements: for larger arrays some outcomes have a 2x higher probability than others.

          1. Travis Downs says:

            Code blocks don’t preserve spacing, and don’t look the same as the preview 🙁

          2. Travis Downs says:

            Another try with leading zeros to keep format:

            naive:
            0 1 2 3 4 5 6 7 8 9
            a[0] 10.0 10.0 10.0 10.0 10.1 10.0 10.0 09.9 10.0 10.0
            a[1] 12.8 09.4 09.4 09.5 09.6 09.6 09.8 09.9 09.9 10.0
            a[2] 12.0 12.4 09.0 09.1 09.2 09.4 09.5 09.7 09.8 10.0
            a[3] 11.2 11.6 12.1 08.7 08.9 09.1 09.2 09.5 09.7 10.0
            a[4] 10.4 10.9 11.3 11.9 08.6 08.8 09.1 09.3 09.7 10.0
            a[5] 09.8 10.2 10.7 11.2 11.8 08.6 08.9 09.2 09.5 10.0
            a[6] 09.2 09.6 10.1 10.6 11.2 11.8 08.7 09.1 09.5 10.0
            a[7] 08.7 09.1 09.6 10.1 10.7 11.4 12.1 09.0 09.5 10.0
            a[8] 08.1 08.6 09.1 09.6 10.2 10.9 11.6 12.4 09.5 10.0
            a[9] 07.7 08.2 08.7 09.2 09.7 10.5 11.1 12.0 12.8 10.0

          3. Jörn Engel says:

            Very nice. Thank you!

  4. onkobu says:

    Wouldn’t it be sufficient to drop the first 4 lines (incl. the if) and set l = 0 as the start condition, using a foot-controlled do-while? (The first calculation block looks the same as the while body.)

    1. Can you point to actual code?
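      For illustration only, here is a guess at the restructuring being described (an editorial sketch, not onkobu’s actual code, written in the same __uint128_t style as the post). Note the catch: written this way, the -s % s division runs on every call, which is exactly the cost the if (l < s) fast path is there to avoid.

      uint64_t do_while_variant(uint64_t s) {
          uint64_t t = -s % s; // division now happens unconditionally
          uint64_t x;
          __uint128_t m;
          uint64_t l;
          do {
              x = random64();
              m = (__uint128_t)x * (__uint128_t)s;
              l = (uint64_t)m;
          } while (l < t); // reject l < t, as in the original
          return (uint64_t)(m >> 64);
      }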

  5. svpv says:

    So in nearlydivisionless, is l >= s the necessary and sufficient condition to produce an unbiased value? Why do you even switch to another algorithm then, which requires division, if you can simply try again without division? I.e. would the following work?

    uint64_t ratherdivisionless(uint64_t s)
    {
        while (1) {
            uint64_t x = random64();
            __uint128_t m = (__uint128_t) x * (__uint128_t) s;
            uint64_t l = (uint64_t) m;
            if (l >= s)
                return m >> 64;
        }
    }

    1. There are many possible algorithms. If you follow the link to Melissa’s post, you’ll find several. You may find an unbiased algorithm that is faster. That’s entirely possible. But I submit to you that you need to provide a proof of correctness and benchmarks to demonstrate that it is faster.

      1. Travis Downs says:

        I think you can prove that svpv’s point is a good and general one, without getting your hands dirty (i.e., testing it).

        If you have a “rejection sampling” method (because that’s what this is), which uses a fast method taking F time that succeeds with probability p, and on failure falls back to a slow method that takes S time, isn’t the fastest fallback just to do the same fast sampling method again? If it isn’t, that contradicts the idea that the rejection sampling method was faster in the first place (you should just use the “slow” fallback in that case, since it’s actually faster). You can probably show it formally just by expanding Fp + S(1-p) into an infinite series, substituting the whole expression again for S.

        The difference is microscopic in this case for most interesting s, since the first attempt succeeds with overwhelming probability except for very large s. However, it has the advantage of being simpler to implement, smaller in code size, and of not needing to justify the bias-free nature of two different methods.

        As a practical matter, however, you need the special fallback case because the l < s fast-path check will blow up as s approaches 2^64. Few people are going to pass such values, but you don’t want to take a billion years or whatever when they do. It’s kind of too bad, though: if you could just use 65 bits instead of 64 for that check, you wouldn’t need the fallback case.

        1. Travis Downs says:

          This comment is wrong: it should be ignored.

            1. Travis Downs says:

              Nah, maybe there is a useful tidbit in there.

              Actually I stand by the overall point, but I misread the code so it definitely doesn’t apply here.

              1. KWillets says:

                The problem of large s is quite valid. At a certain point a basic rejection method becomes cheaper, because you’re doing the % on a large proportion of values.

                The divisionlessness is unfortunately a function of how many excess random bits we use.

                1. Travis Downs says:

                  You could do a check for s greater than some threshold and use a simple technique.

                  A check right at the start would work but would be in the fast path. Is it possible to defer the check until after the first (l < s) check? That would move it out of the fast path and make it promising.

                  I think for practical reasons small s dominate, especially with 64-bit values, but yeah it would be nice to elegantly handle the large s case too.

                  1. KWillets says:

                    One way might be to add bits to the whole calculation when we hit l < s. Basically divide the bottom bucket into 2^b buckets (using b more random bits), reevaluate the product and l, and hope for l >= s in the wider calculation.

                    I don’t think this biases it; it just reevaluates at a higher resolution.

                    1. Travis Downs says:

                      Yeah I mean for say 32-bit s on a 64-bit machine that approach is even easy to do up front, use say a 40-bit l instead of 32-bit and you don’t have to worry about big s any more.

                      With 64-bit muls and the result split across two registers it’s not easy though.

  • KWillets says:

    No, it needs to check t.

    The range for any given output j contains either floor(2^64/s) or ceil(2^64/s) multiples of s. The check on l < s merely finds if l is the first multiple in the range; the check on t figures out if there’s room for one more value at the end.

    1. Travis Downs says:

      Why do you need to check t? You can simply throw out any values that fall into the range [s, t] without checking whether an unbiased value could have been extracted.

      In fact, that’s exactly what Daniel’s code does on the first iteration: it only checks against s. Passing that check is sufficient, although not necessary.

      1. KWillets says:

        I don’t understand your wording, so I may misunderstand, but I see a bias remaining if the l < s rejection is the only one applied.

        The goal is to get an equal number of accepted values within each interval, where we only use every s-th value, and the offset of the first value can vary.

        Given an interval [0, L) we get the most values if the offset is 0:

        0, s, 2s, … ks < L == k*s + t

        We get one fewer value by starting at t:

        t, s+t, … (k-1)s+t < L == ks+t

        So in the former case (or at any offset < t) we have to reject one value, but not if t <= l < s.

        In other words, if we chop off the first s positions we still have a varying number of values to the right.
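        To make the counting concrete, here is a toy enumeration (an editorial illustration, not from the original comment): shrink the word to 4 bits, so 2^4 = 16 stands in for 2^64, and take s = 6, giving t = 16 mod 6 = 4. Rejecting l < t leaves exactly floor(16/6) = 2 accepted x values for every output j, whereas rejecting only when l < s leaves 2 values for some outputs and 1 for others.

        #include <stdio.h>

        int main(void) {
            const unsigned W = 16, s = 6, t = W % s; // t = 4
            unsigned keep_t[6] = {0}, keep_s[6] = {0};
            for (unsigned x = 0; x < W; x++) {
                unsigned m = x * s;
                unsigned j = m / W; // high part: candidate output
                unsigned l = m % W; // low part
                if (l >= t) keep_t[j]++; // reject l < t: unbiased
                if (l >= s) keep_s[j]++; // reject l < s only: biased
            }
            for (unsigned j = 0; j < s; j++)
                printf("j=%u  l>=t: %u  l>=s: %u\n", j, keep_t[j], keep_s[j]);
            // the l>=t column is 2 for every j; the l>=s column drops to 1
            // for j=2 and j=5
            return 0;
        }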

        1. Travis Downs says:

          You are right, I hadn’t actually considered the math and misunderstood what was going on.

          I thought the rejection condition was l < s and that everything after it was just a different algorithm for generating another value (which is why I was confused about why a different algorithm would be used), but as you point out the true rejection criterion for both the initial check and the retry loop is l < t.

          The l < s is just a fast path to see if we even need to bother computing t. After we compute t the second l < t check is applied and the value may not be rejected after all.

          Once you’ve computed t you might as well stay in the inner loop with the l < t condition, since there is no longer any advantage to the two-tier check (indeed, it would be slower). So that answers svpv’s original query:

          So in nearlydivisionless, is l >= s the necessary and sufficient condition to produce an unbiased value?

          No, it is not sufficient. The check l >= t is the sufficient condition (strictly speaking it is not necessary for most s, since you could also choose some reduced interval and reject more values, but that would be dumb).

          Thanks for setting me straight.

  • Travis Downs says:

    Someone (not me) should submit a fix to std::uniform_int_distribution. At least in libstdc++ it looks like a total dog in this respect, doing at least two divisions per random value generated:

    if (__urngrange > __urange)
    {
        // downscaling
        const __uctype __uerange = __urange + 1; // __urange can be zero
        const __uctype __scaling = __urngrange / __uerange;
        const __uctype __past = __uerange * __scaling;
        do
            __ret = __uctype(__urng()) - __urngmin;
        while (__ret >= __past);
        __ret /= __scaling;
    }

    So something like 100+ cycles for generating a random int on many modern machines?

    1. Travis Downs says:

      … and by “fix” I mean use this multiplicative technique.

  • KWillets says:

    Following some of my other comments on this thread, I put together a version of this that falls through to a multiply instead of a division if the initial check fails. It took me a bit of fumbling with basic arithmetic, but I eventually realized that extending to a 64-bit fraction only requires starting with the 32-bit version and a similar cascade through a couple of cheap-to-check conditions.

    See the RangeGeneratorExtended here:
    https://gist.github.com/KWillets/b6f76894c115e41339bcb8d34bf9ea49

  • me says:

    Since Java does not have a 128-bit integer type:

    Is it safe to “downgrade” this to output 32 bit random integers by using 64 bit where you used 128 bit, and 32 bit where you used 64 bit?

    1. me says:

      Is there any difference to https://lemire.me/blog/2016/06/30/fast-random-shuffling/ besides now being published?

      1. The blog post you comment upon is entitled “Nearly Divisionless Random Integer Generation On Various Systems”. The post you refer to only benchmarked on one system (processor).

    2. It is safe for 32-bit integers, yes.
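      For concreteness, here is a sketch of that 32-bit downgrade (an editorial illustration mirroring the 64-bit code one size down; random32 is an assumed 32-bit generator). In Java the same scheme works with a long holding the 64-bit product (widen the 32-bit words with & 0xFFFFFFFFL) and Integer.compareUnsigned for the unsigned comparisons.

      #include <stdint.h>

      extern uint32_t random32(void); // assumed unbiased 32-bit generator

      uint32_t nearlydivisionless32(uint32_t s) {
          uint32_t x = random32();
          uint64_t m = (uint64_t)x * (uint64_t)s; // 64-bit product
          uint32_t l = (uint32_t)m;               // low 32 bits
          if (l < s) {
              uint32_t t = (uint32_t)((UINT64_C(1) << 32) % s); // 2^32 mod s
              while (l < t) {
                  x = random32();
                  m = (uint64_t)x * (uint64_t)s;
                  l = (uint32_t)m;
              }
          }
          return (uint32_t)(m >> 32); // high 32 bits
      }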