
Doubling the speed of std::uniform_int_distribution in the GNU C++ library (libstdc++)

49 thoughts on “Doubling the speed of std::uniform_int_distribution in the GNU C++ library (libstdc++)”

  1. KWillets says:

    Following my comments on earlier posts, I put together a repo with a few variations on this method which might be interesting.

    One observation after plugging in std::random_device is that random bits are expensive, and another direction might be to minimize how many we use. Shuffle, for instance, can keep track of the bit width of its index as it works its way down, and rejection within just that many bits may prove faster due to lower entropy consumption (a sketch of this follows below).

    With pseudorandom generators of course it’s different, so we almost need different algorithms for the different sources.
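
    A minimal sketch of that variable-bit-width rejection idea (my own illustration, not the repo’s code; std::mt19937_64 merely stands in for the bit source):

        #include <cstdint>
        #include <random>
        #include <utility>
        #include <vector>

        // Uniform integer in [0, n] using only ceil(log2(n+1)) random bits
        // per attempt; the expected number of attempts is below 2.
        static uint64_t bounded(uint64_t n, std::mt19937_64 &gen) {
            uint64_t mask = n;  // smear n into an all-ones mask covering [0, n]
            mask |= mask >> 1;  mask |= mask >> 2;  mask |= mask >> 4;
            mask |= mask >> 8;  mask |= mask >> 16; mask |= mask >> 32;
            uint64_t x;
            do {
                x = gen() & mask;  // keep just enough bits
            } while (x > n);       // reject out-of-range values
            return x;
        }

        static void shuffle_low_entropy(std::vector<uint32_t> &v,
                                        std::mt19937_64 &gen) {
            if (v.size() < 2) return;
            for (size_t i = v.size() - 1; i > 0; i--)
                std::swap(v[i], v[bounded(i, gen)]);  // bit width shrinks with i
        }

    With a truly expensive source one would pull only popcount(mask) bits from a buffered pool per attempt instead of masking a full 64-bit output.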

    1. foobar says:

      Wouldn’t bit-optimal use of entropy effectively correspond to arithmetic decoding of a compressed bitstream? How hard would that be? I guess there would be a spectrum of solutions to such a problem, from fast and bit-expensive to slow and (almost?) bit-optimal.

      1. KWillets says:

        AC did seem relevant, and it might be good to look through the different solutions. I have a feeling that left-to-right infinite-precision multiplication is already well known somewhere, but I haven’t found a reference.

        1. If someone wants to work through this, I am available.

          I think that there are papers on using as few random bits as possible; I read a few, but I did not pay attention because that’s not what interested me at the time.

          1. foobar says:

            I have a preliminary implementation which consumes an “optimal” number of whole bits per call. (It’s not based on AC, though.) There are two variants: one that is easy to understand, and another that’s a little hairier but branch-predictor friendly and a bit faster, although it conceptually does a lot of unnecessary work.

            The branch-predictor-friendly version would seem to have an overhead of about 9 ns per call for small non-power-of-two values on my couple-of-years-old 2.9 GHz laptop. I still have to clean up my code…

          2. foobar says:

            I have experimented with some code that uses bits a bit more efficiently: a variant of the “bitmask with rejection” algorithm (into which the “mostly avoiding divisions” algorithm turns when a variable number of bits is used, I believe), and a “long multiplication” algorithm. There are naive and loop-unrolled versions of both (the latter to please branch predictors).

            I think the bitmask-with-rejection algorithm consumes log(4)·log2(n) ≈ 1.39·log2(n) bits on average for large n, and the long multiplication algorithm always uses less than log2(n)+2 bits on average for a specific value of n, the number of possible result values. Thus, the long multiplication algorithm uses the bit material more efficiently, although for almost all values of n it imposes a two-bit overhead over optimal fractional-bit algorithms, and it actually loses to the bitmask-with-rejection algorithm when n is a little less than an integer power of two.

            My oversimplified benchmark (including xorshift64s and bit management) spends around 4.5 ns per call on the naive bitmask-with-rejection function, and 4.3 ns on the unrolled variant (for n=24; these can vary significantly depending on the value of n). For the long multiplication algorithm the corresponding numbers on my 2.9 GHz ’16 laptop are 11.3 and 9.2 ns.

            My code is hopefully readable enough: https://pastebin.com/5c8zMk6B

            1. foobar says:

              I fixed my long multiplication routine, and interestingly enough the non-unrolled variant is now faster, although it’s slower than before. It now takes about 12.4 ns per call.

              I think this doesn’t really reflect the performance of the long multiplication algorithm itself (which should have a run time independent of n on average; with BOGUS_RANDOMNESS in my code the typical call time drops below 4 ns), but rather the management overhead of the random-bits pool. The numbers might be much better if one acquired randomness in larger batches than 64 bits at a time and shifted it from the pool in a branchless manner, along the lines of the sketch below.
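
              For illustration, a rough sketch of such a pool (assumptions mine; it refills 64 bits at a time and, for simplicity, discards leftovers when fewer than k bits remain, which a careful implementation would avoid):

                  #include <cstdint>
                  #include <random>

                  // Toy pool of random bits: take(k) shifts k bits out and
                  // refills from the generator in 64-bit batches.
                  struct BitPool {
                      std::mt19937_64 gen{42};
                      uint64_t bits = 0;
                      unsigned count = 0;  // valid bits left in 'bits'

                      uint64_t take(unsigned k) {  // k in 1..63
                          if (count < k) {         // refill; leftovers dropped
                              bits = gen();
                              count = 64;
                          }
                          uint64_t out = bits & ((UINT64_C(1) << k) - 1);
                          bits >>= k;
                          count -= k;
                          return out;
                      }
                  };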

        2. foobar says:

          This problem is awfully painful to think about without rational bignums if one wants to handle fractional bits (in the style of arithmetic coding), but if we choose to consume a whole (but possibly variable) number of input bits for every output value, it probably becomes more tractable. (I leave the implementation as an exercise to the reader, as it’s quite late evening in my time zone. :)

          1. Travis Downs says:

            “Efficient” AC is widespread in compression algorithms, and doesn’t need any bignum stuff. You just don’t get “exact” AC, but rather precision down to the number of bits you have (there are some tradeoffs, e.g. with how often you flush).

            So I don’t see a problem in adapting those algorithms to the shuffle case: the analogy is more or less exact. However, they all need one division per value, so I have my doubts.

            That said, I think Daniel’s technique works with any number of bits, right? So you don’t need a different algorithm to be efficient down to the bit: AC would only get you sub-bit precision, and your generator would have to be really slow for that to pay off, I think.

            Well, that’s not strictly correct, since in the retry case Daniel’s algorithm will use another set of bits. Daniel, is there a formula for the number of bits used in expectation for a given range?

            1. foobar says:

              Okay, bad terminology (and I was also pretty tired).

              One thing I fear about using efficient arithmetic-coding methods is introducing some of the skew that these routines are intended to avoid. Practical AC doesn’t need to care about that so much.

              Reasoning about the expected number of bits in the case of the code above is an interesting question. I think it should be easier to reason at least about the case of an “infinite-precision” binary fixed-point number representing a range between 0 and the number of alternatives (non-inclusive). What is the expected number of bits of such a random number that determines the random integer without uncertainty? The best case would be log2(maxval); the worst would be infinity, for almost all values of maxval, I think.

              I wonder if such a routine could be turned into something akin to Bresenham’s algorithm in practice. Relying on single-bit input steps would probably make that ugly for the branch predictor, though…

              1. Travis Downs says:

                Really good point about the bias. Of course in compression you don’t really care, since it’s just a microscopic difference in the compression ratio, but for this type of thing it would be a big deal.

                1. foobar says:

                  I suspect working around bias turns into nasty bignum math very quickly. On the other hand, if you want to generate uniformly distributed numbers over the same, relatively small range s (below the square root of the native register), you can simply draw uniform integers over the range s^n (smaller than the register range) using the routine I present in the godbolt link below, and later chop them relatively efficiently into individual results using precomputed division/residual magic constants.

                  This drops the expected number of entropy bits used for range 3 from 3 bits to 1.63 in a 64-bit register implementation (optimal would be 1.58), for instance; the back-of-envelope arithmetic is below.
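
                  Roughly the arithmetic behind that figure (my own back-of-envelope numbers): the largest power of 3 fitting a 64-bit word is 3^40 ≈ 1.21×10^19 < 2^64, so one bounded draw yields 40 ternary values at an amortized 64/40 = 1.6 bits each, against the log2(3) ≈ 1.585 optimum; the remainder up to 1.63 is presumably the rejection overhead of the bounded draw.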

            2. You approximately have a geometric distribution with p = 1 - s/2^L where s is your range. I think that the expected number of trials before you have a success is 1/p, so 2^L/(2^L-s)… You consume L bits per trial, so it is about 2^L/(2^L-s) * L bits consumed to generate a random number in [0,s).

              You can solve for the optimal value of L. It is going to be higher than log2(s), evidently. My expectation is that the optimal L is not going to be very far from log2(s), but I am sure someone can derive a solid bound.

              Sorry, I have not done the math yet.
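
              Working that estimate through for a small case (my numbers, s = 3): L = 2 costs 2 · 4/(4-3) = 8 bits, L = 3 costs 3 · 8/(8-3) = 4.8 bits, and L = 4 costs 4 · 16/(16-3) ≈ 4.9 bits. So the optimal L here is 3, one bit above ceil(log2(3)) = 2, consistent with the expectation that the optimum is not far from log2(s).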

              1. foobar says:

                Some empirical averages (over a million random runs each) of the bits used that I observed in my code, for 1 to 64 values:

                {0., 1., 2.99913, 2., 3.68596, 3.75048, 4.09385, 3., 4.56261, 4.56261, 4.7658, 4.62472, 5.00652, 5.03252, 5.21029, 4., 5.53107, 5.53158, 5.62542, 5.56342, 5.71718, 5.71846, 5.82045, 5.62477, 5.9298, 5.93728, 6.04736, 5.9685, 6.16978, 6.17932, 6.27255, 5., 6.51585, 6.51485, 6.56376, 6.53185, 6.60912, 6.60959, 6.65559, 6.56258, 6.70191, 6.70274, 6.75153, 6.71835, 6.79677, 6.79703, 6.84769, 6.62494, 6.90202, 6.90547, 6.95634, 6.92183, 7.0106, 7.01561, 7.07071, 6.96757, 7.12957, 7.13322, 7.18957, 7.14959, 7.25198, 7.2558, 7.30239, 6.}

                A plot of these doesn’t make the pattern too obvious, but yes, it looks like a base-2 logarithm plus some overhead…

              2. foobar says:

                No wonder my plots seemed odd: I had a bug in my code. When corrected, it would seem that the average number of bits stays under log2(n + 1) + 2 bits for outputs of n alternatives.

              3. foobar says:

                Average number of bits required for s alternatives (with a bit of help from OEIS in the form of mystery sequence A104594) is, at least for my code which consumes whole bits:

                x = ceil(log2(s)) + bitand(s, s-1)/2^(ceil(log2(s))-1)

                And for s >= 1: ceil(log2(s)) <= x < ceil(log2(s))+2
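
                A quick numeric check of that closed form (the helper name avg_bits is mine):

                    #include <cmath>
                    #include <cstdint>
                    #include <cstdio>

                    // x = ceil(log2(s)) + (s & (s-1)) / 2^(ceil(log2(s)) - 1)
                    static double avg_bits(uint32_t s) {
                        double k = std::ceil(std::log2((double)s));
                        return k + (double)(s & (s - 1)) / std::exp2(k - 1);
                    }

                    int main() {
                        // prints 1, 3, 2, 4, 4, 4.5, 3, 5 for s = 2..9, each
                        // within [ceil(log2(s)), ceil(log2(s)) + 2) as claimed
                        for (uint32_t s = 2; s <= 9; s++)
                            std::printf("s=%u: %.3f bits\n", s, avg_bits(s));
                    }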

    2. The “bits are expensive” scenario is interesting. Note that the existing two-division approach does not use fewer bits… you know this, but a casual reader may not.

      In the case where bits are infinitely expensive, I guess the solution is known.

      1. KWillets says:

        The existing thing is suboptimal in just about every way.

        With expensive bits, the rightward-extension idea may be adaptable to consuming only a few at a time, e.g. an 8-bit random value times a 32-bit range. It would trade more adds and multiplies (every 8 bits instead of every 32) for fewer random bits consumed.

      2. foobar says:

        Here’s a rough idea: precompute magic constants which convert divisions into multiplications upon initialization of std::uniform_int_distribution. Maintain a reasonable pool of usable entropy bits (say, 64 bits or so), and rescale a full native word of entropy bits to the output range (using the magic multiplicative constants); this is the preliminary result. Now, multiply this again by another magic constant to get a fixed-point fraction corresponding to the “exact” bit pattern which would represent this value in the native type, and XOR this with the original entropy bits. If the result is not zero, find the most significant set bit, return that bit and the less significant bits to the entropy pool (these bits were not necessary in deciding the output value), and return the preliminary result as the random value.

        If the result of the XOR is zero, continue by requesting more random bits and by performing a long-division algorithm (with the above bit checking) until the arbitrary-precision fraction and the entropy bits diverge.

        If my reasoning is correct, this routine shouldn’t demand divisions beyond initialization or excessively wide multiplications, and it should waste less than one bit of entropy per generated random value.

        1. Are you assuming that the range does not change dynamically?

          1. foobar says:

            Naively so, or at least that it doesn’t change often. It is hard to see how one could perform bit-efficient operations without that assumption; it would eventually turn into implementing division one way or another.

            (Extracting one bit of a division result shouldn’t take more than something like 2-3 clock cycles, though, and every such bit would halve the likelihood of needing to continue, but combining this with the approach of the blog post is a bit beyond me.)

            1. The benchmark described in my blog post is a Knuth shuffle, so it is a dynamic range.

              1. foobar says:

                That definitely sabotages my approach. 🙂

                Maybe there’s a way to extend the multiplication method somehow, considering that the need for extra bits should become increasingly unlikely, bit after bit.

                Bit-by-bit division wouldn’t be awfully hard to implement, and it could run until the result is known to be unbiased, but that would have a cost per produced bit. It might still be faster for many inputs than two hardware divisions (or similar overhead)!

              2. foobar says:

                Here’s an alternative version. This is really pseudocode; I haven’t even tried compiling it. I tested the concept for ranges 3 and 5 with corresponding Wolfram Language code and 2- and 3-bit “word sizes”, and it seemed to give unbiased results. So the likelihood of the code having significant bugs is large. (Writing the supporting code was just a bigger hassle than translating the idea.)

                uint32_t get_uniform_rand(uint32_t range)
                {
                    uint32_t shift = 32 - __builtin_ffs(range);
                    uint32_t add = range << shift;
                    uint64_t mul;
                    uint32_t res;
                    uint32_t frac;

                    if (range == 0)
                    {
                        return 0;
                    }

                    frac = mul = (uint64_t)(get_random_bits(bits) << shift) * range;
                    res = mul >> 32;

                    /* Can the fractional part still cause res to increment? */
                    while (frac > -add)
                    {
                        uint32_t oldfrac = frac *= 2;

                        if (get_random_bit())
                        {
                            frac += add;
                        }

                        if (oldfrac > frac)
                        {
                            /* Wraparound, that is, carry. */
                            return res + 1;
                        }
                    }

                    return res;
                }

                The idea here is that at first only a limited number of bits contributes to the multiplication, and then the multiplication is continued one bit at a time until it is known that further bits cannot change the output value.

                The biggest problem on modern processors is probably branch prediction of the while loop. This code might be faster if the loop were unrolled a couple of times and the point where all necessary bits had been seen were worked out afterwards.

                1. foobar says:

                  Ehm. “bits” above is __builtin_ffs(range).

                2. foobar says:

                  And __builtin_ffs -> __builtin_clz. Jeez.

                  I think this code actually works: https://godbolt.org/z/pMDK_5

                  1. foobar says:

                    Now it works. Promise!

                    https://godbolt.org/z/q96jQx

                    1. foobar says:

                      I think the bit-by-bit multiplication might be omitted (performing a full multiplication in the style of the blog post instead) while still tracking the number of necessary bits of entropy; this information could be extracted from the lower word of the multiplication result. But I have no proof of this, nor a clear hunch whether it could be accomplished efficiently (without a division).

                      This still doesn’t remove the fact that sometimes more than one word of entropy is truly needed to get an unbiased result.

  • Thomas Müller Graf says:

    The “bits are expensive” scenario: arithmetic coding could be used, or ANS coding (“asymmetric numeral systems”; see Wikipedia). ANS might be faster. The problem is probably correcting bias (for both arithmetic and ANS coding, I think). For ANS coding, the generation loop looks something like this (Java), so it only uses multiplication / shift / add:

    private static final long TOP = 1L << 24;
    private static final int SHIFT = 12;
    private static final int MASK = (1 << SHIFT) - 1;

    // read 64 bits from the compressed stream
    long state = r.nextLong();
    // for all output bytes
    for (int i = 0; i < size; i++) {
        // read data from the state
        int x = (int) state & MASK;
        // look up the code
        int c = freqToCode[x] & 0xff;
        // output
        out[i] = (byte) c;
        // update the state depending on output and frequencies
        state = (freq[c] * (state >> SHIFT)) + x - cumulativeFreq[c];
        // if necessary, read from the compressed stream
        while (state < TOP) {
            state = (state << 32) | (r.nextInt() & 0xffffffffL);
        }
    }

    I don’t think tables are needed, as the frequencies should be equal for each entry in the alphabet. I implemented some ANS code in h2/src/test/org/h2/test/unit/TestAnsCompression.java (https://github.com/h2database/h2database).

  • foobar says:

    There are no casual readers here

  • Tom says:

    Great article. Since these are ultimately arithmetic tricks, I hope that one day this will be the level of optimization that the compiler gives us.

    1. I am not sure that the compiler would be allowed to make this optimization since it changes the output.

  • David Taylor says:

    I was interested enough in this that I tried to understand it and make it work. Unfortunately, it didn’t work at first, and I found the problem.

    This code is very inefficient with memory, but it shows the problem and the fix.

    The idea is NOT to feed it random numbers, but to feed it every number in the input integer size and see how they get binned. Each output integer should have an equal number of inputs that resolve to it.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int16_t nearlyDivisionlessY(uint16_t x, uint16_t divisor, uint16_t fixed) {
        uint32_t m;
        uint16_t l, v;
        m = (uint32_t)x * divisor;
        l = (uint16_t)m;
        if (l < divisor) {
            if (!fixed) {
                v = -divisor % divisor; /* PROBLEM: referenced paper AND proposed kernel patch say use this */
            } else {
                v = (uint16_t)(-divisor) % divisor; /* FIX */
            }
            if (l < v) {
                return -1;
            }
        }
        return m >> 16;
    }

    #define MYMAX (UINT16_MAX+1)
    uint16_t bins[MYMAX];
    uint16_t bbins[MYMAX];

    int main(int argc, char *argv[]) {
        uint16_t i, divisor, fix;
        int16_t b; /* must be signed, so the rejection test below works */
        for (fix = 0; fix <= 1; ++fix)
            for (divisor = 6; divisor <= 6; ++divisor) {
                if (!(divisor & (divisor - 1))) continue; /* skip powers of 2 */
                printf(" %u %s 0x10000 %% %u = %u\n", divisor, (fix ? "fixed" : "original"),
                       divisor, (uint16_t)(0x10000UL % divisor));
                memset(bins, 0, sizeof(bins));
                memset(bbins, 0, sizeof(bbins));
                i = 0;
                do {
                    b = nearlyDivisionlessY(i, divisor, fix);
                    if (0 <= b)
                        bins[b]++;
                } while (++i);
                i = 0;
                do {
                    bbins[bins[i]]++;
                    if (bins[i]) {
                        printf("%5u %5u\n", i, bins[i]);
                    }
                } while (++i < divisor);
                i = 0;
                do {
                    if (bbins[i]) {
                        printf("%5u %5u\n", i, bbins[i]);
                    }
                } while (++i);
            }
        return 0;
    }
    /* output:
    6 original 0x10000 % 6 = 4
    0 10923
    1 10923
    2 10922
    3 10923
    4 10923
    5 10922
    10922 2
    10923 4
    6 fixed 0x10000 % 6 = 4
    0 10922
    1 10922
    2 10922
    3 10922
    4 10922
    5 10922
    10922 6
    */
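
    The root cause, for what it’s worth, is C’s integer promotion: with a 16-bit type, -divisor is evaluated as a negative int, so -divisor % divisor is 0 and the rejection threshold disappears; with the paper’s 64-bit words, unary minus wraps modulo 2^64 and yields the intended 2^64 mod divisor. A two-line demonstration (mine):

        #include <cstdint>
        #include <cstdio>

        int main() {
            uint16_t divisor = 6;
            std::printf("%d\n", -divisor % divisor);             // 0: int arithmetic, -6 % 6
            std::printf("%d\n", (uint16_t)(-divisor) % divisor); // 4: 65530 % 6 == 2^16 % 6
        }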

  • foobar says:

    Another question on generating pseudorandom numbers on a non-power-of-two interval: would it be possible to implement PRNGs that emit uniformly distributed pseudorandom integers while evolving their state efficiently on conventional binary logic and performing a fixed amount of work between outputs? Would there be a trivial method to construct such a PRNG for a specific interval size?

    The listed unbiased transforms of power-of-two PRNGs fail on the “fixed amount of work” part, and evolving the state as trits instead of bits, for instance, is unlikely to be particularly efficient on conventional CPUs.

    1. foobar says:

      Well, I should have thought it through…

      A simple uniform ternary toy PRNG (cycle length 243 == 3^5) might be, for instance:

      #include <stdint.h>

      uint8_t state;

      uint8_t random3(void)
      {
          /* LCG modulo 3^5 = 243; full period, since 154 is coprime
             to 243 and 7 - 1 is divisible by 3 */
          state = (7 * state + 154) % 243;
          return state / 81; /* top ternary digit, uniform over {0, 1, 2} */
      }

      1. Travis Downs says:

        If it’s any consolation, had you not given me a minute to answer your question, I also would have thought “no, you can’t do it in guaranteed fixed time”, by analogy with the superficially similar problem “given a source of random bits, generate a random value between 1 and N, where N is not an integer power of 2” (or several equivalent formulations).

        At least for that problem (the one the post was about, I guess), I don’t think you can do it with a guaranteed fixed number of bits? Of course the expected number of bits is finite, due to the magic of exponential series, but you still need an arbitrary number of bits at least for some cases, IIRC.

        So I would have thought something similar here, but as you point out, obviously not.

        It makes sense though: the operators +, %, * aren’t inherently “binary”, so it’s not too weird, I guess, that they work fine for non-powers-of-two: it’s only when your underlying random source provides bits that the power of 2 becomes inherent. Or something handwavy like that.

        1. foobar says:

          I guess I agree (in equally handwavy ways) on pretty much all of this.

          I think it’s intuitive that for ranges that are not integer powers of two you can’t guarantee an upper bound on the number of bits needed to produce a result when the input source is base-2, but the mean number of bits required is bounded for any finite range. In an earlier comment I concluded that the long multiplication method should work with less than ceil(log2(n)) + 2 bits per call on average for any value (and at best log2(n) for integer powers of two, of course).

          1. foobar says:

            I think the bit-efficiency overhead (in practice, the rounding up of log2(n) and the + 2 part) could be largely eliminated for values of n much smaller than the word size: one could simply keep the seen random bits from the previous round, re-feed them into a new calculation where n = n_new * n_previous, subtract the preceding value multiplied by n_previous from the result, and repeat this as long as n fits in the word size. This could provide bit efficiency approaching log2(n), about 3% (or 1/32) worse than optimal in the case of 64-bit words. The computational overhead compared with the naive implementation would be something like two multiplications and two additions per call.

            (This would effectively be a computation which incrementally computes the result for the range n_1 * n_2 * … * n_x and peels off values for the individual values of n at the same time; a rough sketch of the batching part follows below.)

            Thus one can get quite close to optimal bit efficiency while maintaining unbiased results, without needing to resort to arbitrary-precision arithmetic or divisions, at least when the typical n is significantly smaller than the native word size – which should really be the common case for such a function.
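
            A rough sketch of the batching idea (code and naming mine; unlike the scheme above, this naive version throws away rejected words instead of recycling their bits):

                #include <cstdint>
                #include <random>

                // Unbiased uniform value in [0, bound) by classic modulo rejection.
                static uint64_t uniform_below(uint64_t bound, std::mt19937_64 &gen) {
                    uint64_t threshold = -bound % bound;  // 2^64 mod bound
                    uint64_t x;
                    do { x = gen(); } while (x < threshold);
                    return x % bound;
                }

                // Fill out[0..k) with independent uniform values in [0, n) from a
                // single draw over n^k. Caller ensures n^k fits in 64 bits; with n
                // fixed (or with precomputed magic constants, as suggested above),
                // the / and % turn into multiplications.
                static void batch(uint64_t n, unsigned k, uint64_t *out,
                                  std::mt19937_64 &gen) {
                    uint64_t pow = 1;
                    for (unsigned i = 0; i < k; i++) pow *= n;
                    uint64_t v = uniform_below(pow, gen);
                    for (unsigned i = 0; i < k; i++) {
                        out[i] = v % n;  // peel off one base-n digit
                        v /= n;
                    }
                }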

            1. Travis Downs says:

              This paper is also interesting:

              Optimal Discrete Uniform Generation from Coin Flips, and Applications

              They work in a model where you consume randomness one bit at a time (your only source of randomness is a coin-flip), and the primary algorithm (FASTDICEROLLER) for generating values in a range is also division free.

              Naively, it requires at least ~log(N) flips for a range of size N, but the initial ~log(N) flips could simply be done all at once by extracting that many bits (for these initial flips the v >= n condition is always false, so it just amounts to accumulating random bits in c).

              Performance is not terrible, but I suspect depends heavily on the exact value of n. If it is just above a power of two, the first “meaningful” flip (that can be accepted or rejected, i.e., where v >= n) will be rejected with probability a bit less than 50%, so you’ll probably take a misprediction. For values just under a power of two, rejection chance is close to zero and it should be very fast.

              I prefer Lemire’s algorithm because in practice you’d probably want to use a few more bits (but not 64!) to reduce the rejection chance and improve performance, and his approach provides a way to do that while remaining division-free.
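
              For reference, the FASTDICEROLLER loop is short enough to quote; this is a C++ rendering of the paper’s pseudocode (my transcription), with flip() standing in for the paper’s one-bit randomness source:

                  #include <cstdint>
                  #include <random>

                  static std::mt19937_64 rng{12345};
                  // wasteful stand-in for a single coin flip
                  static uint64_t flip() { return rng() & 1; }

                  // Uniform integer in [0, n), for 1 <= n <= 2^63; rejected
                  // states are recycled, so no randomness is thrown away.
                  static uint64_t fast_dice_roller(uint64_t n) {
                      uint64_t v = 1, c = 0;  // invariant: c uniform over [0, v)
                      while (true) {
                          v *= 2;
                          c = 2 * c + flip();  // append one random bit
                          if (v >= n) {
                              if (c < n) return c;  // accept
                              v -= n;  // reject: keep the leftover interval
                              c -= n;
                          }
                      }
                  }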

              1. foobar says:

                Interesting find!

                On a quick glance, the paper would seem to do pretty much the things I ranted about in these comments a year ago. I believe the algorithm is effectively equivalent to what I investigated, although my interest was in achieving reasonable real-world performance while maintaining optimal bit usage.

                I did attempt some optimizations to tame branch predictor issues, but in the end they were not particularly effective – then again, if you are really limited by the speed of “coin flips” (a physical randomness source) and want to use those bits efficiently, maybe algorithmic speed is not of the highest importance. Decreasing bit efficiency dramatically reduces those speed penalties, anyway.

                I believe my most recent work on this is here: https://pastebin.com/5c8zMk6B – but it doesn’t involve remembering past requests in order to increase bit efficiency, which could be implemented relatively easily.

                1. Travis Downs says:

                  Yeah, the FDR approach in the paper isn’t exactly optimal either; they do provide a more complicated approach that is optimal, but I think the basic approach is close enough.

                  Your linked approach and the approach in the paper both share an interesting characteristic that I forgot to mention above: they don’t even need a multiplication. For machines with expensive multiplications but cheap branch mispredictions (many microcontroller-ish things qualify) this might be a big benefit.

                  1. foobar says:

                    Yeah, multiplication is technically just a shortcut. Microcontrollers could do better without it, particularly if the target range is small, although I guess cheap multiplication is available on most “modern” microcontrollers (then again, there’s plenty of legacy like PICs etc.).

                  2. foobar says:

                    Actually, looking at this a bit more closely… my algorithm is not optimal in its usage of bits! The mean overhead for n=3 is only a third of a bit and drops from there, but nonetheless! Interestingly enough, the algorithm of taking two random bits and retrying if the value is 3 is the most efficient solution for this exact problem, with an expected consumption of 8/3 bits (while mine uses 3). I think this is a bit surprising, since a quarter of the time two bits are completely discarded…

  • Jake says:

    I understand the academic interest in an absolutely uniform distribution, but I think that for all *reasonable intents and purposes* just doing prng() % range is good enough, prng being an instance of mt19937_64 or some such.

    *Reasonable intents and purposes* means the output range is much smaller than the prng range, and the number of trials is also small relative to the full 64-bit range, so you would not notice the lower part of the output range having a teeny-tiny higher chance of occurring.

    So we can save ourselves the branch for chopping off the top of the input range.

    Now for the division, or modulo, you can use libdivide, fastmod, or any other similar implementation.

    Please tell me where I am wrong. Thanks!

    1. You may not be concerned with biases, and in your own use cases that may be warranted. However, the existing functions in standard libraries were bias-free. The proposed function, which has now been adopted in several standard libraries, replaced a bias-free function with another bias-free function while improving performance.

  • Shawn Silverman says:

    I recently implemented this algorithm, and it works great, except for when using it with std::minstd_rand. For that case, I get a uniform distribution over only half the requested range. For example, if I request a range of 10, this should result in numbers in the range 0-9, however, I’m only seeing results in the range 0-4.

    std::mt19937 works fine, as does using a hardware-based entropy generator, with the exact same “Lemire algorithm” code.

    Is this algorithm known to have issues when used with LCG-type generators as a source (or maybe I’m over-generalizing)?

    Here’s some test code that demonstrates the problem:
    https://gist.github.com/ssilverman/216c0e3f31e0a644abb300561bb733c7

    I tested this using Xcode 14.3.1 (Apple clang version 14.0.3 (clang-1403.0.22.14.1)) and on a Teensy 4.0 (Teensyduino 1.58, GCC 11.3.1).

    1. Is this algorithm known to have issues when used with LCG-type generators as a source (or maybe I’m over-generalizing)?

      The assumption is that you have a generator that is uniform over the 32-bit or 64-bit range. You are using a generator that is not uniform over the range of values supported by uint32_t: by default, it will only generate values in [1, 2^31-1). So it is not appropriate.
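
      A quick check makes the mismatch visible (the multiply-shift reduction assumes the generator spans the full word):

          #include <iostream>
          #include <random>

          int main() {
              std::cout << std::minstd_rand::min() << " "
                        << std::minstd_rand::max() << "\n";  // 1 2147483646
              std::cout << std::mt19937::min() << " "
                        << std::mt19937::max() << "\n";      // 0 4294967295
          }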

      1. Shawn Silverman says:

        That makes total sense. That was quite a thing to overlook on my part. 🙂

        Now I have to figure out why the other algorithm in that code (randomNonLemire()), which I believe also assumes a full 32-bit range, gives expected results over the whole requested range (0-9) even though the pseudo-random numbers have a 0-(2^31-1) range.

        1. Shawn Silverman says:

          For future readers: I settled on using std::uniform_int_distribution because it likely takes into account all the cases I didn’t think of, such as whether the random source has “full precision”. I’m fortunate to be using systems that have this class.