, 42 min read
The fastest conventional random number generator that can pass Big Crush?
43 thoughts on “The fastest conventional random number generator that can pass Big Crush?”
, 42 min read
43 thoughts on “The fastest conventional random number generator that can pass Big Crush?”
While Lehmer’s multiplicative congruential generator is fast, it does NOT pass PractRand 0.94 (but you know this already). In does it beat the minimal standard and too big to fail Melissa O’Neill tested these generators with PractRand 0.93, where they passed.
Consider an idea from George Marsaglia instead: save and add the “carry” of the multiplication, giving a multiply-with-carry generator:
uint64_t seed = 0, carry = 0x...;
uint64_t square_with_carry() {
__uint128_t mwc = (__uint128_t) seed * 0x... + carry;
seed = mwc;
carry = square >> 64;
return seed;
}
Or a square-with-carry generator:
uint64_t seed = 0, carry = 0xb5ad4eceda1ce2a9;
uint64_t square_with_carry() {
__uint128_t square = (__uint128_t) seed * seed + carry;
seed = square;
carry = square >> 64;
return seed;
}
This 64-bit generator passes the PractRand test suite (version 0.94) at least up to 4 TB!
See source and source for the code generated by GCC 8.2 and clang 7.0 respectively, plus VC source for an implementation using Microsofts Visual C compiler (which does not support a 128 bit integer type).
I used the latter for the PractRand test.
I turned your comment into an issue on GitHub: https://github.com/lemire/testingRNG/issues/14
I was trying to understand why Lehmer’s with 3 generators is more performant on A53, while single generator perfectly loads pipeline with something about 9 cycles consumed and 12 cycles latency (this is calculated by A57 reference, but should be relevant for A53 too). 3 instructions are required:
umulh x9, x22, x23
mul x22, x22, x23
madd x19, x19, x23, x9
Then I realised that test uses different iterations count, so basically it tests branch prediction efficiency. After running the fixed version results became more reasonable:
wyrng 0.009253 s
splitmix64 0.007591 s
lehmer64 0.007093 s
lehmer64 (2) 0.00733 s
lehmer64 (3) 0.007269 s
Next we do random number computations only, doing no work.
wyrng 0.011513 s
splitmix64 0.008451 s
lehmer64 0.00664 s
lehmer64 (2) 0.006763 s
lehmer64 (3) 0.006762 s
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/30
Excellent, thanks.
Running the PRNGs in a tight loop just summing their results is typically not seen in practice: real-world applications perform some (more) computation with the random numbers.
Consider to change your timing setup a little bit and perform some (silly, but yet sufficient) computation, like the MonteCarlo determination of pi:
unsigned hits = 4000000;
for (unsigned i = hits; i > 0; i--)
{ const unsigned x = prng(), y = prng();
const unsigned long long z = ~0ULL * ~0ULL;
hits -= z - (unsigned long long) x * x < (unsigned long long) y * y;
}
printf("Pi = %lu.%06lu\n", hits / 1000000, hits % 1000000);
Running the PRNGs in a tight loop just summing their results is typically not seen in practice
It is not. Indeed. You are correct.
What is up with ‘z’ in your code? Why not just write ‘1’?
The (f)ull and true square of ~0UL alias UINT_MAX is not ‘1’, but ~0ULL * ~0ULL or (~1ULL << 32) | 1ULL: (x-1)(x-1) = xx – 2*x + 1. While precision surely does not matter for this MonteCarlo approximation, I don’t want to give a bad or questionable example.
wyhash64 is half-awesome, half-questionable.
It is similar to PCG in that it combines a weak base generator and an output function to create a strong generator. One problem with PCG is that the base generator’s critical path is using MUL+ADD, which takes 4 cycles on Intel. No matter how fast the output function, performance cannot exceed one output per 4 cycles.
wyhash64 uses ADD for the base generator, latency 4 cycles. That’s awesome. Base generator is weaker than an LCG, but we can compensate with a stronger output function. The carry bits still result in some amount of randomness that we can “amplify” by multiplications.
Questionable is the output function. XOR of the low and high multiplication result is not an invertible function, so some outputs are impossible to generate. See for yourself:
long total = 0;
for (int m = 1; m < 65536; m += 2) {
uint8_t hit[65536] = { 0, };
for (int i = 0; i < 65536; i++) {
uint32_t state = i;
state *= m;
state ^= state >> 16;
hit[state & 0xffff]++;
}
int hits = 0;
for (int i = 0; i < 65536; i++)
hits += !!hit[i];
printf(“%5d: %5d\n”, m, hits);
total += hits;
}
printf(“average: %5ld\n”, total / 32768);
Not sure how bad the problem is for 64bit multiplication, but I would expect about half of all 64bit numbers to be absent from the output. Arguably that is bad. Or good, depending on one’s view.
The bad side is pretty obvious. The good side is that collisions in the output functions means random numbers get generated multiple times before the 64bit counter loops. So this PRNG doesn’t fail a birthday paradox test, in spite of having 64bit state and 64bit output.
Assuming you consider the non-invertible output function to be bad, that should be relatively easy to fix. Use regular 64bit multiplication and xorshift.
state ^= state >> 32;
state *= M1;
state ^= state >> 32;
state *= M2;
state ^= state >> 32;
Something like the above should work. It should also be faster on Arm, 32bit x86 and some other architectures. We might get away with a single multiplication if we use a random xorshift like PCG does. I should try some variant and see if they survive practrand.
XOR of the low and high multiplication result is not an invertible function
This is mathematically intuitive.
Based on fairly quick tests, the two fastest output functions that might work (up to several Gigabytes at least) are:
state ^= state >> 32;
state *= M1;
state ^= state >> 32;
state *= M2;
state ^= state >> 32;
and:
old ^= old >> ((old >> 59) + 5);
old *= M1;
old ^= old >> ((old >> 59) + 5);
I don’t think it makes a difference whether you use the same or two different multipliers. You can also make the multiplier and the increment of the base generator the same. In a way, the increment is just another multiplier anyway.
My usual rules for the multiplier are that it has to be odd (obviously) have a popcount of roughly 32 and should avoid long stretches of 0 or 1. Pick your favorite commit hash, skip any hex digits that are 0 or f and ensure the last bit is 1 and you have a good multiplier.
And because the precise multiplier doesn’t matter very much, it is easy to have multiple independent PRNG using different multipliers. Multipliers still have to be checked, but the check is much easier than a multi-dimensional spectral test for an LCG.
It is similar to PCG in that it combines a weak base generator and an output function to create a strong generator.
Better compare it with Fortuna-like designs, which apply a block cipher or hash function to a simple counter: the first part of wyhash64 is an arithmetic progression alias Weyl sequence.
It differs from other generators which use a simple multiplicative (non-cryptographic) hash function by ‘folding’ the high part of the product back into the output.
16 months since this article was published, so here is a small update on this subject. I posted a proposal for a PRNG (tylo64) over at the numpy github in June:
https://github.com/numpy/numpy/issues/16313#issuecomment-641897028
It is heavily inspired by sfc64, but with improvements that makes it even quite a bit faster than wyHash (see below), and at the same time robust for massive parallel usage, due to the added user specified Weyl increment. I.e. each k (odd number) guaratees a unique period of 2^64 (but average ~ 2^94). This feature practically removes the need for a jump function (which sfc64 and tylo64 is missing).
Statistically, I have ran PractRand to 4 TB without issues – also the 32bit variant of it as recommened by Melissa O’Neill. Note also, this implementation does not rely on hardware support for fast multiplication or 128-bit arithmetic, like wyhash.
In numpy, they have now added my proposed Weyl sequence to the original sfc64 (requires 320 bits state): https://bashtage.github.io/randomgen/bit_generators/sfc.html
My proposed PRNG has 256-bit state, and is faster probably because it updates only 196-bit state per number, vs 256-bit in sfc64, as they do essentially the same number and types of operations. Real world Monte Carlo test follows, as suggested by Stefan Kanthak:
// Output on AMD Ryzen 7 2700X cpu:
// tylo64: Pi = 3.14153238: 0.719000 secs
// wyhash64: Pi = 3.14167794: 1.003000 secs
#include <stdint.h>
#include <time.h>
#include <stdio.h>
static uint64_t wyhash64_x;
static uint64_t tylo64_x[4];
uint64_t wyhash64() {
wyhash64_x += 0x60bee2bee120fc15;
__uint128_t tmp;
tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
uint64_t m1 = (tmp >> 64) ^ tmp;
tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
uint64_t m2 = (tmp >> 64) ^ tmp;
return m2;
}
uint64_t tylo64() {
enum {LROT = 24, RSHIFT = 11, LSHIFT = 3};
const uint64_t b = tylo64_x[1], result = tylo64_x[0] ^ (tylo64_x[2] += tylo64_x[3]|1);
tylo64_x[0] = (b + (b << LSHIFT)) ^ (b >> RSHIFT);
tylo64_x[1] = ((b << LROT) | (b >> (64 - LROT))) + result;
return result;
}
#define COMPUTE_PI(prng) { \
hits = N; \
clock_t diff, before = clock(); \
for (unsigned i = hits; i > 0; i--) { \
const uint32_t x = prng(), y = prng(); \
hits -= z - (uint64_t) x * x < (uint64_t) y * y; \
} \
diff = clock() - before; \
printf("Pi = %lu.%06lu: %f secs\n", hits / (N/4), hits % (N/4), (float) diff / CLOCKS_PER_SEC); \
}
int main()
{
size_t N = 400000000, hits;
const uint64_t z = ~0ULL * ~0ULL;
wyhash64_x = tylo64_x[0] = tylo64_x[1] = tylo64_x[2] = tylo64_x[3] = time(NULL); // seed
COMPUTE_PI(tylo64)
COMPUTE_PI(wyhash64)
}
I don’t know wether people from Numpy are aware of that, but in SFC64 the least significant bits for counter-like initialization for different streams are extremely correlated. Try to initalize SFC64 by some fixed s[3] = { 5, 11, 13 } and use weyl additive constans to initalize different streams: 1, 3, 5, 7, … Now we could get uncorrelated results, starting from first ouput. But least significant bits for first outputs for consequtive streams are looping: 1010101… Similarly is in case of higher bits.
PractRand would not detect it, if we are would test only few interleaved streams. But this correlations are there. So we can’t initialize this generator this way if we want to avoid correlations. These correlations quickly disappear as more results are generated. But how many results must be skipped to avoid correlation, or how to initialize the generator to avoid such problems? Nobody took care of it.
I suspect the same problem with your generator.
I might be missing something from reading the article, but why does the wyhash64() C function in the post differ from the linked Swift implementation? They seem like two different algorithms.
https://github.com/lemire/SwiftWyhash/blob/master/Sources/SwiftWyhash/SwiftWyhash.swift
There are different wyhash functions.
Note that I did not invent these functions.
Do they both pass the same statistical tests? It does not appear that the testingRNG repo mentions it or contains results for (the smaller) wyrand, which has become more common in the wild from what I’ve seen (presumably due to increased perf/reduced size).
Have you tested the xoroshiro family? According to Vigna’s results, they are the fastest:
On the other hand, its results differ significantly from this:
https://github.com/lemire/testingRNG
I wonder what is faster, truncated LCG or Vigna’s xoroshiro128+? By the way, why does his results show such advantages on PCG, and according to your results, xoroshiro generators do not have such advantages (relatively)? According to your results, these:
– xoroshiro128plus: 0.83 cycles per byte
– pcg64: 0.97 cycles per byte
are almost the same. According to Vigna results xoroshiro128+ is almost 3 times faster than PCG64.
Not sure if this was xorshift or xoroshiro, but I used one of them and eventually learned the hard way that the PRNG had two sequences of numbers. One of them was covering all numbers except zero, the other was… zero.
That is a rather nasty failure mode. As a naïve user, one of the most natural initialization values for your PRNG state would be zero. Zero also happens to be the one value that leads to catastrophic failure – every random number you generate is zero. Any software where the most natural way of using it leads to catastrophic failure should be avoided. Even if you know the pitfall and make sure to use proper initialization values, someone else will modify your code in the future and expose the catastrophic failure again.
In the PractRand implementation, SFC64 has a seed function which skips 12 outputs, and the author discusses the required number:
“16 outputs skipped – sfc64 is held to a higher standards for seeding than sfc32 because it is rated for more parallel scenarios
no wait, it’s rated the same as sfc32, there’s no reason for 16 rounds, stick to 12”
BTW. I have abandoned my above generator, and now use a variant which is very similar to SFC64, but with an extra 64-bit state to generate 2^63 unique threads, and a different output function. This still needs to skip some initial rounds if the seed is weak. Interestingly, this is as fast or faster than SFC64 with the clang compiler.
uint64_t STC64(uint64_t s[5]) { // s[4] must be odd
enum {LR=24, RS=11, LS=3};
const uint64_t result = (s[0] ^ (s[3] += s[4])) + s[1];
s[0] = s[1] ^ (s[1] >> RS);
s[1] = s[2] + (s[2] << LS);
s[2] = ((s[2] << LR) | (s[2] >> (64 - LR))) + result;
return result;
}
Armv8 has UMULH instruction for getting bits [127:64] of the 64×64 multiplication, so it should be the same amount of the instructions, as on x64.
On x64, you need a single instruction to generate the 128 bits of the product of two 64-bit values. As far as I know, on ARM, you will need to multiplication instructions (mul and umulh) to get the same result, and the umulh instruction is sometimes much slower than the mul instruction.
Total amount of instructions is same, but on the arm it is 3 multiplications: umulh, madd, mul. On x64 it is: mulx, imul, add.
Indeed umulh is slower than mul, but not dramatically. Execution throughputs are 1/4 vs 1/3 on Cortex A57.
From Faster Remainder by Direct Computation:
Quoting directly from the ARM documentation:
That’s correct, but from the same doc: latency for 64×64 MUL is 5 cycles and 3 cycles M pipeline stall. 5 cycles latency does not look much faster than 6.
Cyril… We go from two multiplications (on x64) to three multiplications, including one that is particularly expensive. The instruction count might be the same, but we know not to determine the speed of a piece of code to its instruction count when some instructions are much more expensive than others.
See my follow-up blog post: ARM and Intel have different performance characteristics: a case study in random number generation. I show that wyhash, which is so good on x64, is no longer too great on ARM.
I don’t report the Lehmer’s results in this following blog post… but maybe you will trust me, after reading that other blog post, if I tell you that neither Lehmer’s nor wyhash are competitive on my ARM server.
I’ll be glad to give you access to my ARM server if you want to prove me wrong…
BTW Cortex A75 is better example where umulh is slow. While on A57 it is more or less the same, as regular mul in terms of latency and cpu cycles, on A75 it is exactly 2 times slower, then mul. https://static.docs.arm.com/101398/0200/arm_cortex_a75_software_optimization_guide_v2.pdf (page 16)
Unfortunately I don’t have access to the real A75 hardware and can’t run your tests.
Yes: in Lehmer’s generator, dependency chain is 5 cycles (4 cycles for high-part result of low-parts multiplication, plus 1 cycle for adding the result of high×low multiplication), there’s only one multiply unit, and so with only two independent generators the multiplier is not fully utilized: we’re supplying 4 multiply ops each 5 cycles.
Thus, using 3 independent generators improves the situation (without running into undesirable effects like register spilling).
You are correct. I have updated my blog post and credited you.
For people reading this. The processor can issue a multiplication per cycle but it takes three cycles for it to complete. Hence the number three. We need three streams to keep it busy all the time.
And, of course, if the rest of your application needs the multiply unit, you are in trouble.
(This is specific to current x64 processors.)
Uhm. Your answers to Cyril and me are written as if Lehmer’s generator uses a 64×64 bit multiplication, but the code you show actually needs a 128×64 bit multiplication, and there’s no way to express that as one multiplication instruction, not on AMD64, not on AArch64. Check the assembly generated by GCC: you’ll see one ‘mul’ instruction and one ‘imul’ instruction.
My blog post states:
Once compiled for an x64 processor, the generator boils down to two 64-bit multiplication instructions and one addition.
That’s for x64. Cyril stated that on ARM, you’d have the same number of multiplications. I don’t think that’s correct. For ARM, you are going to need something like three multiplications, one of which will be UMULH, which is slow on the ARM machines I know.
As for my answer to your comment, as long as you have enough registers, it should be obvious that you can issue one multiplication per cycle if you have three generators because of the three cycle latency (assuming you have a superscalar processor with sufficient width to fit in the loads and the adds). It is also obvious from your analysis that two generators are not enough.
My bad — misinterpreted what you said the first time around. Sorry. On AArch64 there should be one more mul (so three in total rather than two as on AMD64).
I still think your follow-up to my comment was somewhat confusing: with two generators, you have four independent multiplications, which is enough to cover the basic three-cycle latency. The problem is that dependency cycle is not 3 ticks (and I suspect I miscalculated earlier and just from nominal latencies the critical cycle is 4 ticks, not 5).
According to Agner Fog, we have the following on recent Intel processors… for register inputs, you have 3 cycles of latency for 64-bit multiplications (one extra cycle for 32-bit but it is irrelevant here).
Of course, if one of the inputs is in memory, then the latency is expected to be longer… but let us ignore this for the time being.
You have a dependency chain of at least 5 cycles as per your analysis. Maybe you want to argue that the dependency chain is longer, fine… but let us agree that it is at least 5 cycles. In 5 cycles, you do two multiplications. So that’s one multiplication every 5/2 = 2.5 cycles. Intel processors can do one multiplication every cycle.
I think it follows that we need to have more than two generators to keep the multiplication unit fully occupied.
Now this is a bit rough because we are making the implicit assumption that we are strictly limited by the multiplication. We probably need to do a better job analyzing the flow of instructions to really understand what is going on.
https://www.agner.org/optimize/instruction_tables.pdf
I don’t see any reason to believe that 3 independently seeded generators would be unsafe compared to using a single generator. If anything, it could be slightly better since there is 3x the state and 3 uncorrelated streams rather than 1.
Of course, I guess you could just bundle up your 3x generator and put it through the available tests like big crush.
That aside, once you accept multiple generators and so effectively measure throughput rather than value-to-value latency, won’t the fastest approach be SIMD? That changes the playing field dramatically (e.g., there aren’t even any 64-bit input multiplications available up to and including AVX2).
I don’t see any reason to believe that 3 independently seeded generators would be unsafe (…)
In this instance, I also cannot imagine what the problem might be, but I thought I’d be prudent and urge people to check before they adopt this!
won’t the fastest approach be SIMD?
Yeah but I excluded SIMD deliberately here.
I also did not discuss latency.
Maybe it got lost in an edit, because another commenter quotes “Can you do better without any specialized instructions” but I don’t see that now.
One could argue that the rules of the game are a bit unclear: for example, when you use 2 or 3 Lehmer generators, you unroll the loop by 2 or 3 and call each in turn in the “user” code – so it’s not really implementing the RNG API, but rather an end-user optimization using separate generators which may be easy or hard to use in practice depending on the situation.
To keep the same API you could wrap the 2 or 3 generators up into one that gives the usual interface and alternates among the generators using an internal flag. In principle a compiler could even undo this packaging as an optimization, if it unrolled the loop body by the right amount and compiled away the control flow – but I think it is unlikely with today’s compilers in most scenarios.
You are correct that I wasn’t clear as to the rules of the game. But we both know that SIMD could kill these scalar functions, as long as the compiler can’t autovectorize well.
Auto-vectorization is SIMD, no?
Well I would say it isn’t totally obvious that SIMD will do much better here since both presented approaches use wide integer multiplications which is a weak spot for x86 SIMD, which doesn’t offer any 64-bit input multiplications at all.
… but yeah I guess SIMD would win perhaps with something like xorshift variant.
When I refer to SIMD, I include auto-vectorization, always. In my comment above, I should qualify… with “programmer’s deliberate use of SIMD”.
Don’t guess, just test it!
On x86-32, 64-bit PRNGs which don’t need a 64*64-bit multiplication or division run a BIT faster when implemented using MMX (or SSE).
If you can run PE32/PE32+ executables: fetch MMX, 32-bit and 64-bit implementation; these run the PRNGs from NOMSVCRT in a tight loop. For a more realistic test, fetch MMX, 32-bit and 64-bit implementation: these run the same PRNGs with the MonteCarlo approximation I already mentioned here.
Right, but which can pass Big Crush from TestU01?
Testing a 64-bit PRNG with TestU01 is NO GOOD IDEA, better use Practically Random instead.
A properly implemented 64-bit PRNG produces the same sequence on x32 and x64 (or any other processor architecture), with or without the use of MMX/SSE/Neon/AVX.
Bob Jenkin’s Small Fast prng64(), the enhanced and true middle-square msqr64(), the hashed arithmetic progression drbg64() (same as Daniel’s adaption of wyprng()), the scrambled MCG smcg64() pass full runs of PractRand up to 32TB.
The square-with-carry I mentioned here passes at least up to 4TB.
Testing a 64-bit PRNG with TestU01 is NO GOOD IDEA, better use Practically Random instead.
How can it possibly be a bad idea? I would agree that you should use Practically Random, but what harm may come from running Big Crush… other than climate change due to CPU heat?
Though I agree that PractRand seems both more modern and possibly all around better… it is certainly easier from a software engineering point of view… TestU01 is the one benchmark we have that is backed by years of peer-reviewed research… and its Big Crush is a “de facto” standard. It is also a static and determined target as long as you describe how you ran the tests (you need to be specific because Big Crush is essentially a 31-bit test).
It’s no good idea (I didn’t say bad idea) because TestU01 was designed for 32-bit numbers, so you need to take quite some precautions or apply spoon-feeding when testing 64-bit PRNGs.