Daniel Lemire's blog


Testing non-cryptographic random number generators: my results

22 thoughts on “Testing non-cryptographic random number generators: my results”

  1. M.E. O'Neill says:

    Hi Daniel,

    I’m glad that you’re working on PRNG testing. Not enough people are! But I think you need to distinguish between systemic failures (which have really low p-values, very close to zero) and occasional blips where there is an “unlikely” p-value. Blips must happen from time to time. Three runs is ample to show a repeatable systemic failure, but not nearly enough to say that a blip occurs too often; you’ll most likely be seeing false patterns in random noise.

    1. Marc Reynolds says:

      Another option is to use Adaptive Crush, which, if it sees a questionable p-value on a given test, increases the number of samples and reruns that test. It “rinses and repeats” up to a limit, looking for a pass/fail.

      (http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ADAPTIVE/)
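
      For illustration, here is a minimal Python sketch of that adaptive rinse-and-repeat strategy. The thresholds and the run_test callback are assumptions for the example (loosely following the suspicious/clear-failure cutoffs quoted from the TestU01 manual below), not Adaptive Crush’s actual interface:

      ```python
      import random

      def adaptive_test(run_test, n_samples, max_rounds=5,
                        fail_at=1e-10, suspect_at=1e-3):
          """run_test(n) is assumed to return one p-value from a test on n samples."""
          for _ in range(max_rounds):
              p = run_test(n_samples)
              if p < fail_at or p > 1.0 - fail_at:
                  return "fail"        # extreme p-value: clear, systemic failure
              if suspect_at < p < 1.0 - suspect_at:
                  return "pass"        # unremarkable p-value: nothing to chase
              n_samples *= 2           # suspicious blip: grow the sample and rerun
          return "suspicious"          # never resolved within the round limit

      # Toy usage: a "perfect" source whose p-values are genuinely uniform.
      print(adaptive_test(lambda n: random.random(), 1_000_000))
      ```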

  2. M.E. O'Neill says:

    I emailed Daniel and pointed out to him that he was erroneously repeating the same test with the same seeds. I’m pleased to see he has updated his post.

    I think it’s important, though, to be clear about how to interpret the p-value results from TestU01. Here’s a quote from the TestU01 manual

    “When a p-value is extremely close to 0 or to 1 (for example, if it is less than 10^−10), one can obviously conclude that the generator fails the test. If the p-value is suspicious but failure is not clear enough, (p = 0.0005, for example), then the test can be replicated independently until either failure becomes obvious or suspicion disappears (i.e., one finds that the suspect p-value was obtained only by chance). This approach is possible because there is no limit (other than CPU time) on the amount of data that can be produced by a RNG to increase the sample size and the power of the test.”

    Thus the proper terminology for many of the p-values Daniel got is *suspicious*. As L’Ecuyer notes, any time you get a suspicious p-value, you are obligated to investigate. Arguably the current code in Daniel’s test suite doesn’t do this part well enough, because he just reruns the whole test three times. That’s not enough investigation!

    As the author of PCG, if someone thinks they see an issue with it, I have an obligation to investigate. I have looked into the specific suspicious p-values Daniel observed for specific tests, and here are the results I get repeating those same tests:

    Running CollisionOver (BigCrush Test #9) on pcg32 twenty times, I see these p-values: 0.05811, 0.144, 0.9842, 0.797, 0.9561, 0.4523, 0.3093, 0.1909, 0.9385, 0.6832, 0.9418, 0.1008, 0.06458, 0.1211, 0.4523, 0.6231, 0.03508, 0.9561, 0.2548, 0.8652, which averages out to be 0.4964.

    Running Run, r = 0 (BigCrush Test #38) on the byte-reversed (not the usual bit-reversed) least significant 32 bits of pcg64, twenty times, I see these p-values: 0.992, 0.8454, 0.08798, 0.6355, 0.8053, 0.1355, 0.06824, 0.25, 0.1758, 0.4459, 0.5009, 0.4907, 0.8229, 0.6984, 0.3183, 0.4428, 0.6964, 0.8723, 0.0706, 0.7646, which averages out to be 0.506.

    Thus, we can conclude that Daniel was just seeing some normal variation, not an actual problem.

    Likewise, Daniel’s claims about TestU01 finding flaws in SplitMix are similarly false; the results are merely what we normally see in thorough testing where many tests are applied. Some of the failures he reports for Vigna’s generators are also not indicators of serious problems.

    I’ve only just started (re)checking SplitMix to make sure it gets a clean bill of health too, but once it’s done I’ll post again with a pastebin link for all the results.

    Even though Daniel made a few mistakes (and gave me a bit more excitement than I was expecting this morning!), I applaud the idea of more widespread testing of PRNGs in general. We’ve already seen that mistakes are easy to make, and in that context, we really want multiple sources of independent validation, not just PRNG authors posting results. (You might hope that if the work has been formally published, that peer-reviewers would do this kind of work, but independently replicating someone’s results, and checking over their code for bugs, is above and beyond what typically happens in that context.)

    1. Thanks for your comments. I will be updating my results this week.

      1. M.E. O'Neill says:

        Here are links to my retest results for PCG and SplitMix:

        * pcg32 — https://pastebin.com/g9Fe2KdZ
        * pcg64 — https://pastebin.com/MTsGm9uN
        * splitmix — https://pastebin.com/ut80JMMz and https://pastebin.com/gPP4NA88

        Like the earlier retesting I reported for PCG, these retests show, as expected, that no real flaw had been discovered in SplitMix via simple TestU01 testing.

        (Regarding the statistical quality of SplitMix, however: it only produces each output *once* (when used as a 64-bit PRNG). Once you’ve seen a given 64-bit value, you’ll never see it again. That means it fails a really simple 64-bit birthday-problem test, due to lack of repeats. But to actually observe this issue, you need either lots of time or lots of memory, because you must remember every 64-bit value you have seen so far. I have a custom tester to check for plausible behavior, and it uses 100 GB of RAM to perform the test. Obviously that’s not a test everyone can run (although you can use smaller sizes if you’re willing to wait longer), but for SplitMix, you don’t actually have to run the test to know that it’ll fail, since the once-only property is an integral part of the design. FWIW, Vigna’s PRNGs, pcg64, pcg64_fast, a 128-bit MCG, and std::mt19937_64 all pass a simple 64-bit birthday test. As with any statistical issue, you might decide it’s not important for your application. And in some specialized situations, the once-only property is actually desirable, which is why PCG reluctantly offers some variants (explicitly labelled with the suffix _once_insecure) that provide it.)
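
        For readers who want to see the mechanics of such a birthday test, here is a minimal Python sketch. The reduced default width k = 32 is an assumption for illustration (the full k = 64 test needs the kind of memory described above); note that at reduced width SplitMix itself would pass, since only its full 64-bit outputs are repeat-free:

        ```python
        import random

        def count_repeats(next_u64, k=32, n=1 << 20):
            """next_u64() is a stand-in for the 64-bit PRNG under test."""
            mask = (1 << k) - 1
            seen = set()
            repeats = 0
            for _ in range(n):
                v = next_u64() & mask      # keep only the low k bits
                if v in seen:
                    repeats += 1
                seen.add(v)
            return repeats

        # A healthy generator shows roughly n*(n-1)/2^(k+1) repeats
        # (about 128 with these defaults); far fewer repeats is suspicious.
        print(count_repeats(lambda: random.getrandbits(64)))
        ```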

        (It’s also the case that Daniel tested Vigna’s cut-down version of SplitMix, which drops its multi-stream functionality and the split() function that inspires its name.)

        1. Marc Reynolds says:

          “Once you’ve seen a given 64-bit value, you’ll never see it again.”

          Unless I’m missing the point, I see this as unimportant, since for a generator with period 2^64 (or 2^64−1) we can only expect good statistical quality with sample sizes below sqrt(2^64) = 2^32 for tests not related to birthday spacings, and no more than (2^64)^(1/3) ≈ 2,642,246 samples for birthday-spacing tests.

          1. M.E. O'Neill says:

            If it really were the case that you could only draw 2^32 values from SplitMix, that would itself be a problem, because on a modern machine we can produce 2^32 values in well under 10 seconds. A modern PRNG that is only good for fewer than 10 seconds of output would be questionable at best.

            The old rule that you can only expect about sqrt(period) good values from a PRNG does not hold for many (perhaps most) modern PRNGs, and especially not for cryptographically secure ones based on hashing a counter (e.g., Random123). The period of those generators is set by the width of their internal counter, and they should produce randomness that passes statistical tests all the way until the counter wraps around.
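
            To make the hash-a-counter design concrete, here is a minimal Python sketch of SplitMix64 itself, using the widely published constants from Vigna’s splitmix64.c: the state is nothing but a counter advanced by a fixed odd increment, and all the statistical quality comes from the output mixer. Because the mixer is a bijection on 64-bit values, each output appears exactly once per period, which is exactly the once-only property discussed above.

            ```python
            MASK = (1 << 64) - 1  # emulate 64-bit wraparound arithmetic

            def splitmix64(seed):
                """Yield the SplitMix64 output sequence for a given seed."""
                state = seed & MASK
                while True:
                    state = (state + 0x9E3779B97F4A7C15) & MASK        # advance the counter
                    z = state
                    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK  # mix (hash) the counter
                    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
                    yield z ^ (z >> 31)                                # 64-bit output

            rng = splitmix64(12345)
            print(next(rng), next(rng))
            ```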

            SplitMix passes PractRand’s statistical tests up to at least 64 TB of output (that’s as far as I’ve run it so far). That is 2048× the 32 GB failure point we would expect if it failed at the square root of its period (sqrt(2^64) = 2^32 outputs of 8 bytes each, i.e., 32 GB).

            In general, though, I think we can agree that using a 64-bit PRNG with 2^64 period to produce 64-bit outputs is clearly problematic: there will be birthday issues far sooner than the period might have you think.

            In contrast, if you use SplitMix as a 32-bit PRNG, everything will seem as it should.

            1. Marc Reynolds says:

              And the cube-root-of-period rule is only an LCG thing… bad me.

              Oh, and thanks for putting the effort into writing a paper accessible to a wide audience. Certainly practitioners appreciate the extra effort.

        2. Thank you. I will rerun everything from scratch. I tried patching up my results with a partial rerun, but it is too easy to mislabel files when proceeding this way. I will simply capture the exact output of the script on a single machine. This will hopefully clear up all the confusion.

          Thanks again!

    2. “Running CollisionOver (BigCrush Test #9) on pcg32 twenty times, I see these p-values: 0.05811, 0.144, 0.9842, 0.797, 0.9561, 0.4523, 0.3093, 0.1909, 0.9385, 0.6832, 0.9418, 0.1008, 0.06458, 0.1211, 0.4523, 0.6231, 0.03508, 0.9561, 0.2548, 0.8652, which averages out to be 0.4964.

      Running Run, r = 0 (BigCrush Test #38) on the byte-reversed (not the usual bit-reversed) least significant 32 bits of pcg64, twenty times, I see these p-values: 0.992, 0.8454, 0.08798, 0.6355, 0.8053, 0.1355, 0.06824, 0.25, 0.1758, 0.4459, 0.5009, 0.4907, 0.8229, 0.6984, 0.3183, 0.4428, 0.6964, 0.8723, 0.0706, 0.7646, which averages out to be 0.506.”

      The mean is a fairly useless metric for repeated tests. What you want is a goodness-of-fit test against a U(0, 1) distribution.

      p for the first series of p-values (Test #9) [using Octave’s implementation of the one-sided KS test, kolmogorov_smirnov_test(x, "unif", 0, 1)] is ~0.476; nothing suspicious.

      p for the second series (Test #38) is 0.992, which would at least lead me to run that particular test more times.
      (Anderson-Darling gives a somewhat high but not very high p-value for the second series: 0.96.)
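
      For anyone who wants to replicate this kind of uniformity check without Octave, here is a minimal sketch using SciPy on the twenty CollisionOver p-values quoted above. (SciPy’s kstest is two-sided by default, so its p-value will not exactly match the one-sided Octave number.)

      ```python
      from scipy import stats

      # The twenty CollisionOver (Test #9) p-values quoted above.
      pvalues = [0.05811, 0.144, 0.9842, 0.797, 0.9561, 0.4523, 0.3093,
                 0.1909, 0.9385, 0.6832, 0.9418, 0.1008, 0.06458, 0.1211,
                 0.4523, 0.6231, 0.03508, 0.9561, 0.2548, 0.8652]

      # One-sample Kolmogorov-Smirnov test against the U(0, 1) CDF.
      # A tiny KS p-value would mean the TestU01 p-values are not plausibly
      # uniform, i.e., the generator (or test setup) deserves a closer look.
      result = stats.kstest(pvalues, "uniform")
      print(result.statistic, result.pvalue)
      ```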

  3. Alois Bauer says:

    I don’t get it: JavaDoc says “Instances of SplittableRandom are not cryptographically secure. Consider instead using SecureRandom in security-sensitive applications.” – so what is your point?
    Crack SecureRandom and you will get my attention.
    And about “widespread”: Please show me any relevant code using SplittableRandom in Java….

    1. “Instances of SplittableRandom are not cryptographically secure. Consider instead using SecureRandom in security-sensitive applications.”

      That is correct.

      “so what is your point?”

      I’m not sure what you are asking.

      Maybe you think it is a post about cryptography?

      My blog post is specifically about “non-cryptographic RNGs”; please see the title. The tests I run are specifically designed for non-cryptographic RNGs.

      “Crack SecureRandom and you will get my attention.”

      Again, this post has nothing whatsoever to do with cryptography. We test non-cryptographic random number generators.

      “Please show me any relevant code using SplittableRandom in Java”

      SplittableRandom is part of the core Java API. It is difficult to be more widespread.

  4. Curious how the Mersenne Twister compares to the above (later?) random number generators. Maybe not important.

    1. The Mersenne Twister passes PractRand, but I don’t have the results yet from BigCrush. From what I read elsewhere, it fails. You can certainly run the tests yourself if you are curious: I encourage you to do so.

      1. M.E. O'Neill says:

        The Mersenne Twister did not fail in Daniel’s test because he told PractRand to stop after 64 GB, which is about half an hour of testing. By default, PractRand tests up to 32 TB, which obviously takes considerably longer.

        The 32-bit Mersenne Twister will fail at 256 GB of output and its 64-bit version will fail at 512 GB of output (which takes three hours to reach). It fails PractRand’s BRank test. You can find sample output showing it fail in the “How to Test with PractRand” tutorial on my blog.

        So, a more precise answer from Daniel would be to say that it didn’t fail when he tested it, but he only ran a limited test. Saying that it “passes PractRand” (in general) is false; it does not.

        [FWIW, a 74-bit Lehmer PRNG fails PractRand at about the same point as the Mersenne Twister (yes, I meant 74, not 64).]

        1. That’s correct. It only passes the test as configured in the repository: one seed, and only 64 GB. This does not catch all possible failures.

  5. M.E. O'Neill says:

    Hey Daniel,

    Looks like testingRNG/testu01/results in your GitHub repo still has files from when you were accidentally reusing the same seed for every run (perhaps because your test script dumps result files in testingRNG/testu01, and not in testingRNG/testu01/results?).

    In particular, the results for testpcg64-S848432-b-r.log and testpcg64-S987654-b-r.log are just duplicates of testpcg64-b-r.log. When you correct this issue, the actual runs for these seeds are completely free of any blips and the “run failures” you think you’re seeing for pcg64 disappear.

    1. M.E. O'Neill says:

      Also, I suspect that the reason you thought SplitMix failed was because your Xorshift RNG misidentified itself as SplitMix. If you remove those misattributions, I think SplitMix will come out okay.

      1. My script deliberately does not overwrite files in the results subdirectory (that would be inconvenient as they are checked into the repository). However, it is correct that there were issues with these archived log files. So, as stated in a previous comment, and in the revised blog post, I am re-running everything from scratch to hopefully get a clean and fully reproducible result dump. Thanks for warning me about these problems.

      2. You are probably right about SplitMix, and once I get the final results, I might do a blog post specifically about SplitMix.

  6. Lin says:

    Are you aware of any generators which pass TestU01 that use 32-bit arithmetic exclusively? That is, composed of the integer type uint32_t. Basically something like xoshiro128** but one that actually passes TestU01. PCG seems to be exclusively 64-bit, and 64-bit arithmetic is notoriously cumbersome and slow in JavaScript. Alea was made for JS, and supposedly passes BigCrush, but no one seems to be testing it or to have heard of it.

    1. Lin: You can read the original TestU01 article that introduced Crush, BigCrush, etc.

      P. L’Ecuyer and R. Simard, “TestU01: A C Library for Empirical Testing of Random Number Generators”, ACM Transactions on Mathematical Software, 33, 4, Article 22, August 2007, 40 pages.

      It contains a large table with test results for many 32-bit generators. You will find there that the Mersenne Twister fails linearity tests in BigCrush. But there are also generators that pass all the tests and (more importantly) are also supported by solid theory.