30th June 2016, 25 min read

Fast random shuffling

35 thoughts on “Fast random shuffling”

Eric ArnebÃ¤ck says:

June 30, 2016 at 6:29 am

Nice article!

Do you know if it is possible to implement an even faster shuffle on the GPU, using something like CUDA?
1. Daniel Lemire says:
  
  June 30, 2016 at 1:56 pm
  
  It is important to distinguish a random shuffle from a “fair” random shuffle as discussed. For a fair random shuffle, I do not know a good way to benefit from GPU execution.
2. Joseph Hoover says:
  
  February 11, 2018 at 11:32 pm
  
  it is not because you cannot parallize routines that depend on shared memory if that shared memory is being used by other threads for swapping routines
Demofox says:

June 30, 2016 at 2:28 pm

Appologies for double posting this, but it applies here too in a way that might not be obvious! On the “Picking N distinct numbers at random” post i mention you can use “format preserving encryption” to get those N distinct numbers. You can actually also use it to get distinct numbers [0,N) which means that the numbers you get out are the index for a shuffle. Since encryption requires a key, it’s seedable to generate multiple shuffles.

In other words, you can use format preserving encryption to make a shuffle iterator that visits each item once and only once, and tells you when you have visited all items. You can do this in O(1), without actually shuffling the list or even storing the whole list in memory at once!

If you care about cache friendliness, it (by itself?) isn’t the way to go, but it can be very good for a very quick, virtually storageless shuffle.

http://blog.demofox.org/2013/07/06/fast-lightweight-random-shuffle-functionality-fixed/
1. Daniel Lemire says:
  
  June 30, 2016 at 2:38 pm
  
  Yes, that’s an interesting approach.
  
  As noted at the beginning of my blog post, it is remarkably difficult to come up with algorithms that produce a fair shuffle. You have to demonstrate that every possible permutation is equally likely. If you drop this fairness requirement, then various faster alternatives become possible.
George Kangas says:

June 30, 2016 at 6:32 pm

About that piece of C code: in line 3, is “newpos” supposed to be “p”?
1. Daniel Lemire says:
  
  June 30, 2016 at 6:45 pm
  
  Thanks for paying attention. Yes. It was a typo.
Marc Reynolds says:

June 30, 2016 at 9:33 pm

I don’t understand the:
“no map from all 32-bit integers to a range can be perfectly fair”

If the source is assumed to be uniform and the mapping to a smaller range is unbiased then what is not “fair” about the result?
1. Marc Reynolds says:
  
  June 30, 2016 at 9:48 pm
  
  Ah! Nevermind you’re saying in general..mapping a single value to the smaller range. I find the wording off…I was assuming you were talking about a rejection method.
Jim Apple says:

July 1, 2016 at 2:23 pm

Have you considered submitting patches for libstdc++, libc++, glibc, etc.?
1. Daniel Lemire says:
  
  July 1, 2016 at 2:49 pm
  
  @Jim
  
  Does glibc have a function providing a fair random shuffle?
  
  As for the C++ libraries, that could be interesting.
  1. Daniel Lemire says:
    
    July 1, 2016 at 2:50 pm
    
    Of course, one problem that one can encounter in such cases is that standard libraries may depend on really slow random number generators (for good or bad reasons). That can make the point irrelevant since the bottleneck is not in the shuffle function itself.
  2. Jim Apple says:
    
    July 3, 2016 at 4:11 pm
    
    strfry isn’t fair, but it is random.
Erich says:

July 25, 2016 at 11:20 pm

Is there a similar approach for generating doubles in [0;1[?

Java generates a 53 bit long, then does a division by (double)(1L << 53).
1. Daniel Lemire says:
  
  July 25, 2016 at 11:43 pm
  
  Are you sure that a division is computed by the CPU? I suspect that Java might be smart enough to turn the division into a multiplication.
  1. Erich says:
    
    July 26, 2016 at 5:48 am
    
    Not that I know of – as far as I know, it can improve Java performance to write *0.5 instead of /2 if it is a highly critical path (it may not often make a difference because of cache wait times). In general 1/N is usually no longer exact representable by a double; so this transformation only works for powers of two without loss IIRC.
    
    I plan on doing some benchmarks with JMH; but if you have some other ideas, I’d be happy to know.
    1. Daniel Lemire says:
      
      July 26, 2016 at 1:13 pm
      
      What would prevent the compiler from turning “/ (double)(1L << 53)" into " * (1 /(double)(1L << 53))" where " (1 /(double)(1L << 53))" can be computed at compile time?
      1. Erich says:
        
        July 27, 2016 at 9:17 am
        
        The Java compiler is not meant to do any such optimization – I believe by design they want the compiler to only translate into bytecode, and have HotSpot to do such optimizations. But I don’t think it ever does division to mulitplication optimizations. I believe ProGuard may be able to do such though at the bytecode level.
        
        In general, it will not be possible: if I’m not mistaken /3 and *(1/3) may not produce the exact same results due to numerical issues. (powers of 2 are; so for above case it is possible – but I could not find such transformations in hotspot).
        
        Daniel Lemire says:
        
        July 27, 2016 at 2:39 pm
        
        I think you are correct that the Java compiler won’t do this optimization (though, admittedly, I did not check). But I do not think it is difficult as clang appears to do it automagically with C code when dividing by a power of two.
        
        This being said, Java does benefit from this optimization in the end because it is hand coded in ThreadLocalRandom (in the Java 8 version I checked).
        
        I suspect that you got â€œ/ (double)(1L << 53)" by looking at the Random class, but that's an hopelessly slow class, almost by design. Please see my other blog post on this topic: http://lemire.me/blog/2016/02/01/default-random-number-generators-are-slow/
        
        So I think it does answer your original query, does it not?
        
        Erich says:
        
        July 27, 2016 at 4:17 pm
        
        Yes, in particular at compile time this should be rather easy to do. I can imagine that some of these easy things just add up too much when you do this at runtime.
        
        But from initial benchmarks, I may be wrong and HotSpot does optimize this to no measureable difference; or CPU pipelineing makes up for these differences.
        
        Daniel Lemire says:
        
        July 27, 2016 at 5:18 pm
        
        CPU pipelineing makes up for these differences
        
        Your benchmark might have other bottlenecks… but a division is massively more expensive than a multiplication so it has to show in the end result if you don’t have much else going on. Pipelining does not help much with the division because it has both a low throughput and a low latency.
        
        My own test seems to indicate that Java is smart enough to turn a division by a power of two into a multiplication:
        Benchmark Mode Samples Score Error Units m.l.m.r.Division.division thrpt 5 115439397.412 Â± 65683.557 ops/s m.l.m.r.Division.divisionBy3 thrpt 5 60568277.782 Â± 13316.776 ops/s m.l.m.r.Division.divisionBy3ThroughMultiplication thrpt 5 115443682.953 Â± 91673.580 ops/s m.l.m.r.Division.precompDivision thrpt 5 115446809.914 Â± 75424.232 ops/s
        
        https://github.com/lemire/microbenchmarks/blob/master/src/main/java/me/lemire/microbenchmarks/random/Division.java
        
        In this particular instance, we see that the throughput is diminished by a factor of two. I would expect a larger difference, but my microbenchmark has limitations.
Damian Vuleta says:

September 10, 2016 at 2:15 pm

“What can we expect to limit the speed of this algorithm? Let me assume that we do not use fancy SIMD instructions or parallelization.”

The Fisher-Yates algorithm appears to me to be inherently sequential. How might parallelisation or SIMD be applicable here (apart from the random number generation)?
1. Daniel Lemire says:
  
  September 10, 2016 at 4:33 pm
  
  I think that you are correct. Fisher-Yates appears inherently sequential.
  1. Damian Vuleta says:
    
    September 12, 2016 at 12:12 pm
    
    Which makes me wonder: how many other shuffling algorithms which might be parallelisable (is that a word?) are as thorough, and anywhere near as simple, as Fisher-Yates?
    1. Daniel Lemire says:
      
      September 12, 2016 at 1:03 pm
      
      You can easily parallelize shuffling in the following manner. During a first pass, send each of the values to one of N buckets, by picking a random integer in [0,N). Then shuffle each bucket using Fisher-Yates. Finally, aggregate the shuffled buckets: take all values from the first bucket, then all values from the second bucket and so forth.
      
      One major defect of such an alternative is that it is not in-place.
      1. Damian Vuleta says:
        
        July 25, 2021 at 10:15 am
        
        You were quite right:
        https://blog.janestreet.com/how-to-shuffle-a-big-dataset/
Pauli says:

December 26, 2016 at 8:20 pm

Parallel shuffle can be done inplace using partitions (equal size balanced “tree”, unbalanced remainder requires separate shuffling). Partitions are shuffled with Fisher-Yates in parallel. After that one can use random element swap merge to combine partitions with reducing number of threads. Merge can be optimized with simd with store mask or blend.
Partitions improve cache utilization for data sets that are larger than cache. While merge with linear memory access pattern should use memory bandwidth more efficiently than random access shuffle. This will be slower for small data sets when cache provides enough bandwidth to make Fisher-Yates very fast.

But after reading your numbers I’m pretty happy with my scalar shuffle because it can shuffle 52 elements (64 bit ints, aka a pack of cards) and reduce to 4 mask elements in 220-240 cycles. I’m currently using lcg generator and multiplicative bound reduction including bitwise-and optimization for power of two ranges. Removing division was the largest optimization (2200 cycles to 450 cycles).

Second largest improvement was taking stack a temporary stack copy of random number generator (450->280). Without generator copy c++ compiler generated loads and stores for each random number. Loads and stores removed fairly few instructions but those were bottlenecks allowing average instructions per cycle raise from 2.8 to 3.4.

I end up here looking for simd replacement for Fisher-Yates after my attempt to use vpermq to shuffle 4 element partitions resulted to slower code than scalar shuffle. sse/avx permutations have problem that permutation selecting must be immediate value. Immediate value force slow jump table conversion from random number to permutation operation. Result was about 25% reduction to instruction count but 60% reduction to IPC.
1. Daniel Lemire says:
  
  December 26, 2016 at 11:29 pm
  
  @Pauli
  
  Thanks for sharing your numbers. Very interesting.
  
  Parallel shuffle can be done inplace using partitions (equal size balanced â€œtreeâ€, unbalanced remainder requires separate shuffling).
  
  My blog post is concerned with a “fair” random shuffles where all possible permutations (N!) must be equally likely. If you divide up the inputs into fixed-sized blocks, shuffle them, and remerge, you will not get a fair shuffle. That might be fine, depending on your needs, but if you allow biased shuffles, it opens up many opportunities, in general, depending on the needed quality.
  1. fewer herb says:
    
    November 16, 2018 at 1:20 am
    
    This is a bit late, but a fair “merge” shuffle is possible: https://arxiv.org/abs/1508.03167
Albert Chan says:

May 27, 2018 at 7:04 pm

What is the “no division, no bias” method actually called ?
If I googled for it, what should be entered ?
1. Daniel Lemire says:
  
  May 29, 2018 at 1:39 am
  If you need a formal reference, please see:
  - Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation (to appear)
JJShao says:

April 17, 2022 at 4:24 pm

Is the distribution unbias when “leftover >= range” (without div)?
I mean, take range = 5 & bit length = 4 for example, when random4bit range from 0 to 15, we have:
0, 4, 7, 10, 13 is rejected, 1-3 return 0, 5-6 return 1, 8-9 return 2, 11-12 return 3, 14-15 return 4.
which means 0 has 3/16 chance to return without div, but 1-4 has 2/16 chance to return without div, and 5/16 chance we need 1 div(which afterwards is unbias).
Isn’t this bias?
1. Daniel Lemire says:
  
  April 17, 2022 at 5:59 pm
  
  Please refer to the manuscript for the mathematical analysis https://arxiv.org/abs/1805.10941
  1. JJShao says:
    
    April 18, 2022 at 4:34 am
    
    I’ve read that. But while the inner “while” loop does satisfied the lemma 4.1, the outer “if” does not since range != 2^L mod range.
    1. Daniel Lemire says:
      
      April 18, 2022 at 3:39 pm
      
      The algorithm is correct. It provides unbiased results.