It is important to distinguish a random shuffle from a “fair” random shuffle as discussed. For a fair random shuffle, I do not know a good way to benefit from GPU execution.

it is not because you cannot parallize routines that depend on shared memory if that shared memory is being used by other threads for swapping routines

Appologies for double posting this, but it applies here too in a way that might not be obvious! On the “Picking N distinct numbers at random” post i mention you can use “format preserving encryption” to get those N distinct numbers. You can actually also use it to get distinct numbers [0,N) which means that the numbers you get out are the index for a shuffle. Since encryption requires a key, it’s seedable to generate multiple shuffles.

In other words, you can use format preserving encryption to make a shuffle iterator that visits each item once and only once, and tells you when you have visited all items. You can do this in O(1), without actually shuffling the list or even storing the whole list in memory at once!

If you care about cache friendliness, it (by itself?) isn’t the way to go, but it can be very good for a very quick, virtually storageless shuffle.

As noted at the beginning of my blog post, it is remarkably difficult to come up with algorithms that produce a fair shuffle. You have to demonstrate that every possible permutation is equally likely. If you drop this fairness requirement, then various faster alternatives become possible.

George Kangassays:

About that piece of C code: in line 3, is “newpos” supposed to be “p”?

I don’t understand the:
“no map from all 32-bit integers to a range can be perfectly fair”

If the source is assumed to be uniform and the mapping to a smaller range is unbiased then what is not “fair” about the result?

Marc Reynoldssays:

Ah! Nevermind you’re saying in general..mapping a single value to the smaller range. I find the wording off…I was assuming you were talking about a rejection method.

Jim Applesays:

Have you considered submitting patches for libstdc++, libc++, glibc, etc.?

Of course, one problem that one can encounter in such cases is that standard libraries may depend on really slow random number generators (for good or bad reasons). That can make the point irrelevant since the bottleneck is not in the shuffle function itself.

Jim Applesays:

strfry isn’t fair, but it is random.

Erichsays:

Is there a similar approach for generating doubles in [0;1[?

Java generates a 53 bit long, then does a division by (double)(1L << 53).

Are you sure that a division is computed by the CPU? I suspect that Java might be smart enough to turn the division into a multiplication.

Erichsays:

Not that I know of – as far as I know, it can improve Java performance to write *0.5 instead of /2 if it is a highly critical path (it may not often make a difference because of cache wait times). In general 1/N is usually no longer exact representable by a double; so this transformation only works for powers of two without loss IIRC.

I plan on doing some benchmarks with JMH; but if you have some other ideas, I’d be happy to know.

What would prevent the compiler from turning “/ (double)(1L << 53)" into " * (1 /(double)(1L << 53))" where " (1 /(double)(1L << 53))" can be computed at compile time?

Erichsays:

The Java compiler is not meant to do any such optimization – I believe by design they want the compiler to only translate into bytecode, and have HotSpot to do such optimizations. But I don’t think it ever does division to mulitplication optimizations. I believe ProGuard may be able to do such though at the bytecode level.

In general, it will not be possible: if I’m not mistaken /3 and *(1/3) may not produce the exact same results due to numerical issues. (powers of 2 are; so for above case it is possible – but I could not find such transformations in hotspot).

I think you are correct that the Java compiler won’t do this optimization (though, admittedly, I did not check). But I do not think it is difficult as clang appears to do it automagically with C code when dividing by a power of two.

This being said, Java does benefit from this optimization in the end because it is hand coded in ThreadLocalRandom (in the Java 8 version I checked).

So I think it does answer your original query, does it not?

Erichsays:

Yes, in particular at compile time this should be rather easy to do. I can imagine that some of these easy things just add up too much when you do this at runtime.

But from initial benchmarks, I may be wrong and HotSpot does optimize this to no measureable difference; or CPU pipelineing makes up for these differences.

Your benchmark might have other bottlenecks… but a division is massively more expensive than a multiplication so it has to show in the end result if you don’t have much else going on. Pipelining does not help much with the division because it has both a low throughput and a low latency.

My own test seems to indicate that Java is smart enough to turn a division by a power of two into a multiplication:
Benchmark Mode Samples Score Error Units
m.l.m.r.Division.division thrpt 5 115439397.412 Â± 65683.557 ops/s
m.l.m.r.Division.divisionBy3 thrpt 5 60568277.782 Â± 13316.776 ops/s
m.l.m.r.Division.divisionBy3ThroughMultiplication thrpt 5 115443682.953 Â± 91673.580 ops/s
m.l.m.r.Division.precompDivision thrpt 5 115446809.914 Â± 75424.232 ops/s

In this particular instance, we see that the throughput is diminished by a factor of two. I would expect a larger difference, but my microbenchmark has limitations.

Damian Vuletasays:

“What can we expect to limit the speed of this algorithm? Let me assume that we do not use fancy SIMD instructions or parallelization.”

The Fisher-Yates algorithm appears to me to be inherently sequential. How might parallelisation or SIMD be applicable here (apart from the random number generation)?

I think that you are correct. Fisher-Yates appears inherently sequential.

Damian Vuletasays:

Which makes me wonder: how many other shuffling algorithms which might be parallelisable (is that a word?) are as thorough, and anywhere near as simple, as Fisher-Yates?

You can easily parallelize shuffling in the following manner. During a first pass, send each of the values to one of N buckets, by picking a random integer in [0,N). Then shuffle each bucket using Fisher-Yates. Finally, aggregate the shuffled buckets: take all values from the first bucket, then all values from the second bucket and so forth.

One major defect of such an alternative is that it is not in-place.

Parallel shuffle can be done inplace using partitions (equal size balanced “tree”, unbalanced remainder requires separate shuffling). Partitions are shuffled with Fisher-Yates in parallel. After that one can use random element swap merge to combine partitions with reducing number of threads. Merge can be optimized with simd with store mask or blend.
Partitions improve cache utilization for data sets that are larger than cache. While merge with linear memory access pattern should use memory bandwidth more efficiently than random access shuffle. This will be slower for small data sets when cache provides enough bandwidth to make Fisher-Yates very fast.

But after reading your numbers I’m pretty happy with my scalar shuffle because it can shuffle 52 elements (64 bit ints, aka a pack of cards) and reduce to 4 mask elements in 220-240 cycles. I’m currently using lcg generator and multiplicative bound reduction including bitwise-and optimization for power of two ranges. Removing division was the largest optimization (2200 cycles to 450 cycles).

Second largest improvement was taking stack a temporary stack copy of random number generator (450->280). Without generator copy c++ compiler generated loads and stores for each random number. Loads and stores removed fairly few instructions but those were bottlenecks allowing average instructions per cycle raise from 2.8 to 3.4.

I end up here looking for simd replacement for Fisher-Yates after my attempt to use vpermq to shuffle 4 element partitions resulted to slower code than scalar shuffle. sse/avx permutations have problem that permutation selecting must be immediate value. Immediate value force slow jump table conversion from random number to permutation operation. Result was about 25% reduction to instruction count but 60% reduction to IPC.

Thanks for sharing your numbers. Very interesting.

Parallel shuffle can be done inplace using partitions (equal size balanced â€œtreeâ€, unbalanced remainder requires separate shuffling).

My blog post is concerned with a “fair” random shuffles where all possible permutations (N!) must be equally likely. If you divide up the inputs into fixed-sized blocks, shuffle them, and remerge, you will not get a fair shuffle. That might be fine, depending on your needs, but if you allow biased shuffles, it opens up many opportunities, in general, depending on the needed quality.

Is the distribution unbias when “leftover >= range” (without div)?
I mean, take range = 5 & bit length = 4 for example, when random4bit range from 0 to 15, we have:
0, 4, 7, 10, 13 is rejected, 1-3 return 0, 5-6 return 1, 8-9 return 2, 11-12 return 3, 14-15 return 4.
which means 0 has 3/16 chance to return without div, but 1-4 has 2/16 chance to return without div, and 5/16 chance we need 1 div(which afterwards is unbias).
Isn’t this bias?

Eric ArnebÃ¤cksays:Nice article!

Do you know if it is possible to implement an even faster shuffle on the GPU, using something like CUDA?

Daniel Lemiresays:It is important to distinguish a random shuffle from a “fair” random shuffle as discussed. For a fair random shuffle, I do not know a good way to benefit from GPU execution.

Joseph Hooversays:it is not because you cannot parallize routines that depend on shared memory if that shared memory is being used by other threads for swapping routines

Demofoxsays:Appologies for double posting this, but it applies here too in a way that might not be obvious! On the “Picking N distinct numbers at random” post i mention you can use “format preserving encryption” to get those N distinct numbers. You can actually also use it to get distinct numbers [0,N) which means that the numbers you get out are the index for a shuffle. Since encryption requires a key, it’s seedable to generate multiple shuffles.

In other words, you can use format preserving encryption to make a shuffle iterator that visits each item once and only once, and tells you when you have visited all items. You can do this in O(1), without actually shuffling the list or even storing the whole list in memory at once!

If you care about cache friendliness, it (by itself?) isn’t the way to go, but it can be very good for a very quick, virtually storageless shuffle.

http://blog.demofox.org/2013/07/06/fast-lightweight-random-shuffle-functionality-fixed/

Daniel Lemiresays:Yes, that’s an interesting approach.

As noted at the beginning of my blog post, it is remarkably difficult to come up with algorithms that produce a fair shuffle. You have to demonstrate that every possible permutation is equally likely. If you drop this fairness requirement, then various faster alternatives become possible.

George Kangassays:About that piece of C code: in line 3, is “newpos” supposed to be “p”?

Daniel Lemiresays:Thanks for paying attention. Yes. It was a typo.

Marc Reynoldssays:I don’t understand the:

“no map from all 32-bit integers to a range can be perfectly fair”

If the source is assumed to be uniform and the mapping to a smaller range is unbiased then what is not “fair” about the result?

Marc Reynoldssays:Ah! Nevermind you’re saying in general..mapping a single value to the smaller range. I find the wording off…I was assuming you were talking about a rejection method.

Jim Applesays:Have you considered submitting patches for libstdc++, libc++, glibc, etc.?

Daniel Lemiresays:@Jim

Does

glibchave a function providing a fair random shuffle?As for the C++ libraries, that could be interesting.

Daniel Lemiresays:Of course, one problem that one can encounter in such cases is that standard libraries may depend on really slow random number generators (for good or bad reasons). That can make the point irrelevant since the bottleneck is not in the shuffle function itself.

Jim Applesays:strfry isn’t fair, but it is random.

Erichsays:Is there a similar approach for generating doubles in [0;1[?

Java generates a 53 bit long, then does a division by (double)(1L << 53).

Daniel Lemiresays:Are you sure that a division is computed by the CPU? I suspect that Java might be smart enough to turn the division into a multiplication.

Erichsays:Not that I know of – as far as I know, it can improve Java performance to write *0.5 instead of /2 if it is a highly critical path (it may not often make a difference because of cache wait times). In general 1/N is usually no longer exact representable by a double; so this transformation only works for powers of two without loss IIRC.

I plan on doing some benchmarks with JMH; but if you have some other ideas, I’d be happy to know.

Daniel Lemiresays:What would prevent the compiler from turning “/ (double)(1L << 53)" into " * (1 /(double)(1L << 53))" where " (1 /(double)(1L << 53))" can be computed at compile time?

Erichsays:The Java compiler is not meant to do any such optimization – I believe by design they want the compiler to only translate into bytecode, and have HotSpot to do such optimizations. But I don’t think it ever does division to mulitplication optimizations. I believe ProGuard may be able to do such though at the bytecode level.

In general, it will not be possible: if I’m not mistaken /3 and *(1/3) may not produce the exact same results due to numerical issues. (powers of 2 are; so for above case it is possible – but I could not find such transformations in hotspot).

Daniel Lemiresays:I think you are correct that the Java compiler won’t do this optimization (though, admittedly, I did not check). But I do not think it is difficult as clang appears to do it automagically with C code when dividing by a power of two.

This being said, Java does benefit from this optimization in the end because it is hand coded in ThreadLocalRandom (in the Java 8 version I checked).

I suspect that you got â€œ/ (double)(1L << 53)" by looking at the Random class, but that's an hopelessly slow class, almost by design. Please see my other blog post on this topic: http://lemire.me/blog/2016/02/01/default-random-number-generators-are-slow/

So I think it does answer your original query, does it not?

Erichsays:Yes, in particular at compile time this should be rather easy to do. I can imagine that some of these easy things just add up too much when you do this at runtime.

But from initial benchmarks, I may be wrong and HotSpot does optimize this to no measureable difference; or CPU pipelineing makes up for these differences.

Daniel Lemiresays:CPU pipelineing makes up for these differencesYour benchmark might have other bottlenecks… but a division is massively more expensive than a multiplication so it has to show in the end result if you don’t have much else going on. Pipelining does not help much with the division because it has both a low throughput and a low latency.

My own test seems to indicate that Java is smart enough to turn a division by a power of two into a multiplication:

Benchmark Mode Samples Score Error Units

m.l.m.r.Division.division thrpt 5 115439397.412 Â± 65683.557 ops/s

m.l.m.r.Division.divisionBy3 thrpt 5 60568277.782 Â± 13316.776 ops/s

m.l.m.r.Division.divisionBy3ThroughMultiplication thrpt 5 115443682.953 Â± 91673.580 ops/s

m.l.m.r.Division.precompDivision thrpt 5 115446809.914 Â± 75424.232 ops/s

https://github.com/lemire/microbenchmarks/blob/master/src/main/java/me/lemire/microbenchmarks/random/Division.java

In this particular instance, we see that the throughput is diminished by a factor of two. I would expect a larger difference, but my microbenchmark has limitations.

Damian Vuletasays:“What can we expect to limit the speed of this algorithm? Let me assume that we do not use fancy SIMD instructions or parallelization.”

The Fisher-Yates algorithm appears to me to be inherently sequential. How might parallelisation or SIMD be applicable here (apart from the random number generation)?

Daniel Lemiresays:I think that you are correct. Fisher-Yates appears inherently sequential.

Damian Vuletasays:Which makes me wonder: how many other shuffling algorithms which might be parallelisable (is that a word?) are as thorough, and anywhere near as simple, as Fisher-Yates?

Daniel Lemiresays:You can easily parallelize shuffling in the following manner. During a first pass, send each of the values to one of N buckets, by picking a random integer in [0,N). Then shuffle each bucket using Fisher-Yates. Finally, aggregate the shuffled buckets: take all values from the first bucket, then all values from the second bucket and so forth.

One major defect of such an alternative is that it is not in-place.

Damian Vuletasays:You were quite right:

https://blog.janestreet.com/how-to-shuffle-a-big-dataset/

Paulisays:Parallel shuffle can be done inplace using partitions (equal size balanced “tree”, unbalanced remainder requires separate shuffling). Partitions are shuffled with Fisher-Yates in parallel. After that one can use random element swap merge to combine partitions with reducing number of threads. Merge can be optimized with simd with store mask or blend.

Partitions improve cache utilization for data sets that are larger than cache. While merge with linear memory access pattern should use memory bandwidth more efficiently than random access shuffle. This will be slower for small data sets when cache provides enough bandwidth to make Fisher-Yates very fast.

But after reading your numbers I’m pretty happy with my scalar shuffle because it can shuffle 52 elements (64 bit ints, aka a pack of cards) and reduce to 4 mask elements in 220-240 cycles. I’m currently using lcg generator and multiplicative bound reduction including bitwise-and optimization for power of two ranges. Removing division was the largest optimization (2200 cycles to 450 cycles).

Second largest improvement was taking stack a temporary stack copy of random number generator (450->280). Without generator copy c++ compiler generated loads and stores for each random number. Loads and stores removed fairly few instructions but those were bottlenecks allowing average instructions per cycle raise from 2.8 to 3.4.

I end up here looking for simd replacement for Fisher-Yates after my attempt to use vpermq to shuffle 4 element partitions resulted to slower code than scalar shuffle. sse/avx permutations have problem that permutation selecting must be immediate value. Immediate value force slow jump table conversion from random number to permutation operation. Result was about 25% reduction to instruction count but 60% reduction to IPC.

Daniel Lemiresays:@Pauli

Thanks for sharing your numbers. Very interesting.

Parallel shuffle can be done inplace using partitions (equal size balanced â€œtreeâ€, unbalanced remainder requires separate shuffling).My blog post is concerned with a “fair” random shuffles where all possible permutations (N!) must be equally likely. If you divide up the inputs into fixed-sized blocks, shuffle them, and remerge, you will not get a fair shuffle. That might be fine, depending on your needs, but if you allow biased shuffles, it opens up many opportunities, in general, depending on the needed quality.

fewer herbsays:This is a bit late, but a fair “merge” shuffle is possible: https://arxiv.org/abs/1508.03167

Albert Chansays:What is the “no division, no bias” method actually called ?

If I googled for it, what should be entered ?

Daniel Lemiresays:If you need a formal reference, please see:

JJShaosays:Is the distribution unbias when “leftover >= range” (without div)?

I mean, take range = 5 & bit length = 4 for example, when random4bit range from 0 to 15, we have:

0, 4, 7, 10, 13 is rejected, 1-3 return 0, 5-6 return 1, 8-9 return 2, 11-12 return 3, 14-15 return 4.

which means 0 has 3/16 chance to return without div, but 1-4 has 2/16 chance to return without div, and 5/16 chance we need 1 div(which afterwards is unbias).

Isn’t this bias?

Daniel Lemiresays:Please refer to the manuscript for the mathematical analysis https://arxiv.org/abs/1805.10941

JJShaosays:I’ve read that. But while the inner “while” loop does satisfied the lemma 4.1, the outer “if” does not since range != 2^L mod range.

Daniel Lemiresays:The algorithm is correct. It provides unbiased results.