If it’s really a matter of prediction / prefetch, then generate e.g., 256 numbers and then perform 256 swaps. I dashed off the (C) program to test that theory, so it’s likely that I botched something, but doing just that seems to get (or slightly beat) the speed of the precomputed shuffle with a fixed overhead of extra memory: https://emergent.unpythonic.net/files/sandbox/shuffy.c
doing just that seems to get (or slightly beat) the speed of the precomputed shuffle with a fixed overhead of extra memory
I see what you are doing. Let me be clearer: my precomputed approach excludes the generation of the random numbers, so it is not a random shuffle.
I am also not claiming that it is practically useful. It is a demonstration. So I do not really care about memory usage.
Jannesays:
I wonder what the reason is for starting at the end of the array and looping backwards? I suspect that’s slightly less efficient than a forward loop since the CPU is more likely to be able to predict memory accesses in the latter case.
Thanks for the clarification. I’m glad I misunderstood, though. I think my benchmark finding is interesting in its own right.
There was at least one implementation error in my program (negative random numbers were erroneously being generated, leading to out of bounds memory accesses).
I updated the above link with a new version. I still measure version 3 (which alternates generating indices and swapping entries in blocks) to be the faster alternative compared to the traditional algorithm or the one that pregenerates ALL swap indices.
I updated the above link with a new version. I still measure version 3 (which alternates generating indices and swapping entries in blocks) to be the faster alternative compared to the traditional algorithm or the one that pregenerates ALL swap indices.
That might well be true.
Marc Reynoldssays:
FWIW Aside: For an LCG mod should be avoided since the low-bits are garbage and generally mod will introduce bias when range is not a power-of-two (although probably not worth thinking about in this case)
Sean O'Connorsays:
If you use a fast random number generator like one of these: xoroshiro
Or the RDRAND assembly language instruction.
Then with the Fisher–Yates shuffle algorithm for sure most of the time is spent waiting for memory.
To get numbers between 0 and n quick you can multiply a 32 bit random number by n+1 to get a 64 bit product and shift right 32 bits. 64 bit random, 128 bit product is also possible and better.
Also possible for very large arrays is to replace the butterfly operations in an FFT/WHT with butterfly random swaps, though I think that would have the same problem as the naive shuffle with nonuniform distribution of permutations.
If you use a fast random number generator like one of these: xoroshiro Or the RDRAND assembly language instruction.
The Java random number generator I used in these examples is a simple linear congruential generator. It is hard to be much faster. I expect RDRAND to be far slower.
Then with the Fisher–Yates shuffle algorithm for sure most of the time is spent waiting for memory.
Do you have some benchmark numbers?
To get numbers between 0 and n quick you can multiply a 32 bit random number by n+1 to get a 64 bit product and shift right 32 bits. 64 bit random, 128 bit product is also possible and better.
I’m not sure the implications are as straightforward as suggested in this post – it must highly depend on the RNG.
1.9s / 100M elements on a 2 GHz CPU is 38 cycles/element; 0.8s is 16 cycles. 20 cycles per RNG operation seems excessive, although I’m not sure what builtin Java RNG is doing – and moreover you can’t get an L3 miss or even an L2 miss to complete in 16 cycles. What seems to be more likely to me is that somehow your RNG is limiting the amount of swap instructions that can get pipelined, thus the same amount of L2/3 misses take more time because fewer of them are issuing concurrently. Alternatively your RNG is slow 🙂
For a more accurate model of what’s going on I’d suggest separately timing filling the indirection array and reordering the actual data.
Here’s a quick C++ program; I’ve copied PCG32 from the PCG web site and the bounded RNG function from your previous blog post, I may have introduced off by one errors in some cases. I’m also including a forward Fisher Yates shuffle (copied from Wikipedia more or less).
The numbers that I get are very different from yours:
shuf 2.195379 s, gen 0.338417 s, shuf-gen 2.731224 s, fwd shuf 2.227119 s
shuf 2.167099 s, gen 0.122208 s, shuf-gen 2.870520 s, fwd shuf 2.314143 s
shuf 2.269147 s, gen 0.122214 s, shuf-gen 2.845593 s, fwd shuf 2.367242 s
shuf 2.270375 s, gen 0.122476 s, shuf-gen 2.867398 s, fwd shuf 2.314249 s
shuf 2.258742 s, gen 0.120076 s, shuf-gen 2.863602 s, fwd shuf 2.328118 s
shuf 2.265792 s, gen 0.126285 s, shuf-gen 2.923946 s, fwd shuf 2.356490 s
shuf 2.264414 s, gen 0.122542 s, shuf-gen 2.880007 s, fwd shuf 2.365302 s
You can see that running the RNG is an order of magnitude faster than shuffling – which is to be expected! – and pre-computing doesn’t help.
One slightly odd part is that the precomputed shuffle is slower.
Using % range instead of (*range)>>32 doesn’t change this significantly:
shuf 2.178649 s, gen 0.472895 s, shuf-gen 2.286348 s, fwd shuf 2.317720 s
shuf 2.186121 s, gen 0.253171 s, shuf-gen 2.286875 s, fwd shuf 2.242044 s
shuf 2.634489 s, gen 0.251526 s, shuf-gen 2.319564 s, fwd shuf 2.079921 s
shuf 2.155671 s, gen 0.250882 s, shuf-gen 2.167186 s, fwd shuf 2.028764 s
shuf 2.106639 s, gen 0.253934 s, shuf-gen 2.246153 s, fwd shuf 2.216789 s
shuf 2.179187 s, gen 0.263325 s, shuf-gen 2.218668 s, fwd shuf 2.288851 s
(note that the first run has gen always running slower than subsequent runs – this is expected since this is the run that pagefaults when writing to freshly allocated memory).
ugh, the second set of numbers was taken in a different power mode where the clocks were somewhat higher; the correct run went as follows:
/mnt/c/work $ g++ -O2 rngshuf.cpp && ./a.out
shuf 2.501412 s, gen 0.489999 s, shuf-gen 3.500264 s, fwd shuf 3.093110 s
shuf 2.473130 s, gen 0.259736 s, shuf-gen 2.852467 s, fwd shuf 2.476890 s
shuf 2.495287 s, gen 0.294554 s, shuf-gen 2.757949 s, fwd shuf 2.401504 s
shuf 2.398867 s, gen 0.252998 s, shuf-gen 2.830588 s, fwd shuf 2.467871 s
shuf 2.415531 s, gen 0.255211 s, shuf-gen 2.830087 s, fwd shuf 2.483296 s
(so slightly slower than fast bounded RNG but still reasonably close)
Mesays:
Given the mismatch between the C results and the Java results – where the precomputed random version was a lot faster, which would mean that memory accesses are cheaper than random generation – I think it is necessary to look at the generated assembler for both. Otherwise, it is just speculation what is going on here.
g++ -O3 -o rngshuf rngshuf.cpp && ./rngshuf
Reproducing the Java numbers from blog post https://lemire.me/blog/2018/03/24/when-shuffling-large-arrays-how-much-time-can-be-attributed-to-random-number-generation/#comments
Caveat: we use PCG instead of the LCG from Java.
java-bound PCG shuffle 1.623154 s
precomp shuffle 0.824944 s
Arseny did not try to reproduce the same algorithms, he is benchmarking something different. He shows that you can make the shuffle fast if you use a fast ranged number generator. That’s a different point.
Note also that Arseny’s numbers are somewhat high which suggests that he is not testing on a server PC configured for testing…
Iwan Zotowsays:
Frankly, even benchmark is a bit useless. Your precomputed arrays in taking up RAM and, what is more important, cache. In real life there is bunch of code consuming shuffled array, and it might want to use this cache. Xoroshiro+128 is very fast, takes likely no more than one cache line, and in real life which is NOT measured here, not allocating/using precomputed random number might be a big win for consumers of the shuffled array.
Frankly, even benchmark is a bit useless. Your precomputed arrays in taking up RAM and, what is more important, cache. In real life there is bunch of code consuming shuffled array, and it might want to use this cache.
The precomputed version is not meant to be used to shuffle arrays, it is meant to answer the question “When shuffling large arrays, how much time can be attributed to random number generation?” To answer this question, we precompute the random numbers.
Xoroshiro+128 is very fast, takes likely no more than one cache line, and in real life which is NOT measured here, not allocating/using precomputed random number might be a big win for consumers of the shuffled array.
Xoroshiro+128 is not going to be faster than the default thread-local random number generator. Java uses a linear congruential generator. It is very fast.
Is your interest in the implications for Java, or the implications for CPU HW? Because I find myself agreeing with “Me”; without access to the assembly it’s hard to know what is going on. Is the pipeline of java compiler+JIT inlining both the nextInt() RNG call and the swap()? Does it inline the swap() in the second case but not in the first (because there is some heuristic that functions with function calls in their arguments are not inlined or whatever).
Even if inlining is ideal in both cases, there’s question of exactly how many instructions the RNG results in. Under ideal circumstances we want the critical item, the queue that fills up first, to be (I assume) the L2 MSHRs, resulting in maximal memory level parallelism.
It’s possible (hard to tell without assembly) that some earlier in-core queue fills up before the MSHRs under the “generate RNG” condition.
Is your interest in the implications for Java, or the implications for CPU HW? Because I find myself agreeing with “Meâ€; without access to the assembly it’s hard to know what is going on.
I have posted both the Java and C code, and we get the same results. I posted the assembly resulting from the C code as a gist:
It’s possible (hard to tell without assembly) that some earlier in-core queue fills up before the MSHRs under the “generate RNG†condition.
Now that I have posted the assembly, can you answer the question?
Iwan Zotowsays:
What would be interesting to test is how it behaves in macro-benchmark – when there is MT consumer of shuffled array, which wants all memory and cache lines for itself. Only in micro-benchmark like yours, 100M arrays (400Mb) is free, not using and polluting L3/L2/L1.
KWilletssays:
This type of cache miss overhead is also common with string sorting, where a list of pointers is dereferenced in (increasingly) random order.
I’m not aware of any memory prediction schemes that automatically prefetch vectors of references like this, but it seems like it can be done with explicit prefetch instructions.
Oren Tiroshsays:
The LCG may be simple and fast, but it imposes a pipeline tight pipeline dependency. Try using a number of LCGs in round robin fashion, where you use the previous output of the LCG before updating it.
Alternatively, you can alternate between random precomputation and shuffling. Use precomputed chunks that are large enough to be a good approximation of a long continuous run yet short enough to fit in the L1 cache. The difference in timing with and without the shuffling should be a good microbenchmark. The only remaining overhead would be that of sequentially fetching values for a small buffer in the CPU cache.
Alternatively, you can alternate between random precomputation and shuffling.
I believe that this the idea described by Jeff Epler (see previous comments). After it was tested by Richard Startin in Java, I updated my GitHub repo.
If it’s really a matter of prediction / prefetch, then generate e.g., 256 numbers and then perform 256 swaps. I dashed off the (C) program to test that theory, so it’s likely that I botched something, but doing just that seems to get (or slightly beat) the speed of the precomputed shuffle with a fixed overhead of extra memory: https://emergent.unpythonic.net/files/sandbox/shuffy.c
doing just that seems to get (or slightly beat) the speed of the precomputed shuffle with a fixed overhead of extra memory
I see what you are doing. Let me be clearer: my precomputed approach excludes the generation of the random numbers, so it is not a random shuffle.
I am also not claiming that it is practically useful. It is a demonstration. So I do not really care about memory usage.
I wonder what the reason is for starting at the end of the array and looping backwards? I suspect that’s slightly less efficient than a forward loop since the CPU is more likely to be able to predict memory accesses in the latter case.
I suspect that’s slightly less efficient than a forward loop since the CPU is more likely to be able to predict memory accesses in the latter case.
I don’t expect that to be true on recent Intel processors. They are devilishly good at predicting access patterns.
Thanks for the clarification. I’m glad I misunderstood, though. I think my benchmark finding is interesting in its own right.
There was at least one implementation error in my program (negative random numbers were erroneously being generated, leading to out of bounds memory accesses).
I updated the above link with a new version. I still measure version 3 (which alternates generating indices and swapping entries in blocks) to be the faster alternative compared to the traditional algorithm or the one that pregenerates ALL swap indices.
That might well be true.
FWIW Aside: For an LCG mod should be avoided since the low-bits are garbage and generally mod will introduce bias when range is not a power-of-two (although probably not worth thinking about in this case)
If you use a fast random number generator like one of these:
xoroshiro
Or the RDRAND assembly language instruction.
Then with the Fisher–Yates shuffle algorithm for sure most of the time is spent waiting for memory.
To get numbers between 0 and n quick you can multiply a 32 bit random number by n+1 to get a 64 bit product and shift right 32 bits. 64 bit random, 128 bit product is also possible and better.
Also possible for very large arrays is to replace the butterfly operations in an FFT/WHT with butterfly random swaps, though I think that would have the same problem as the naive shuffle with nonuniform distribution of permutations.
Also off topic there is this:
Similarity Alignment
If you use a fast random number generator like one of these: xoroshiro Or the RDRAND assembly language instruction.
The Java random number generator I used in these examples is a simple linear congruential generator. It is hard to be much faster. I expect RDRAND to be far slower.
Then with the Fisher–Yates shuffle algorithm for sure most of the time is spent waiting for memory.
Do you have some benchmark numbers?
To get numbers between 0 and n quick you can multiply a 32 bit random number by n+1 to get a 64 bit product and shift right 32 bits. 64 bit random, 128 bit product is also possible and better.
Yes, see https://lemire.me/blog/2016/06/30/fast-random-shuffling/
I’m pretty sure than ThreadLocalRandom is a cascade of xorshift-multiplies.
I’m not sure the implications are as straightforward as suggested in this post – it must highly depend on the RNG.
1.9s / 100M elements on a 2 GHz CPU is 38 cycles/element; 0.8s is 16 cycles. 20 cycles per RNG operation seems excessive, although I’m not sure what builtin Java RNG is doing – and moreover you can’t get an L3 miss or even an L2 miss to complete in 16 cycles. What seems to be more likely to me is that somehow your RNG is limiting the amount of swap instructions that can get pipelined, thus the same amount of L2/3 misses take more time because fewer of them are issuing concurrently. Alternatively your RNG is slow 🙂
For a more accurate model of what’s going on I’d suggest separately timing filling the indirection array and reordering the actual data.
Here’s a quick C++ program; I’ve copied PCG32 from the PCG web site and the bounded RNG function from your previous blog post, I may have introduced off by one errors in some cases. I’m also including a forward Fisher Yates shuffle (copied from Wikipedia more or less).
https://gist.github.com/zeux/cd41829e610fd3ce33e4ca9a16a16293
The numbers that I get are very different from yours:
shuf 2.195379 s, gen 0.338417 s, shuf-gen 2.731224 s, fwd shuf 2.227119 s
shuf 2.167099 s, gen 0.122208 s, shuf-gen 2.870520 s, fwd shuf 2.314143 s
shuf 2.269147 s, gen 0.122214 s, shuf-gen 2.845593 s, fwd shuf 2.367242 s
shuf 2.270375 s, gen 0.122476 s, shuf-gen 2.867398 s, fwd shuf 2.314249 s
shuf 2.258742 s, gen 0.120076 s, shuf-gen 2.863602 s, fwd shuf 2.328118 s
shuf 2.265792 s, gen 0.126285 s, shuf-gen 2.923946 s, fwd shuf 2.356490 s
shuf 2.264414 s, gen 0.122542 s, shuf-gen 2.880007 s, fwd shuf 2.365302 s
You can see that running the RNG is an order of magnitude faster than shuffling – which is to be expected! – and pre-computing doesn’t help.
One slightly odd part is that the precomputed shuffle is slower.
Using % range instead of (*range)>>32 doesn’t change this significantly:
shuf 2.178649 s, gen 0.472895 s, shuf-gen 2.286348 s, fwd shuf 2.317720 s
shuf 2.186121 s, gen 0.253171 s, shuf-gen 2.286875 s, fwd shuf 2.242044 s
shuf 2.634489 s, gen 0.251526 s, shuf-gen 2.319564 s, fwd shuf 2.079921 s
shuf 2.155671 s, gen 0.250882 s, shuf-gen 2.167186 s, fwd shuf 2.028764 s
shuf 2.106639 s, gen 0.253934 s, shuf-gen 2.246153 s, fwd shuf 2.216789 s
shuf 2.179187 s, gen 0.263325 s, shuf-gen 2.218668 s, fwd shuf 2.288851 s
(note that the first run has gen always running slower than subsequent runs – this is expected since this is the run that pagefaults when writing to freshly allocated memory).
ugh, the second set of numbers was taken in a different power mode where the clocks were somewhat higher; the correct run went as follows:
/mnt/c/work $ g++ -O2 rngshuf.cpp && ./a.out
shuf 2.501412 s, gen 0.489999 s, shuf-gen 3.500264 s, fwd shuf 3.093110 s
shuf 2.473130 s, gen 0.259736 s, shuf-gen 2.852467 s, fwd shuf 2.476890 s
shuf 2.495287 s, gen 0.294554 s, shuf-gen 2.757949 s, fwd shuf 2.401504 s
shuf 2.398867 s, gen 0.252998 s, shuf-gen 2.830588 s, fwd shuf 2.467871 s
shuf 2.415531 s, gen 0.255211 s, shuf-gen 2.830087 s, fwd shuf 2.483296 s
(so slightly slower than fast bounded RNG but still reasonably close)
Given the mismatch between the C results and the Java results – where the precomputed random version was a lot faster, which would mean that memory accesses are cheaper than random generation – I think it is necessary to look at the generated assembler for both. Otherwise, it is just speculation what is going on here.
I modified Arseny’s code so that it follows the algorithms of the blog post and I get very similar numbers:
Arseny did not try to reproduce the same algorithms, he is benchmarking something different. He shows that you can make the shuffle fast if you use a fast ranged number generator. That’s a different point.
Note also that Arseny’s numbers are somewhat high which suggests that he is not testing on a server PC configured for testing…
Frankly, even benchmark is a bit useless. Your precomputed arrays in taking up RAM and, what is more important, cache. In real life there is bunch of code consuming shuffled array, and it might want to use this cache. Xoroshiro+128 is very fast, takes likely no more than one cache line, and in real life which is NOT measured here, not allocating/using precomputed random number might be a big win for consumers of the shuffled array.
Frankly, even benchmark is a bit useless. Your precomputed arrays in taking up RAM and, what is more important, cache. In real life there is bunch of code consuming shuffled array, and it might want to use this cache.
The precomputed version is not meant to be used to shuffle arrays, it is meant to answer the question “When shuffling large arrays, how much time can be attributed to random number generation?” To answer this question, we precompute the random numbers.
Xoroshiro+128 is very fast, takes likely no more than one cache line, and in real life which is NOT measured here, not allocating/using precomputed random number might be a big win for consumers of the shuffled array.
Xoroshiro+128 is not going to be faster than the default thread-local random number generator. Java uses a linear congruential generator. It is very fast.
See https://github.com/lemire/testingRNG
Is your interest in the implications for Java, or the implications for CPU HW? Because I find myself agreeing with “Me”; without access to the assembly it’s hard to know what is going on. Is the pipeline of java compiler+JIT inlining both the nextInt() RNG call and the swap()? Does it inline the swap() in the second case but not in the first (because there is some heuristic that functions with function calls in their arguments are not inlined or whatever).
Even if inlining is ideal in both cases, there’s question of exactly how many instructions the RNG results in. Under ideal circumstances we want the critical item, the queue that fills up first, to be (I assume) the L2 MSHRs, resulting in maximal memory level parallelism.
It’s possible (hard to tell without assembly) that some earlier in-core queue fills up before the MSHRs under the “generate RNG” condition.
I have posted both the Java and C code, and we get the same results. I posted the assembly resulting from the C code as a gist:
https://gist.github.com/lemire/d2047ce1e3b511c54bb47b779a3028f5
Now that I have posted the assembly, can you answer the question?
What would be interesting to test is how it behaves in macro-benchmark – when there is MT consumer of shuffled array, which wants all memory and cache lines for itself. Only in micro-benchmark like yours, 100M arrays (400Mb) is free, not using and polluting L3/L2/L1.
This type of cache miss overhead is also common with string sorting, where a list of pointers is dereferenced in (increasingly) random order.
I’m not aware of any memory prediction schemes that automatically prefetch vectors of references like this, but it seems like it can be done with explicit prefetch instructions.
The LCG may be simple and fast, but it imposes a pipeline tight pipeline dependency. Try using a number of LCGs in round robin fashion, where you use the previous output of the LCG before updating it.
Alternatively, you can alternate between random precomputation and shuffling. Use precomputed chunks that are large enough to be a good approximation of a long continuous run yet short enough to fit in the L1 cache. The difference in timing with and without the shuffling should be a good microbenchmark. The only remaining overhead would be that of sequentially fetching values for a small buffer in the CPU cache.
I believe that this the idea described by Jeff Epler (see previous comments). After it was tested by Richard Startin in Java, I updated my GitHub repo.
See https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2018/03/23/src/main/java/me/lemire/microbenchmarks/algorithms/Shuffle.java#L32-L48
It works well. I call it “blocked”. Here are the results…