Daniel Lemire's blog

When shuffling large arrays, how much time can be attributed to random number generation?

22 thoughts on “When shuffling large arrays, how much time can be attributed to random number generation?”

  1. Jeff Epler says:

    If it’s really a matter of prediction / prefetch, then generate, e.g., 256 numbers and then perform 256 swaps. I dashed off a C program to test that theory, so it’s likely that I botched something, but doing just that seems to match (or slightly beat) the speed of the precomputed shuffle, with a fixed overhead of extra memory: https://emergent.unpythonic.net/files/sandbox/shuffy.c

    1. doing just that seems to match (or slightly beat) the speed of the precomputed shuffle, with a fixed overhead of extra memory

      I see what you are doing. Let me be clearer: my precomputed approach excludes the generation of the random numbers, so it is not a random shuffle.

      I am also not claiming that it is practically useful. It is a demonstration. So I do not really care about memory usage.

      1. Janne says:

        I wonder what the reason is for starting at the end of the array and looping backwards? I suspect that’s slightly less efficient than a forward loop since the CPU is more likely to be able to predict memory accesses in the latter case.

        1. I suspect that’s slightly less efficient than a forward loop since the CPU is more likely to be able to predict memory accesses in the latter case.

          I don’t expect that to be true on recent Intel processors. They are devilishly good at predicting access patterns.
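
          For reference, the loop under discussion runs backward: the textbook Fisher–Yates shuffle swaps each element with a uniformly chosen element at or before it. A minimal C sketch, where random_bounded stands in for any function returning a uniform integer in [0, n) (and the array length is assumed to fit in 32 bits):

          #include <stddef.h>
          #include <stdint.h>

          /* assumed: returns a uniform integer in [0, n) */
          extern uint32_t random_bounded(uint32_t n);

          /* Textbook (backward) Fisher-Yates: each element is swapped with
             a uniformly chosen element at or before it. */
          void shuffle(uint32_t *arr, size_t size) {
              for (size_t i = size; i > 1; i--) {
                  size_t j = random_bounded((uint32_t)i); /* j in [0, i) */
                  uint32_t tmp = arr[i - 1];
                  arr[i - 1] = arr[j];
                  arr[j] = tmp;
              }
          }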

      2. Jeff Epler says:

        Thanks for the clarification. I’m glad I misunderstood, though. I think my benchmark finding is interesting in its own right.

        There was at least one implementation error in my program (negative random numbers were erroneously being generated, leading to out of bounds memory accesses).

        I updated the above link with a new version. I still measure version 3 (which alternates generating indices and swapping entries in blocks) to be the faster alternative compared to the traditional algorithm or the one that pregenerates ALL swap indices.

        1. I updated the above link with a new version. I still measure version 3 (which alternates generating indices and swapping entries in blocks) to be the faster alternative compared to the traditional algorithm or the one that pregenerates ALL swap indices.

          That might well be true.

    2. Marc Reynolds says:

      FWIW, an aside: for an LCG, mod should be avoided since the low bits are garbage, and in general mod introduces bias when the range is not a power of two (although that is probably not worth worrying about in this case).

  2. Sean O'Connor says:

    If you use a fast random number generator, like xoroshiro or the RDRAND assembly-language instruction, then with the Fisher–Yates shuffle algorithm, surely most of the time is spent waiting for memory.

    To get numbers between 0 and n quickly, you can multiply a 32-bit random number by n+1 to get a 64-bit product, then shift right by 32 bits. A 64-bit random number with a 128-bit product is also possible, and better.

    Also possible for very large arrays: replace the butterfly operations in an FFT/WHT with random butterfly swaps, though I think that would have the same problem as the naive shuffle, namely a nonuniform distribution of permutations.

    Also, off topic, there is this:
    Similarity Alignment

    1. If you use a fast random number generator, like xoroshiro or the RDRAND assembly-language instruction

      The Java random number generator I used in these examples is a simple linear congruential generator. It is hard to be much faster. I expect RDRAND to be far slower.

      then with the Fisher–Yates shuffle algorithm, surely most of the time is spent waiting for memory

      Do you have some benchmark numbers?

      To get numbers between 0 and n quickly, you can multiply a 32-bit random number by n+1 to get a 64-bit product, then shift right by 32 bits. A 64-bit random number with a 128-bit product is also possible, and better.

      Yes, see https://lemire.me/blog/2016/06/30/fast-random-shuffling/
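
      In C, the 32-bit version of that trick looks roughly like this (a sketch: multiplying by n maps the word to [0, n), and, as the linked post explains, a rejection step is needed for exact uniformity):

      #include <stdint.h>

      /* Map a 32-bit random word to [0, n) with a multiply and a shift,
         avoiding both the division behind '%' and the garbage low bits
         of an LCG. Slightly biased without a rejection step. */
      static inline uint32_t bounded(uint32_t random32, uint32_t n) {
          return (uint32_t)(((uint64_t)random32 * (uint64_t)n) >> 32);
      }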

      1. Marc Reynolds says:

        I’m pretty sure that ThreadLocalRandom is a cascade of xorshift-multiplies.

  3. I’m not sure the implications are as straightforward as suggested in this post – it must depend heavily on the RNG.

    1.9 s / 100M elements on a 2 GHz CPU is 38 cycles/element; 0.8 s is 16 cycles. 20 cycles per RNG operation seems excessive, although I’m not sure what the built-in Java RNG is doing – and moreover, you can’t get an L3 miss, or even an L2 miss, to complete in 16 cycles. What seems more likely to me is that your RNG somehow limits the number of swap instructions that can be pipelined, so the same number of L2/L3 misses takes more time because fewer of them are issued concurrently. Alternatively, your RNG is slow 🙂

    For a more accurate model of what’s going on I’d suggest separately timing filling the indirection array and reordering the actual data.

    Here’s a quick C++ program; I’ve copied PCG32 from the PCG web site and the bounded RNG function from your previous blog post, so I may have introduced off-by-one errors in some cases. I’m also including a forward Fisher–Yates shuffle (copied more or less from Wikipedia).

    https://gist.github.com/zeux/cd41829e610fd3ce33e4ca9a16a16293

    The numbers that I get are very different from yours:

    shuf 2.195379 s, gen 0.338417 s, shuf-gen 2.731224 s, fwd shuf 2.227119 s
    shuf 2.167099 s, gen 0.122208 s, shuf-gen 2.870520 s, fwd shuf 2.314143 s
    shuf 2.269147 s, gen 0.122214 s, shuf-gen 2.845593 s, fwd shuf 2.367242 s
    shuf 2.270375 s, gen 0.122476 s, shuf-gen 2.867398 s, fwd shuf 2.314249 s
    shuf 2.258742 s, gen 0.120076 s, shuf-gen 2.863602 s, fwd shuf 2.328118 s
    shuf 2.265792 s, gen 0.126285 s, shuf-gen 2.923946 s, fwd shuf 2.356490 s
    shuf 2.264414 s, gen 0.122542 s, shuf-gen 2.880007 s, fwd shuf 2.365302 s

    You can see that running the RNG is an order of magnitude faster than shuffling – which is to be expected! – and pre-computing doesn’t help.

    One slightly odd part is that the precomputed shuffle is slower.

    Using % range instead of (*range)>>32 doesn’t change this significantly:

    shuf 2.178649 s, gen 0.472895 s, shuf-gen 2.286348 s, fwd shuf 2.317720 s
    shuf 2.186121 s, gen 0.253171 s, shuf-gen 2.286875 s, fwd shuf 2.242044 s
    shuf 2.634489 s, gen 0.251526 s, shuf-gen 2.319564 s, fwd shuf 2.079921 s
    shuf 2.155671 s, gen 0.250882 s, shuf-gen 2.167186 s, fwd shuf 2.028764 s
    shuf 2.106639 s, gen 0.253934 s, shuf-gen 2.246153 s, fwd shuf 2.216789 s
    shuf 2.179187 s, gen 0.263325 s, shuf-gen 2.218668 s, fwd shuf 2.288851 s

    (Note that gen always runs slower on the first run than on subsequent runs – this is expected, since that is the run that page-faults when writing to freshly allocated memory.)

    1. Ugh, the second set of numbers was taken in a different power mode where the clocks were somewhat higher; the correct run went as follows:

      /mnt/c/work $ g++ -O2 rngshuf.cpp && ./a.out
      shuf 2.501412 s, gen 0.489999 s, shuf-gen 3.500264 s, fwd shuf 3.093110 s
      shuf 2.473130 s, gen 0.259736 s, shuf-gen 2.852467 s, fwd shuf 2.476890 s
      shuf 2.495287 s, gen 0.294554 s, shuf-gen 2.757949 s, fwd shuf 2.401504 s
      shuf 2.398867 s, gen 0.252998 s, shuf-gen 2.830588 s, fwd shuf 2.467871 s
      shuf 2.415531 s, gen 0.255211 s, shuf-gen 2.830087 s, fwd shuf 2.483296 s

      (so slightly slower than fast bounded RNG but still reasonably close)

      1. Me says:

        Given the mismatch between the C results and the Java results – where the precomputed-random version was a lot faster, which would mean that memory accesses are cheaper than random generation – I think it is necessary to look at the generated assembly for both. Otherwise, it is just speculation as to what is going on here.

        1. I modified Arseny’s code so that it follows the algorithms of the blog post and I get very similar numbers:

           g++ -O3 -o rngshuf rngshuf.cpp && ./rngshuf
          Reproducing the Java numbers from blog post https://lemire.me/blog/2018/03/24/when-shuffling-large-arrays-how-much-time-can-be-attributed-to-random-number-generation/#comments
          Caveat: we use PCG instead of the LCG from Java.
          java-bound PCG shuffle 1.623154 s
          precomp  shuffle 0.824944 s
          

          Arseny did not try to reproduce the same algorithms; he is benchmarking something different. He shows that you can make the shuffle fast if you use a fast ranged random number generator. That’s a different point.

          Note also that Arseny’s numbers are somewhat high, which suggests that he is not testing on a server PC configured for testing…

          1. Iwan Zotow says:

            Frankly, even the benchmark is a bit useless. Your precomputed arrays are taking up RAM and, more importantly, cache. In real life there is a bunch of code consuming the shuffled array, and it might want that cache for itself. Xoroshiro128+ is very fast and likely takes no more than one cache line; in real life, which is NOT measured here, not allocating and reading a precomputed array of random numbers might be a big win for consumers of the shuffled array.

            1. Frankly, even the benchmark is a bit useless. Your precomputed arrays are taking up RAM and, more importantly, cache. In real life there is a bunch of code consuming the shuffled array, and it might want that cache for itself.

              The precomputed version is not meant to be used to shuffle arrays, it is meant to answer the question “When shuffling large arrays, how much time can be attributed to random number generation?” To answer this question, we precompute the random numbers.

              Xoroshiro128+ is very fast and likely takes no more than one cache line; in real life, which is NOT measured here, not allocating and reading a precomputed array of random numbers might be a big win for consumers of the shuffled array.

              Xoroshiro128+ is not going to be faster than the default thread-local random number generator. Java uses a linear congruential generator. It is very fast.

              See https://github.com/lemire/testingRNG

          2. Maynard Handley says:

            Is your interest in the implications for Java, or the implications for CPU hardware? Because I find myself agreeing with “Me”: without access to the assembly, it is hard to know what is going on. Is the Java compiler+JIT pipeline inlining both the nextInt() RNG call and the swap()? Does it inline the swap() in the second case but not in the first (because of some heuristic that functions with function calls in their arguments are not inlined, or whatever)?

            Even if inlining is ideal in both cases, there is the question of exactly how many instructions the RNG results in. Under ideal circumstances we want the critical item – the queue that fills up first – to be (I assume) the L2 MSHRs, resulting in maximal memory-level parallelism.
            It is possible (hard to tell without the assembly) that some earlier in-core queue fills up before the MSHRs under the “generate RNG” condition.

            1. Is your interest in the implications for Java, or the implications for CPU hardware? Because I find myself agreeing with “Me”: without access to the assembly, it is hard to know what is going on.

              I have posted both the Java and C code, and we get the same results. I posted the assembly resulting from the C code as a gist:

              https://gist.github.com/lemire/d2047ce1e3b511c54bb47b779a3028f5

              It is possible (hard to tell without the assembly) that some earlier in-core queue fills up before the MSHRs under the “generate RNG” condition.

              Now that I have posted the assembly, can you answer the question?

    2. Iwan Zotow says:

      What would be interesting to test is how it behaves in a macro-benchmark – when there is a multithreaded consumer of the shuffled array that wants all the memory bandwidth and cache lines for itself. Only in a micro-benchmark like yours is a 100M-entry array (400 MB) free; nothing else is competing for the L3/L2/L1 it occupies and pollutes.

  4. KWillets says:

    This type of cache miss overhead is also common with string sorting, where a list of pointers is dereferenced in (increasingly) random order.

    I’m not aware of any memory prediction schemes that automatically prefetch vectors of references like this, but it seems like it can be done with explicit prefetch instructions.
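
    Something along these lines, using GCC/Clang’s __builtin_prefetch – a sketch only, since the right prefetch distance is workload-dependent (the distance of 8 and the function below are made up for illustration):

    #include <stddef.h>
    #include <string.h>

    /* Dereference a list of string pointers, issuing an explicit
       prefetch a fixed distance ahead so each cache miss overlaps
       with the work on the current element. */
    size_t total_length(const char **strings, size_t count) {
        size_t total = 0;
        for (size_t i = 0; i < count; i++) {
            if (i + 8 < count) {
                __builtin_prefetch(strings[i + 8], 0 /* read */, 1 /* low locality */);
            }
            total += strlen(strings[i]);
        }
        return total;
    }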

  5. Oren Tirosh says:

    The LCG may be simple and fast, but it imposes a tight pipeline dependency. Try using a number of LCGs in round-robin fashion, where you use the previous output of each LCG before updating it.
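
    A rough C sketch of that round-robin idea, using four lanes of Java’s 48-bit LCG constants (the lane count of four is an arbitrary choice; the states must be seeded distinctly, and each lane’s output primed with one update):

    #include <stdint.h>

    typedef struct {
        uint64_t state[4]; /* four independent LCG states, seeded distinctly */
        uint32_t out[4];   /* previous output of each lane, primed at setup */
        unsigned lane;
    } lcg4;

    /* Return the previous output of the current lane, then update it,
       so consecutive calls never wait on the same dependency chain. */
    static inline uint32_t lcg4_next(lcg4 *g) {
        unsigned lane = g->lane;
        uint32_t result = g->out[lane];
        uint64_t s = (g->state[lane] * 0x5DEECE66DULL + 0xBULL) & ((1ULL << 48) - 1);
        g->state[lane] = s;
        g->out[lane] = (uint32_t)(s >> 16); /* high bits, as in java.util.Random */
        g->lane = (lane + 1) & 3;
        return result;
    }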

    Alternatively, you can alternate between random precomputation and shuffling. Use precomputed chunks that are large enough to be a good approximation of a long continuous run, yet short enough to fit in the L1 cache. The difference in timing with and without the shuffling should be a good microbenchmark. The only remaining overhead would be that of sequentially fetching values from a small buffer in the CPU cache.

    1. Alternatively, you can alternate between random precomputation and shuffling.

      I believe that this is the idea described by Jeff Epler (see previous comments). After it was tested by Richard Startin in Java, I updated my GitHub repo.

      See https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2018/03/23/src/main/java/me/lemire/microbenchmarks/algorithms/Shuffle.java#L32-L48

      It works well. I call it “blocked”. Here are the results…

      Shuffle.test_shuffle_java             avgt    5  1.927 ±  0.016  s/op
      Shuffle.test_shuffle_java_blocked     avgt    5  1.309 ±  0.051  s/op
      Shuffle.test_shuffle_precomp          avgt    5  0.818 ±  0.001  s/op
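
      For readers who want the idea in C rather than Java, here is a rough sketch of the blocked variant, alternating between filling a small buffer of random indices and performing the corresponding swaps. The batch size of 256 is an arbitrary L1-friendly choice, and random_bounded again stands in for any uniform generator over [0, n):

      #include <stddef.h>
      #include <stdint.h>

      /* assumed: returns a uniform integer in [0, n) */
      extern uint32_t random_bounded(uint32_t n);

      /* Blocked Fisher-Yates: generate a batch of indices, then do the
         swaps, so index generation and the cache-missing swaps are not
         interleaved one by one. */
      void shuffle_blocked(uint32_t *arr, size_t size) {
          uint32_t idx[256];
          size_t i = size;
          while (i > 1) {
              size_t block = (i - 1 < 256) ? (i - 1) : 256;
              for (size_t k = 0; k < block; k++) {
                  idx[k] = random_bounded((uint32_t)(i - k)); /* in [0, i-k) */
              }
              for (size_t k = 0; k < block; k++) {
                  size_t j = idx[k];
                  uint32_t tmp = arr[i - 1 - k];
                  arr[i - 1 - k] = arr[j];
                  arr[j] = tmp;
              }
              i -= block;
          }
      }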