7th June 2018, 6 min read

Vectorizing random number generators for greater speed: PCG and xorshift128+ (AVX-512 edition)

8 thoughts on “Vectorizing random number generators for greater speed: PCG and xorshift128+ (AVX-512 edition)”

Jack Mott says:

June 7, 2018 at 6:25 pm

If you do not have an AVX-512 CPU, you can still experiment with these on some of the cloud providers, which offer AVX-512 VMs.
1. degski says:
  
  July 22, 2018 at 5:22 am
  
  In these tests, I simply write out the random number to a small array in cache. I only measure raw throughput. To get these good results, I “cheat” a bit by interleaving several generators in the vectorized versions. Indeed, without this interleave trick, I find that the processor is almost running idle due to data dependencies.
  
  Isn’t the cheating in the “I simply write out the random number to a small array in cache”? The chances of that happening consistently in the real world are small, unless you’re randomizing your hard disk.
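
  For readers unfamiliar with the interleaving trick quoted above, here is a minimal scalar sketch of the idea. It is not the code from the post: the names `xs128p`, `xs128p_next` and `fill_interleaved` are hypothetical, and the generator is assumed to be the commonly published xorshift128+ variant with shift constants 23, 18 and 5. Two independent states are advanced in the same loop, so the processor can overlap their dependency chains instead of waiting on a single one.

  ```c
  #include <stddef.h>
  #include <stdint.h>

  /* State of one xorshift128+ generator (hypothetical type name). */
  typedef struct { uint64_t s[2]; } xs128p;

  /* One step of xorshift128+, assuming the 23/18/5 shift constants. */
  static inline uint64_t xs128p_next(xs128p *g) {
      uint64_t s1 = g->s[0];
      const uint64_t s0 = g->s[1];
      g->s[0] = s0;
      s1 ^= s1 << 23;
      g->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
      return g->s[1] + s0;
  }

  /* Fill out[0..n) by alternating between two independent generators.
     Each generator's next state depends only on its own previous state,
     so the two chains can execute in parallel inside the pipeline. */
  static void fill_interleaved(xs128p *a, xs128p *b, uint64_t *out, size_t n) {
      size_t i = 0;
      for (; i + 1 < n; i += 2) {
          out[i] = xs128p_next(a);
          out[i + 1] = xs128p_next(b);
      }
      if (i < n) {
          out[i] = xs128p_next(a); /* odd tail element */
      }
  }
  ```

  The output is the two streams woven together, so the result differs from what a single generator would produce. Whether interleaving pays off outside a tight fill loop is exactly the question raised in this thread.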
  1. Daniel Lemire says:
    
    July 22, 2018 at 7:15 pm
    
    Isn’t the cheating in the “I simply write out the random number to a small array in cache”? The chances of that happening consistently in the real world are small, unless you’re randomizing your hard disk.
    
    Cheating as opposed to what? How else do you propose to measure how quickly one can generate random numbers?
    1. degski says:
      
      July 23, 2018 at 4:16 am
      
      As opposed to comparing the performance of PRNGs in a situation that is more likely to reflect real-world usage of any PRNG.
      
      Modern processors will make the operation you’re measuring really fast due to deep pipelines and out-of-order execution. In normal code branch mis-prediction will trash the instruction cache often.
      
      So it really comes down to what one wants to measure; only then might one be able to answer the question of how to do that.
      1. Daniel Lemire says:
        
        July 23, 2018 at 1:27 pm
        
        In normal code branch mis-prediction will trash the instruction cache often.
        
        The point you seem to be making is that random number generation might not be the bottleneck. Something else, like the branchy nature of my code, might be the limiting factor. That’s absolutely correct.
        
        degski says:
        
        July 23, 2018 at 2:28 pm
        
        Something else, like the branchy nature of my code, might be the limiting factor.
        
        I’m saying that in a “normal” program (not specifically testing the throughput of a PRNG), other (surrounding) code will trash your instruction cache (and probably your data cache as well), and therefore in a real-world situation the tested PRNG will not do as well as advertised. How much worse it will do depends on the PRNG and its actual implementation.
        
        What I’m saying is that you are not testing that, or in other words, it’s not a very useful measure.
        
        Daniel Lemire says:
        
        July 23, 2018 at 4:00 pm
        
        My answer as a blog post: Are vectorized random number generators actually useful?
        
        degski says:
        
        July 23, 2018 at 2:41 pm
        
        Sorry for the split response, but there’s more I think.
        
        The efficiency of the intrinsics depends entirely on the values remaining in registers throughout. If they don’t, intrinsics are very expensive, as the whole lot (512 bytes, way bigger than your data cache) needs to be written to memory and back. Again, surrounding code might force those values out of registers, which won’t happen in your test method.
        
        I am not criticizing your implementation(s). I’m just putting question marks alongside your measuring method.
        
        To be honest, I don’t have a good response either. M. E. O’Neill is going to publish a blog post on this particular issue (she told me it’s in the making), so I’m looking forward to that.