Leonid Boytsov says:

Very interesting stuff, Daniel. Do I get it correctly that you use your previously posted trick to avoid the modulo operation?

I am not a CUDA expert, but it seems totally crazy that loops and conditionals are faster than a modulo operation on a GPU. The GPU must stop the cores that don’t follow a branch (and execute them later), so conditionals should carry a much higher penalty than a poor modulo operation. Is the modulo operation so slow on the GPU, or am I missing something?

The same argument is also partially true for CPUs: a single branch misprediction costs about 10-15 instructions, roughly the cost of a modulo operation.

Daniel Lemire says:

“Do I get it correctly that you use your previously posted trick to avoid the modulo operation?”

Yes.

“Is the modulo operation so slow on the GPU, or am I missing something?”

It is not possible to have unbiased numbers, as far as I can tell, without branching, so the code here compares three strategies involving branching. All should involve just about the same amount of branching.

You can skip the branching at the expense of some (small) bias, but it was not tested here.
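To make the two options concrete, here is a sketch in portable C of the multiply-shift approach under discussion, with a rejection branch for the unbiased version and a branch-free but slightly biased variant. The 32-bit generator is a stand-in xorshift for illustration; on the GPU this would be the per-thread generator state instead.

```c
#include <stdint.h>

/* Stand-in 32-bit generator (illustrative only, not the GPU kernel). */
static uint32_t rng_state = 12345;
static uint32_t next32(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Unbiased integer in [0, range), range > 0. The 64-bit product maps
   the 32-bit random word into `range` buckets of size 2^32 / range;
   the (rarely taken) branch rejects the leftover words that would
   otherwise bias the result. */
uint32_t bounded(uint32_t range) {
    uint64_t m = (uint64_t)next32() * (uint64_t)range;
    uint32_t l = (uint32_t)m;
    if (l < range) {
        uint32_t t = (uint32_t)(0 - range) % range; /* 2^32 mod range */
        while (l < t) {  /* reject and redraw */
            m = (uint64_t)next32() * (uint64_t)range;
            l = (uint32_t)m;
        }
    }
    return (uint32_t)(m >> 32);
}

/* Branch-free but slightly biased variant: keep only the high word. */
uint32_t bounded_biased(uint32_t range) {
    return (uint32_t)(((uint64_t)next32() * (uint64_t)range) >> 32);
}
```

The biased variant trades at most one part in 2^32 / range of bias for the removal of all branching, which is the trade-off mentioned above.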

Thomas Müller Graf says:

I wonder if a different API could speed things up, for example bulk generation.

If the order of the numbers doesn’t matter (is it always important?), you could split the job: for example, instead of generating 1 million numbers in 0..99, generate them in blocks 0..63, 64..95, and 96..99. How many numbers each block gets would itself need to be somewhat random (a Poisson distribution, I think); calculating that would require O(log n).
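The blocks in this idea all have power-of-two sizes (64, 32, and 4 values), so filling a block needs only masked random bits, with no modulo and no rejection. A minimal sketch of the per-block fill, assuming hypothetical names and a stand-in generator; how to draw the per-block counts without bias (e.g. by multinomial sampling) is the part deliberately left out here:

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in 32-bit generator (illustrative only). */
static uint32_t rng_state = 12345;
static uint32_t next32(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Fill out[0..count) with uniform values in [base, base + 2^bits).
   Inside a power-of-two-sized block, a bit mask replaces both the
   modulo operation and the rejection branch. */
void fill_block(uint32_t *out, size_t count, uint32_t base, unsigned bits) {
    uint32_t mask = (bits >= 32) ? UINT32_MAX : ((1u << bits) - 1u);
    for (size_t i = 0; i < count; i++)
        out[i] = base + (next32() & mask);
}
```

For 0..99 one would call fill_block for the blocks 0..63 (6 bits), 64..95 (5 bits), and 96..99 (2 bits), with the three counts drawn so that each block receives a share proportional to its size; otherwise the merged output is biased.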

Once I had to generate random numbers that needed to be unique and somewhat sorted. Sure, you can make them unique by sorting, but that’s slow. I found a faster algorithm; see this discussion.
