27th June 2016, 7 min read

A fast alternative to the modulo reduction

Suppose you want to pick an integer at random in a set of N elements. Your computer has functions to generate random 32-bit integers. How do you transform such numbers into indexes smaller than N? Suppose you have a hash table with a capacity N. Again, you need to transform your hash values (typically 32-bit or 64-bit integers) down to an index smaller than N. Programmers often get around this problem by making sure that N is a power of two, but that is not always ideal.

We want a map that is as fair as possible for an arbitrary integer N. That is, ideally, we would want exactly 2³²/N values mapped to each value in the range {0, 1, …, N – 1} when starting from all 2³² 32-bit integers.

Sadly, we cannot have a perfectly fair map if 2³² is not divisible by N. But we can have the next best thing: we can require that there be either floor(2³²/N) or ceil(2³²/N) values mapped to each value in the range.

If N is small compared to 2³², then this map could be considered as good as perfect.

The common solution is to do a modulo reduction: x mod N. (Since we are computer scientists, we define the modulo reduction to be the remainder of the division, unless otherwise stated.)

uint32_t reduce(uint32_t x, uint32_t N) {
  return x % N;
}

How can I tell that it is fair? Let us just run through the values of x starting with 0. You should be able to see that the modulo reduction takes on the values 0, 1, …, N – 1, 0, 1, … as you increment x. Eventually, x arrives at its last value (2³² – 1), at which point the cycle stops. If N does not divide 2³², this leaves the values 0, 1, …, (2³² – 1) mod N with ceil(2³²/N) occurrences, and the remaining values with floor(2³²/N) occurrences. It is a fair map with a bias for smaller values.
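The counting argument can be checked by brute force on a scaled-down version of the problem. The sketch below is only an illustration: it assumes 16-bit inputs instead of 32-bit ones so that the exhaustive loop is instantaneous, and the name modulo_is_fair_16 is made up for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scaled-down experiment (an assumption of this sketch): 2^16 inputs
   instead of 2^32, so every input can be visited quickly. */
#define WORD_VALUES 65536u

/* Returns 1 if x % N maps exactly ceil(2^16/N) inputs to each of the
   outputs 0 .. (2^16 mod N) - 1 and floor(2^16/N) inputs to the others.
   Requires N >= 1. */
int modulo_is_fair_16(uint32_t N) {
  uint32_t *count = calloc(N, sizeof *count);
  if (count == NULL) return 0;
  for (uint32_t x = 0; x < WORD_VALUES; x++) count[x % N]++;
  uint32_t lo = WORD_VALUES / N;   /* floor(2^16 / N) */
  uint32_t rem = WORD_VALUES % N;  /* how many outputs get one extra */
  int ok = 1;
  for (uint32_t k = 0; k < N; k++) {
    uint32_t expected = lo + (k < rem ? 1u : 0u);
    if (count[k] != expected) ok = 0;
  }
  free(count);
  return ok;
}
```

Calling modulo_is_fair_16(10), for instance, should return 1: the outputs 0 to 5 are hit 6554 times each and the outputs 6 to 9 are hit 6553 times each.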

It works, but a modulo reduction involves a division, and divisions are expensive. Much more expensive than multiplications. A single 32-bit division on a recent x64 processor has a throughput of one instruction every six cycles with a latency of 26 cycles. In contrast, a multiplication has a throughput of one instruction every cycle and a latency of 3 cycles.

There are fancy tricks to “precompute” a modulo reduction so that it can be transformed into a couple of multiplications as well as a few other operations, as long as N is known ahead of time. Your compiler will make use of them if N is known at compile time. Otherwise, you can use a software library or work out your own formula.
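To give a flavour of what such a precomputation can look like, here is a minimal sketch of one known formulation: compute a 64-bit "magic" constant once per divisor, then get the remainder with two multiplications. The function names are hypothetical, and the code assumes a compiler that provides the __uint128_t extension (GCC and Clang do on 64-bit targets). It is not necessarily what your compiler or any given library does.

```c
#include <stdint.h>

/* Precompute once per divisor d (d >= 1):
   M = floor((2^64 - 1) / d) + 1, computed modulo 2^64. */
static inline uint64_t compute_magic(uint32_t d) {
  return UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;
}

/* Computes a % d without a division instruction.
   M must be the value returned by compute_magic(d). */
static inline uint32_t mod_with_magic(uint32_t a, uint64_t M, uint32_t d) {
  uint64_t lowbits = M * a; /* wraps modulo 2^64 on purpose */
  /* high 64 bits of the 128-bit product lowbits * d */
  return (uint32_t)(((__uint128_t)lowbits * d) >> 64);
}
```

The division moves into compute_magic, which runs once; every later reduction by the same d costs only multiplications.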

But it turns out that you can do even better! That is, there is an approach that is easy to implement, and provides just as good a map, without the same performance concerns.

Assume that x and N are 32-bit integers, and consider the 64-bit product x * N. Then (x * N) div 2³² is in the range {0, 1, …, N – 1}, and it is a fair map.

uint32_t reduce(uint32_t x, uint32_t N) {
  // widen to 64 bits so the product cannot overflow, then keep the high 32 bits
  return (uint32_t)(((uint64_t) x * (uint64_t) N) >> 32);
}

Computing (x * N) div 2³² is very fast on a 64-bit processor. It is a multiplication followed by a shift. On a recent Intel processor, I expect that it has a latency of about 4 cycles and a throughput of at least one call every 2 cycles.
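A few values worked by hand may help to see what the map does. The sketch below repeats the same computation under a hypothetical name, fast_range32, and lists some outputs for N = 10 in the comments.

```c
#include <stdint.h>

/* Same multiply-then-shift computation, under a name chosen for this
   example only. */
static inline uint32_t fast_range32(uint32_t x, uint32_t N) {
  return (uint32_t)(((uint64_t)x * (uint64_t)N) >> 32);
}

/* With N = 10:
     fast_range32(0x00000000, 10) == 0
     fast_range32(0x19999999, 10) == 0   (10 * x is still below 2^32)
     fast_range32(0x1999999A, 10) == 1   (10 * x just reached 2^32)
     fast_range32(0x80000000, 10) == 5   (the midpoint maps to N / 2)
     fast_range32(0xFFFFFFFF, 10) == 9   (the largest input maps to N - 1)
*/
```

Unlike x mod N, the output never decreases as x grows: it is determined by the most significant bits of x rather than the least significant ones. That matters if your input values are only well mixed in their low bits.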

So how fast is our map compared to a 32-bit modulo reduction?

To test it out, I have implemented a benchmark where you repeatedly access random indexes in an array of size N. The indexes are obtained either with a modulo reduction or our approach. On a recent Intel processor (Skylake), I get the following number of CPU cycles per access:

modulo reduction	fast range
8.1	2.2

So it is nearly four times faster! Not bad.
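If you want to try a similar experiment yourself, a rough harness could look like the following. To be clear, this is not the benchmark that produced the numbers above: the xorshift32 generator, the function names and the use of clock() (which reports seconds, not cycles) are all assumptions of this sketch.

```c
#include <stdint.h>
#include <time.h>

/* Placeholder generator for this sketch; any fast 32-bit generator works.
   The state must be nonzero. */
static inline uint32_t xorshift32(uint32_t *state) {
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  *state = x;
  return x;
}

/* Sums `iterations` randomly chosen elements of array[0 .. N-1].
   use_modulo != 0 selects x % N, otherwise the multiply-shift map. */
uint64_t sum_random_accesses(const uint32_t *array, uint32_t N,
                             uint32_t iterations, int use_modulo) {
  uint32_t state = 123456789u;
  uint64_t sum = 0;
  for (uint32_t i = 0; i < iterations; i++) {
    uint32_t x = xorshift32(&state);
    uint32_t index = use_modulo ? x % N
                                : (uint32_t)(((uint64_t)x * N) >> 32);
    sum += array[index];
  }
  return sum;
}

/* Returns the CPU time in seconds spent in one run, and stores the sum in
   *sum_out so that the compiler cannot discard the loop. */
double time_accesses(const uint32_t *array, uint32_t N, uint32_t iterations,
                     int use_modulo, uint64_t *sum_out) {
  clock_t start = clock();
  *sum_out = sum_random_accesses(array, N, iterations, use_modulo);
  clock_t end = clock();
  return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Pass N as a run-time value: if the compiler can see N as a constant, it may replace the division with multiplications and the comparison no longer measures a true division.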

As usual, my code is freely available.

What can this be good for? Well… if you have been forcing your arrays and hash tables to have power-of-two capacities to avoid expensive divisions, you may be able to use the fast range map to support arbitrary capacities without too much of a performance penalty. You can also generate random numbers in a range faster, which matters if you have a very fast random number generator.
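For the hash table case, the idea might be sketched as follows. The mixing function mix32 is a stand-in for whatever hash function you already use (its constants are borrowed from a well-known 32-bit finalizer), and bucket_index is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash function for this sketch: a 32-bit bit mixer. */
static inline uint32_t mix32(uint32_t h) {
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

/* Maps a key to a bucket in [0, capacity) for any capacity >= 1,
   not just powers of two, without a division. */
static inline size_t bucket_index(uint32_t key, uint32_t capacity) {
  return (size_t)(((uint64_t)mix32(key) * capacity) >> 32);
}
```

Because the multiply-shift map depends on the high bits of the hash value, it is worth making sure that the hash function mixes its input well before reducing it. Feeding small sequential keys straight into the map would send them all to bucket 0.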

So how can I tell that the map is fair?

By multiplying by N, we take integer values in the range [0, 2³²) and map them to multiples of N in [0, N * 2³²). By dividing by 2³², we map all multiples of N in [0, 2³²) to 0, all multiples of N in [2³², 2 * 2³²) to one, and so forth. To check that this is fair, we just need to count the number of multiples of N in intervals of length 2³². This count must be either ceil(2³²/N) or floor(2³²/N).

Suppose that the first value in the interval is a multiple of N: that is clearly the scenario that maximizes the number of multiples in the interval. How many will we find? Exactly ceil(2³²/N). Indeed, if you draw sub-intervals of length N, then every complete sub-interval begins with a multiple of N and if there is any remainder, then there will be one extra multiple of N. In the worst case scenario, the first multiple of N appears at position N – 1 in the interval. In that case, we get floor(2³²/N) multiples. To see why, again, draw sub-intervals of length N. Every complete sub-interval ends with a multiple of N.

This completes the proof that the map is fair.
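If you prefer an experiment to a proof, the claim can be checked exhaustively on a smaller word size. This sketch assumes a 16-bit analogue of the map, (x * N) div 2¹⁶ with 16-bit x and N, so that all inputs can be enumerated; the function names are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* 16-bit analogue of the map: (x * N) div 2^16 for x in [0, 2^16). */
static inline uint32_t fast_range16(uint16_t x, uint16_t N) {
  return ((uint32_t)x * (uint32_t)N) >> 16;
}

/* Returns 1 if every output in [0, N) is hit either floor(2^16/N) or
   ceil(2^16/N) times over all 2^16 inputs. Requires N >= 1. */
int fast_range16_is_fair(uint16_t N) {
  uint32_t *count = calloc(N, sizeof *count);
  if (count == NULL) return 0;
  for (uint32_t x = 0; x < 65536u; x++) count[fast_range16((uint16_t)x, N)]++;
  uint32_t lo = 65536u / N;                       /* floor */
  uint32_t hi = lo + (65536u % N != 0 ? 1u : 0u); /* ceil */
  int ok = 1;
  for (uint32_t k = 0; k < N; k++) {
    if (count[k] != lo && count[k] != hi) ok = 0;
  }
  free(count);
  return ok;
}
```

The same reasoning as in the proof applies with 2¹⁶ in place of 2³², so the function should return 1 for every N from 1 to 65535.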

For fun, we can be slightly more precise. We have argued that the number of multiples was maximized when a multiple of N appears at the very beginning of the interval of length 2³². At the end, we get an incomplete interval of length 2³² mod N. If instead of having the first multiple of N appear at the very beginning of the interval, it appeared at index 2³² mod N, then there would not be room for the incomplete subinterval at the end. This means that whenever a multiple of N occurs before 2³² mod N, then we shall have ceil(2³²/N) multiples, and otherwise we shall have floor(2³²/N) multiples.

Can we tell which outcomes occur with frequency floor(2³²/N) and which occur with frequency ceil(2³²/N)? Yes. Suppose we have an output value k. We need to find the location of the first multiple of N no smaller than k 2³². Relative to the start of the interval, this location is ceil(k 2³² / N) N – k 2³², which we just need to compare with 2³² mod N. If it is smaller, then we have a count of ceil(2³²/N), otherwise we have a count of floor(2³²/N).
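The rule translates almost directly into code. Here is a minimal sketch, with a hypothetical function name, that returns how many 32-bit inputs map to a given output k. It uses 64-bit arithmetic throughout, and it returns a 64-bit value because the count is 2³² itself when N is 1.

```c
#include <stdint.h>

/* Number of 32-bit values x such that (x * N) div 2^32 == k.
   Requires N >= 1 and k < N. */
uint64_t count_for_output(uint32_t k, uint32_t N) {
  const uint64_t W = UINT64_C(1) << 32;     /* 2^32 */
  uint64_t start = (uint64_t)k * W;         /* k * 2^32 */
  /* first multiple of N that is no smaller than k * 2^32;
     cannot overflow because k < N < 2^32 */
  uint64_t first = (start + N - 1) / N * N;
  uint64_t offset = first - start;          /* its position in the interval */
  uint64_t q = W / N;                       /* floor(2^32 / N) */
  uint64_t r = W % N;                       /* 2^32 mod N */
  return offset < r ? q + 1 : q;
}
```

With N = 3, for example, the output 0 receives 1431655766 inputs while the outputs 1 and 2 receive 1431655765 inputs each, and the three counts add up to 2³².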

You can correct the bias with a rejection method: see my post on fast shuffle functions.
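One way such a rejection can be written is sketched below. The low 32 bits of the product x * N give the position of the multiple inside its interval, so rejecting the positions below 2³² mod N removes exactly the extra multiple from the intervals that have one. The xorshift32 generator is only a placeholder (it never outputs 0, so it is not itself perfectly uniform), and bounded_unbiased is a hypothetical name.

```c
#include <stdint.h>

/* Placeholder generator for this sketch; the state must be nonzero. */
static inline uint32_t xorshift32(uint32_t *state) {
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  *state = x;
  return x;
}

/* Returns a value in [0, N), N >= 1. Given a uniform 32-bit source, every
   output is equally likely because the surplus inputs are rejected. */
uint32_t bounded_unbiased(uint32_t *state, uint32_t N) {
  uint64_t m = (uint64_t)xorshift32(state) * N;
  uint32_t low = (uint32_t)m;            /* (x * N) mod 2^32 */
  if (low < N) {                         /* cheap test: rejection is possible */
    uint32_t threshold = (0u - N) % N;   /* 2^32 mod N */
    while (low < threshold) {            /* reject and draw again */
      m = (uint64_t)xorshift32(state) * N;
      low = (uint32_t)m;
    }
  }
  return (uint32_t)(m >> 32);
}
```

The division needed to compute the threshold only happens when low is smaller than N, which is rare when N is small compared to 2³², so most calls still cost one multiplication and one shift.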

Useful code: I published a C/C++ header on GitHub that you can use in your projects.

Further reading:

Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation (to appear)
Google Tensorflow adopted this approach through a contribution by David Andersen (see the commit Switching the presized cuckoo map from using strict mod and Google+ post).
What is arguably the best Open Source Chess engine, Stockfish, also adopted this approach.
The technique described in this blog post is in use within Microsoft Arriba.
math/rand: speed up Int31n with multiply/shift instead of modulo (golang issue 16213), runtime: speed up fastrand() % n (golang commit)
Agner Fog, Pseudo-Random Number Generators for Vector Processors and Multicore Processors, Journal of Modern Applied Statistical Methods, 2015.
Kenneth A. Ross, Efficient Hash Probes on Modern Processors, IBM Research Report RC24100 (W0611-039) November 8, 2006

(Update: I have made the proof more intuitive following a comment by Kendall Willets.)