Daniel Lemire's blog

, 2 min read

Fastest way to compute the greatest common divisor

Given two positive integers x and y, the greatest common divisor (GCD) z is the largest number that divides both x and y. For example, given 64 and 32, the greatest common divisor is 32.

There is a fast technique to compute the GCD called the binary GCD algorithm or Stein’s algorithm. According to Wikipedia, it is 60% faster than more common ways to compute the GCD.

I have honestly never written a program where computing the GCD was the bottleneck. However, Pigeon wrote a blog post where the binary GCD fared very poorly compared to a simple implementation of Euler’s algorithm with remainders:

unsigned gcd_recursive(unsigned a, unsigned b)
    if (b)
        return gcd_recursive(b, a % b);
        return a;

Though Pigeon is a great hacker, I wanted to verify for myself. It seems important to know whether an algorithm that has its own wikipedia page is worth it. Unfortunately, the code on Wikipedia’s page implementing the binary GCD algorithm is either inefficient or slightly wrong. Here is a version using a GCC intrinsic function (__builtin_ctz) to find the number of trailing zeros:

unsigned int gcd(unsigned int u, unsigned int v) {
  int shift;
  if (u == 0)
    return v;
  if (v == 0)
    return u;
  shift = __builtin_ctz(u | v);
  u >>= __builtin_ctz(u);
  do {
    unsigned m;
    v >>= __builtin_ctz(v);
    v -= u;
    m = (int)v >> 31;
    u += v & m;
    v = (v + m) ^ m;
  } while (v != 0);
  return u << shift;

My result? Using integers in [0,2000), the simple version Pigeon proposed does 25 millions GCDs per second, whereas my binary GCD does 39 millions GCDs per second, a difference of 55% on an Intel core i7 desktop. Why do my results disagree with Pigeon? His version of the binary GCD did not make use of the intrinsic __builtin_ctz and used an equivalent loop instead. When I implemented something similarly inefficient, I also got a slower result (17 millions GCDs per second) which corroborates Pigeon’s finding.

My benchmarking code is available.

On a 64-bit machine, you probably can adapt this technique using the __builtin_ctzll intrinsic.

Update: You can read more about sophisticated GCD algorithms in the gmplib manual.

Conclusion: The Binary GCD is indeed a faster way to compute the GCD for 32-bit integers, but only if you use the right instructions (e.g., __builtin_ctz). And someone ought to update the corresponding Wikipedia page.