26th December 2013, 24 min read

Fastest way to compute the greatest common divisor

34 thoughts on “Fastest way to compute the greatest common divisor”

Steven Pigeon says:

December 26, 2013 at 6:53 pm

What I don’t get is that you have a speedup only on numbers of the form m—2^k and n—2^j, and the speed-up is proportional to min(j,k). How do you explain doubling the speed if asymptotically few pairs of numbers are of that form?
lecteur habituel says:

December 26, 2013 at 5:45 pm

euclyd, not euler.

Thanks for the post!
1. Maths Brane says:
  
  December 12, 2018 at 9:18 pm
  
  Yea, I heart Euler, but this is Euclid, all day.
Leonid Boytsov says:

December 26, 2013 at 6:21 pm

Another excellent example of shaving off constants!
Steven Pigeon says:

December 27, 2013 at 9:27 am

They’re not quadratic, they’re O(lg min(a,b)).

see:

http://en.wikipedia.org/wiki/Euclidean_algorithm#Algorithmic_efficiency
Per Persson says:

December 27, 2013 at 12:39 pm

“And someone ought to update the corresponding Wikipedia page.”

Why don’t you do it yourself?
Per Persson says:

December 27, 2013 at 1:32 pm

By the way, the numbers you used for testing are relatively small. More complicated algorithms are often slower for small numbers and don’t show their efficiency until the numbers are bigger. Without using anything bigger than uint32 you could test numbers of size ~1’000’000’000.
Mike says:

December 27, 2013 at 9:07 am

If you care about asymptotics, then both of these are quadratic. For a subquadratic algorithm, you need something like a half-gcd based algorithm.
Daniel Lemire says:

December 27, 2013 at 9:51 am

@Pigeon

It is not necessary for the numbers to be divisible by two for them to benefit from the binary GCD.

Take 3 and 5. After the first pass in the loop you get 3 and 2. The 2 gets back to 1 due to the ctz shift.

The nice thing with the binary GCD is that it does not use any expensive operation (ctz is quite cheap on recent Intel processors) whereas the basic GCD relies on integer division.
Mike says:

December 27, 2013 at 10:04 am

They are quadratic when considering operations with integers that are larger than an unsigned int/long.

For example, see: https://gmplib.org/manual/Greatest-Common-Divisor-Algorithms.html and https://gmplib.org/manual/Binary-GCD.html
Ralph Corderoy says:

December 27, 2013 at 10:53 am

Hi Daniel, I can trim another 12% off your gcd() above by removing the two redundant shifts by “shift” of “u” and “v” that occur before the loop.
Daniel Lemire says:

December 27, 2013 at 11:10 am

@Ralph

Well done. I have updated my blog post and credited you for the gains.
KWillets says:

December 27, 2013 at 12:46 pm

I wonder if you could save a cmpl by reusing the u > v comparison for the loop break as well. That is:

if( u == v)
break;
else if (u > v)
…

This will shorten the last iteration and probably speed up the speculative execution.
Daniel Lemire says:

December 27, 2013 at 1:29 pm

@KWillets

With clang, your version is faster. With GCC, the version in the blog post is faster. The difference is within 10%.

If I played with compiler flags, there might be other differences as well.

In any case, your version is on github if you want to benchmark it.
Daniel Lemire says:

December 27, 2013 at 1:46 pm

@Persson

I have added a test in my code with large numbers but it makes no difference. Of course, these are word-size integers… results would differ with big integers.
Ralph Corderoy says:

December 28, 2013 at 5:57 am

Hi again Daniel, I can save a further 7.5% on my earlier suggestion by altering the loop to

do {
unsigned m;
v >>= __builtin_ctz(v);
m = (v ^ u) & -(v < u);
u ^= m;
v ^= m;
v -= u;
} while (v);
Steven Pigeon says:

December 28, 2013 at 12:37 pm

I have re-run tests with a version using the built-ins. The speed-ups are there indeed: 40% on larger numbers.

http://hbfs.wordpress.com/2013/12/10/the-speed-of-gcd/

(and @Mike I think the state of the art for fast division is O(n^log_2(3)), which is still more than linear, but subquadratic.)
KWillets says:

December 28, 2013 at 11:38 am

For my tweak the assembler output from gcc has the comparison, then a branch to the top of the loop, then the same comparison :(. The second comparison isn’t reachable by any other path either.

Maybe some syntactic shuffling would trigger the optimization; I may give it a few tries later.
KWillets says:

December 28, 2013 at 8:22 pm

This is faster on my version of gcc:

{
int shift, uz, vz;
uz = __builtin_ctz(u);
if ( u == 0) return v;

vz = __builtin_ctz(v);
if ( v == 0) return u;

shift = uz > vz ? vz : uz;

u >>= uz;

do {
v >>= vz;

if (u > v) {

unsigned int t = v;
v = u;
u = t;
}

v = v – u;
vz = __builtin_ctz(v);
} while( v != 0 );

return u << shift;
}

Results:

gcd between numbers in [1 and 2000]
26.4901 17.6991 32.7869 25.974 24.3902 31.746 36.6972

I was actually trying to get it to utilize the fact that ctz sets the == 0 flag when its argument is 0, so a following test against 0 should not need an extra instruction. However the compiler didn't notice. Instead it set up some interesting instruction interleaving so that the v != 0 test is actually u == v before the subtraction; I believe this is to enable ILP.

Also, using an inline xchg instruction for the swap doubles the speed:

gcd between numbers in [1 and 2000]
26.1438 16.3934 33.6134 25.974 25.4777 30.5344 72.7273
gcd between numbers in [1000000001 and 1000002000]
26.1438 16 33.8983 25.974 25.3165 29.6296 72.7273
Daniel Lemire says:

December 29, 2013 at 1:09 pm

@KWillets

Thanks. I have added your code to the benchmark.

Do you have the code for the version with the xchg instruction?
Daniel Lemire says:

December 29, 2013 at 1:09 pm

@Ralph

I added your version to the benchmark.
KWillets says:

December 29, 2013 at 1:23 pm

Here’s the asm for the swap; I just replaced the part inside the brackets with xswap(u,v):

#define xswap(a,b) __asm__ (\
“xchg %0, %1\n”\
: : “r”(a), “r” (b));

Unfortunately I don’t understand if this is correctly defined (I copied it from some poorly-documented examples), but the assembler output looks good.
Daniel Lemire says:

December 29, 2013 at 2:46 pm

@KWillets

I have checked into github a version with your inline assembly (slightly tweaked to be more standard). It is not faster.

When I ran your code “as is” I got failed tests.

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2013/12/26/gcd.cpp
KWillets says:

December 30, 2013 at 12:54 pm

Looking at Steven’s asm listings, I realized that my compiler was significantly behind, so I downloaded 3G of Apple “updates” last night. These results are now from clang-500.2.79.

I started playing around with various ways of getting abs(v-u) (especially when unsigned) and also realized that bsfl(x) == bsfl(-x), so this works for the inner loop on gcdwikipedia5fast:

do {
v >>= vz;
unsigned int diff =v;
diff -= u;
vz = __builtin_ctz(diff);
if( diff == 0 ) break;
if ( v < u ) {
u = v;
v = 0 – diff;
} else
v = diff;

} while( 1 );

If diff is signed 32-bit it's slightly faster, abs(diff) can be used, and the v < u test can be switched to diff < 0 for a slight gain. But it becomes a 31-bit algorithm. I haven't tried signed 64-bit yet.

Using bsfl(diff) instead of v seems to speed it up significantly; it's probably ILP again since it doesn't have to wait for v to finalize.
KWillets says:

December 30, 2013 at 1:02 pm

Hold on, I just tried signed 64-bit and got a huge boost:

do {
v >>= vz;
long long int diff = v ;
diff -= u;
vz = __builtin_ctz(diff);
if( diff == 0 ) break;
if ( diff < 0 )
u = v;
v = abs(diff);

} while( 1 );
Daniel Lemire says:

December 30, 2013 at 1:26 pm

@KWillets

I added these two alternatives to the benchmark.

I find that results vary a lot depending on the compiler and processor. It is hard to identify a clear winner… except that they are all faster than the Euclidean algorithm with remainder.
KWillets says:

December 30, 2013 at 2:37 pm

I checked the new revision and the 64-bit version (7) should use abs() and a few other edits.

Should I be submitting edits to github?
Taeseung Lee says:

December 30, 2016 at 3:19 pm

Thanks for the post!
detailyang says:

April 29, 2019 at 1:54 pm

It’s cool and it’ faster 3x than mod in my golang implement
George Spelvin says:

December 15, 2020 at 2:15 pm

It’s possible to slightly improve Ralph Corderoy’s branch-free code above (Dec. 28 comment) by using a difference delta rather than an xor delta.

If you don’t mind limiting the input range to INT_MAX, the sign of (int)(v-u) can be used to control the swap:

v -= u; mask = (int)v >> 31; u += v & mask; /* u + (v - u) = v */ v = (v + mask) ^ mask; /* Conditional negate ~(v - 1) = -v */

If you want to accept inputs up to UINT_MAX, it’s still possible to combine the subtract and mask formation with a bit of asm magic (x86 AT&T syntax) to get access to the carry flag:

asm("sub %2,%1; sbb %0,%0" : "=r" (mask), "+r" (v) : "g" (u));

Depending on the CPU, it may be worth spending an instruction to clear the mask to avoid a false dependency on its previous value. Add
, “0” (0)
to the end of the list of input parameters. (For those not familiar with GCC asm syntax, the 0 in quotes means that this input operand should be in the same register as output operand 0, the mask. The 0 in parens is the operand value. GCC will generate an xorl instruction to zero the mask.)
1. Daniel Lemire says:
  
  December 15, 2020 at 3:06 pm
  Your proposal was added to the benchmark. Here are the numbers on my laptop:
```
❯ ./gcd
gcd between numbers in [1 and 2000]
Running tests... Ok!
We proceed to report timings (smaller values are better).
basicgcd                    54.0541
gcdwikipedia2               23.5294
gcdwikipedia2fast           66.6667
gcd_recursive               54.0541
gcd_iterative_mod           53.3333
gcdFranke                   68.9655
gcdwikipedia3fast           66.6667
gcdwikipedia4fast           54.0541
gcdwikipedia5fast           66.6667
gcdwikipedia2fastswap       64.5161
gcdwikipedia7fast           86.9565
gcdwikipedia7fast32         85.1064
gcdwikipedia8Spelvin        58.8235
gcd between numbers in [1000000001 and 1000002000]
Running tests... Ok!
We proceed to report timings (smaller values are better).
basicgcd                    54.0541
gcdwikipedia2               23.6686
gcdwikipedia2fast           66.6667
gcd_recursive               53.3333
gcd_iterative_mod           54.0541
gcdFranke                   68.9655
gcdwikipedia3fast           65.5738
gcdwikipedia4fast           54.7945
gcdwikipedia5fast           66.6667
gcdwikipedia2fastswap       64.5161
gcdwikipedia7fast           86.9565
gcdwikipedia7fast32         83.3333
gcdwikipedia8Spelvin        58.8235
```
  1. Clinton Ingram says:
    
    March 12, 2022 at 10:13 pm
    
    I’ve just noticed your benchmarks don’t change the number range between runs — the offset is used only in the tests. With the benchmarks corrected, the mod approach shows faster on the larger value range.
    
    Also noting the label on the results is misleading. The benchmark appears to be calculating ops per ms, not the timings directly, so larger is better.
    
    Results on my TGL-H laptop after corrections:
    
    gcd between numbers in [1 and 2000] Running tests... Ok! Kops/ms (larger values are better). basicgcd 50 gcdwikipedia2 20.3046 gcdwikipedia2fast 44.4444 gcd_recursive 50 gcd_iterative_mod 48.1928 gcdFranke 46.5116 gcdwikipedia3fast 45.4545 gcdwikipedia4fast 61.5385 gcdwikipedia5fast 44.9438 gcdwikipedia2fastswap 43.0108 gcdwikipedia7fast 48.1928 gcdwikipedia7fast32 76.9231 gcdwikipedia8Spelvin 64.5161
    gcd between numbers in [1000000001 and 1000002000] Running tests... Ok! Kops/ms (larger values are better). basicgcd 37.7358 gcdwikipedia2 7.15564 gcdwikipedia2fast 16.4609 gcd_recursive 37.037 gcd_iterative_mod 37.7358 gcdFranke 16.5289 gcdwikipedia3fast 16.3934 gcdwikipedia4fast 22.3464 gcdwikipedia5fast 16.4609 gcdwikipedia2fastswap 16.3265 gcdwikipedia7fast 18.3486 gcdwikipedia7fast32 31.746 gcdwikipedia8Spelvin 23.3918
    
    I’ve submitted a PR with the fixes
    1. Daniel Lemire says:
      
      March 25, 2022 at 1:32 am
      Thanks. Here are my results after merging your fix.
      
      ❯ ./gcd gcd between numbers in [1 and 2000] Running tests... Ok! We proceed to report kops/ms (larger values are better). basicgcd 40.404 gcdwikipedia2 20.1005 gcdwikipedia2fast 54.7945 gcd_recursive 42.5532 gcd_iterative_mod 42.5532 gcdFranke 38.0952 gcdwikipedia3fast 53.3333 gcdwikipedia4fast 42.1053 gcdwikipedia5fast 53.3333 gcdwikipedia2fastswap 55.5556 gcdwikipedia7fast 67.7966 gcdwikipedia7fast32 51.2821 gcdwikipedia8Spelvin 46.5116 gcd_mod_faster 43.956
      gcd between numbers in [1000000001 and 1000002000] Running tests... Ok! We proceed to report kops/ms (larger values are better). basicgcd 30.5344 gcdwikipedia2 7.28597 gcdwikipedia2fast 19.802 gcd_recursive 30.0752 gcd_iterative_mod 30.303 gcdFranke 14.0351 gcdwikipedia3fast 18.4332 gcdwikipedia4fast 14.2857 gcdwikipedia5fast 18.5185 gcdwikipedia2fastswap 18.7793 gcdwikipedia7fast 23.9521 gcdwikipedia7fast32 18.5185 gcdwikipedia8Spelvin 46.5116 gcd_mod_faster 32.5203
Hakuna Matata says:

October 23, 2022 at 12:26 pm

Tested on random unsigned ints from full 32-bit range. `gcdwikipedia4fast()` is the fastest. Not every algorithm survived the test, btw.