In terms of applications, this came up for me recently when looking at an algorithm for exact division (remainder known to be zero).

You can actually do a little better and get 5 correct starting bits with y = (3 * x) ^ 2 and in that case you also only need 4 iterations for the 64 bit inverse.

uint32_t inverse32(uint32_t a)
{
uint32_t t, x = (3*a) ^ 2;
t = a * x - 1;
x *= t * t - t + 1;
t = a * x - 1;
x *= t * t - t + 1;
return x;
}

LCRsays:

I might be late to the party, but here’s a solution to remove some more cycles by hacking an 8bit approximation with a small table (especially efficient if you have the bextr instruction).

uint64_t inverse64(uint64_t a) {
uint8_t t[] = {
2, 174, 210, 190, 66, 174, 210, 254,
2, 46, 82, 190, 66, 46, 82, 254,
};
uint64_t x;
x = t[(a>>1)&15] - a; // 8bit precision
x *= 2 - a * x; // 16
x *= 2 - a * x; // 32
x *= 2 - a * x; // 64
return x;
}

Sam Kramersays:Interesting article!

Those using C++14 or newer can use the constexpr specifier to evaluate the functions at compile time:

constexpr uint64_t f64(uint64_t x, uint64_t y) {

return y * ( 2 – y * x );

}

constexpr uint64_t findInverse64(uint64_t x) {

uint64_t y = x;

y = f64(x,y);

y = f64(x,y);

y = f64(x,y);

y = f64(x,y);

y = f64(x,y);

return y;

}

Daniel Lemiresays:Great!

Does this work with C++11?

Sam Kramersays:C++11 Solution:

constexpr uint64_t rec_f64(uint64_t x, uint64_t y, int n) {

return (n > 0) ? rec_f64(x, f64(x, y), n-1) : f64(x, y);

}

constexpr uint64_t findInverse64(uint64_t x) {

return rec_f64(x, x, 5);

}

See the following blog post to learn why this is necessary:

http://fendrich.se/blog/2012/11/22/compile-time-loops-in-c-plus-plus-11-with-trampolines-and-exponential-recursion/

Brian Kesslersays:In terms of applications, this came up for me recently when looking at an algorithm for exact division (remainder known to be zero).

You can actually do a little better and get 5 correct starting bits with y = (3 * x) ^ 2 and in that case you also only need 4 iterations for the 64 bit inverse.

That’s according to “personal communication from Montgomery” from this paper: https://arxiv.org/pdf/1303.0328.pdf

Albert Chansays:For older machine, this may be faster to calculate 64-bits inverse:

`static uint32_t inverse32(uint32_t a)`

{

uint32_t x = (3*a) ^ 2;

x *= 2 - a * x;

x *= 2 - a * x;

x *= 2 - a * x;

return x;

}

`static uint64_t inverse64(uint64_t a) // (a x) mod 2^64 = 1`

{

uint32_t terms, a0 = a; // a = a1<<32 | a0

uint32_t x0 = inverse32(a0); // a0 x0 = 1

terms = (uint32_t) (a >> 32) * x0; // term a1 x0

terms += ((uint64_t) a0 * x0) >> 32; // term a0 x0 / 2^32

uint32_t x1 = x0 * -terms; // a0 x1 + terms = 0

return (uint64_t) x1 << 32 | x0; // x = a^-1 mod 2^64

}

Albert Chansays:Found an old post with formula that trebled bits accuracy per step.

For inverse32(), only 2 steps are needed.

https://groups.google.com/forum/#!msg/sci.crypt/UI-UMbUnYGk/hX2-wQVyE3oJ

`uint32_t inverse32(uint32_t a)`

{

uint32_t t, x = (3*a) ^ 2;

t = a * x - 1;

x *= t * t - t + 1;

t = a * x - 1;

x *= t * t - t + 1;

return x;

}

LCRsays:I might be late to the party, but here’s a solution to remove some more cycles by hacking an 8bit approximation with a small table (especially efficient if you have the bextr instruction).

`uint64_t inverse64(uint64_t a) {`

uint8_t t[] = {

2, 174, 210, 190, 66, 174, 210, 254,

2, 46, 82, 190, 66, 46, 82, 254,

};

uint64_t x;

x = t[(a>>1)&15] - a; // 8bit precision

x *= 2 - a * x; // 16

x *= 2 - a * x; // 32

x *= 2 - a * x; // 64

return x;

}

Daniel Lemiresays:Here’s a solution to remove some more cycles by hacking an 8bit approximation with a small tableIt is interesting, but I doubt that it will be much faster.