In terms of applications, this came up for me recently when looking at an algorithm for exact division (remainder known to be zero).
You can actually do a little better and get 5 correct starting bits with y = (3 * x) ^ 2 and in that case you also only need 4 iterations for the 64 bit inverse.
uint32_t inverse32(uint32_t a)
{
uint32_t t, x = (3*a) ^ 2;
t = a * x - 1;
x *= t * t - t + 1;
t = a * x - 1;
x *= t * t - t + 1;
return x;
}
LCRsays:
I might be late to the party, but here’s a solution to remove some more cycles by hacking an 8bit approximation with a small table (especially efficient if you have the bextr instruction).
uint64_t inverse64(uint64_t a) {
uint8_t t[] = {
2, 174, 210, 190, 66, 174, 210, 254,
2, 46, 82, 190, 66, 46, 82, 254,
};
uint64_t x;
x = t[(a>>1)&15] - a; // 8bit precision
x *= 2 - a * x; // 16
x *= 2 - a * x; // 32
x *= 2 - a * x; // 64
return x;
}
Interesting article!
Those using C++14 or newer can use the constexpr specifier to evaluate the functions at compile time:
constexpr uint64_t f64(uint64_t x, uint64_t y) {
return y * ( 2 – y * x );
}
constexpr uint64_t findInverse64(uint64_t x) {
uint64_t y = x;
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
return y;
}
Great!
Does this work with C++11?
C++11 Solution:
constexpr uint64_t rec_f64(uint64_t x, uint64_t y, int n) {
return (n > 0) ? rec_f64(x, f64(x, y), n-1) : f64(x, y);
}
constexpr uint64_t findInverse64(uint64_t x) {
return rec_f64(x, x, 5);
}
See the following blog post to learn why this is necessary:
http://fendrich.se/blog/2012/11/22/compile-time-loops-in-c-plus-plus-11-with-trampolines-and-exponential-recursion/
In terms of applications, this came up for me recently when looking at an algorithm for exact division (remainder known to be zero).
You can actually do a little better and get 5 correct starting bits with y = (3 * x) ^ 2 and in that case you also only need 4 iterations for the 64 bit inverse.
That’s according to “personal communication from Montgomery” from this paper: https://arxiv.org/pdf/1303.0328.pdf
For older machine, this may be faster to calculate 64-bits inverse:
static uint32_t inverse32(uint32_t a)
{
uint32_t x = (3*a) ^ 2;
x *= 2 - a * x;
x *= 2 - a * x;
x *= 2 - a * x;
return x;
}
static uint64_t inverse64(uint64_t a) // (a x) mod 2^64 = 1
{
uint32_t terms, a0 = a; // a = a1<<32 | a0
uint32_t x0 = inverse32(a0); // a0 x0 = 1
terms = (uint32_t) (a >> 32) * x0; // term a1 x0
terms += ((uint64_t) a0 * x0) >> 32; // term a0 x0 / 2^32
uint32_t x1 = x0 * -terms; // a0 x1 + terms = 0
return (uint64_t) x1 << 32 | x0; // x = a^-1 mod 2^64
}
Found an old post with formula that trebled bits accuracy per step.
For inverse32(), only 2 steps are needed.
https://groups.google.com/forum/#!msg/sci.crypt/UI-UMbUnYGk/hX2-wQVyE3oJ
uint32_t inverse32(uint32_t a)
{
uint32_t t, x = (3*a) ^ 2;
t = a * x - 1;
x *= t * t - t + 1;
t = a * x - 1;
x *= t * t - t + 1;
return x;
}
I might be late to the party, but here’s a solution to remove some more cycles by hacking an 8bit approximation with a small table (especially efficient if you have the bextr instruction).
uint64_t inverse64(uint64_t a) {
uint8_t t[] = {
2, 174, 210, 190, 66, 174, 210, 254,
2, 46, 82, 190, 66, 46, 82, 254,
};
uint64_t x;
x = t[(a>>1)&15] - a; // 8bit precision
x *= 2 - a * x; // 16
x *= 2 - a * x; // 32
x *= 2 - a * x; // 64
return x;
}
Here’s a solution to remove some more cycles by hacking an 8bit approximation with a small table
It is interesting, but I doubt that it will be much faster.