, 12 min read
Ideal divisors: when a division compiles down to just a multiplication
15 thoughts on “Ideal divisors: when a division compiles down to just a multiplication”
, 12 min read
15 thoughts on “Ideal divisors: when a division compiles down to just a multiplication”
Are there similar numbers for 128 and 256-bit unsigned integers? And are these useful in cryptographic contexts? Obviously not as a secret prime, but as something to speed up big-number arithmetic..
Yes.
For 128 bits: 59649589127497217 and 5704689200685129054721
For 256 bits: 1238926361552897 and 93461639715357977769163558199606896584051237541638188580280321
Over the years I’ve noticed that trying to write a lot of numbers to text log can seriously slow down the sections doing the logging. Seems that printf(“%d”,…) is quite the little delay statement. I figure this is down to the the necessity to divide by 10 (with remainder) for every digit thusly printed.
On account of printing human readable numbers, dividing by 10 must be one of the most common divisors encountered. So you’d think that CPU designers would give us dedicated optimized divide-by-10 instructions. I’ll admit I’m pretty out of touch with current instruction sets, but given how doggy printf %d still is, I’d guess they haven’t.
You are right that calls to printf are slow. I do not expect that the conversion from integers to strings is what slows you down in such a case.
For some fast decimal-output ideas, have a look at the Linux kernel binary-to-decimal conversion code. It uses a number of tricks, including smaller constants for restricted ranges and a 200-byte mod-100-to-2-digits lookup table.
I think it is worth noting that this is an instance of strength reduction
(see https://en.wikipedia.org/wiki/Strength_reduction ) and also that compilers really have a cost model to pick when to apply this reduction.
Some HN comments on this post correctly note that the relative cost of division and multiplication varies with time and target architectures.
One problem with the direct computation of a remainder is the need for (using the notation of the 2019 paper) a multiplication with an N+L-bit input, where the numerator is N bits long, and the computation needs L guard bits. (arxiv:2012.12369 defines N differently.)
This means that the algorithm requires the input range be restricted, Yes, we can hash 48-bit pointers on 64-bit machines, but it would be nice to find something that could handle 64-bit inputs directly.
Unfortunately, just dropping the least significant L bits of the intermediate product fails immediately; just computing n/3 requires L=1, and when n = 1,
(n*c>>L) * d >> L
returns 0, while 1 % 3 = 1.Okay, I think, what if I round the division by 2L up? (Adding
(1 << L) - 1
before shifting down by L.)Well, testing with N=16, that fails on 65416%670. L=8, c = 0x61d1, n*c = 0x61a32608, and 0x61=97 is the correct quotient, but 0xa327 * 670 = 0x1ab0012, while the correct remainder is 0x1aa.
But I wonder if it’s possible in general to find some additive constant 0 < b < 2L that would work…
Nope. 65443%670 has nc=0x61ad7713 and requires b <= 0xec, while 16%670 has nc=0x61d10 and requires b >= 0xf0.
The number of bits needed is divisor-dependent (as this blog post demonstrate). You are correct that it does not work for all divisors and word sizes. But, at least for compile-time optimizations, it does not need to.
You might enjoy this software implementation:
https://github.com/lemire/fast_division
The division technique is more useful because it can usually be done with a multiply-high and a right shift. And even the exceptions can be handled with a small number of additional simple instructions.
I was lamenting that the direct remainder computation technique isn’t more useful, because it almost always requires a multiply with an input range wider than the modulo operation it’s emulating. I was (foolishly?) hoping someone might be able to find an improvement.
One application I’m thinking about is the remainder of a bignum modulo a number of small primes, as the first trial-division step in an isprime() test. This generally starts by choosing a single-word divisor d = p1p2p3*…, and finding the 1-word remainder, then checking for divisibility by the individual primes.
Each step consists of finding r = (r << N | limb[i]) % d. N can be chosen to match available operation sizes (64 bits for x86 native divide, 32 bits for emulation via 64-bit multiply), and the primes can be partitioned into divisors d in a variety of ways.
The figure of merit depends on the cost of the remainder and the probability of finding a factor (and therefore obviating future tests). So more smaller divisors can be advantageous if the remainder computation is sped up enough.
If you are lamenting about the speed of 64 bit remainder value calculation for bignums then take a look at the paper of E. W. Mayer (arXiv:1303.0328). I did an implementation on Skylake architecture with < 6 cycles per limb of the given bignum.
For the decomposition of a 64-bit remainder divisibility to the individual primes you don’t need the reduced remainder value – just an indicator if it is 0 or not. I think that can be done with one mullo and one comparison (if my math doesn’t leave me).
Ooh, interesting, thank you! Actually, as soon as I heard the title, I said “oh, that’s right, you probably can…”.
More specifically, when testing for divisibility by multiple primes, I don’t need the true remainder r; s = 2kr is fine, as s ≡ 0 iff r ≡ 0 (mod p).
If k is fixed, then any tests for particular non-zero values of the remainder can also be simply translated.
BTW, that “2kr” was supposed to have a superscript k (2^k r), but despite the fact that it’s documented that HTML can be used in markdown, “2<sup>k</sup>r” renders as “2kr”.
I think there are still corners to improve on remainder range check optimization on modern compilers. For instance, I found out by brute force that x % 7 < 3 where x is uint8_t can be replaced by (uint16_t)(9363 * x) <= 18906. (Compilers seem to emit a dozen-instruction sequence for this.) Similar solutions can be found for other variations, I did just look for these in brute force.
… and with (uint16_t)(9363 * x) < 28087 x can go up to 13109. 9363 is simply ceil(2^16 / 7) and 28087 is round(2^16 * (3/7)). Is there a good reason why such optimizations are not generated? (Of course they should be correct for the whole input range…)
Well, of course it should be 28088 instead of 28087 and 3 * ceil(2^16 / 7). Nonetheless, I’m surprised why remainder range check strength reductions with “next higher word size” aren’t implemented on compilers.