Daniel Lemire's blog


Faster remainders when the divisor is a constant: beating compilers and libdivide

49 thoughts on “Faster remainders when the divisor is a constant: beating compilers and libdivide”

  1. James says:

    Your first paragraph states “multiplications which are themselves slower than divisions”, but surely you meant the opposite.

    1. Yes. Thank you, I fixed that.

  2. Looks like it’s worth trying to implement in compilers.

    1. Brian m says:

      I agree. This sounds like a great patch that could be integrated into LLVM and/or GCC, rather than a library that would need to be included. It’s great to test the approach, but realistically most people writing code will not use your library simply because they don’t know it exists… It would be better if this were just a normal part of the compilation process (in the compiler itself).

  3. Marco says:

    Can JavaScript do the same thing? Thanks

    1. This is applicable to JavaScript. Integer arithmetic in JavaScript is tricky so you would need to focus on a specific problem.

      Evidently, this trick could be implemented in the JavaScript compiler.

      1. Marco says:

        Hi Daniel, you mean pure JavaScript code can do that?
        Can you explain more? Thanks

      2. Marco says:

        Hi Daniel, I have tried the JavaScript code below, but I get the wrong result. Can you explain why? Thanks.

        function fmod(divisor, num) {
          let c = (0xFFFFFFFFFFFFFFFF) / divisor + 1;
          let lowbits = c * num;
          return (lowbits * divisor) >> 64;
        }

        console.log(fmod(5/18))

        1. Damian Gray says:

          All numbers in JS are floats under the hood so the raw bits are not going to act the same as ints. That’s probably why the function isn’t working as is in pure JS.

        2. Sol says:

          Numbers in JavaScript are stored as doubles, and thus cannot exactly represent an integer over 53 bits. Additionally, the bitwise operators all convert the operands to 32-bit signed integers before performing the operation. You are doubly unable to do 64-bit shifts in JavaScript.

        3. Nathan says:

          The most likely answer is that JavaScript does not use 64 bit integer numbers. It uses 64 bit floating point numbers which happen to look like integers for most development purposes.

  4. Severin Pappadeux says:
    1. It is a similar idea regarding the divisibility test, but it does not look the same to me. Our divisibility test has a specific form (n * M < M). Also, the answer seems to differentiate between odd and even integers; we do not. Please see our manuscript for the mathematical derivations… https://arxiv.org/abs/1902.01961
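      For concreteness, here is a minimal sketch of a test of that form, assuming a 32-bit n, a constant 32-bit divisor d > 1, and a precomputed 64-bit constant M (an illustration, not necessarily the exact code from the paper or library):

      #include <stdbool.h>
      #include <stdint.h>

      // M = ceil(2^64 / d), precomputed once for a constant divisor d > 1
      static inline uint64_t compute_M_u32(uint32_t d) {
        return UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;
      }

      // n is divisible by d exactly when the low 64 bits of n * M are below M
      static inline bool is_divisible(uint32_t n, uint64_t M) {
        return n * M < M;
      }

      The single multiplication wraps modulo 2^64, so the comparison inspects only the “fractional” bits of n/d; they fall below M exactly when d divides n.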

      1. Viktor says:

        but your test requires a 64-bit type when working with 32-bit integers, while the method from the link uses only 32-bit types

        1. … which makes it even more likely that it is different. The comment was "Looks like …".

          It is not the same.

          Again, please refer to the published manuscript for details.

  5. Steve says:

    Great article and paper! Small typo: “This approach was known Jacobsohn in 1973” should be “This approach was known TO Jacobsohn in 1973”.

    1. I fixed the typo, thanks.

  6. Alexander Monakov says:

    Coincidentally, in September 2018 a slightly extended version of the divisibility check shown in Hacker’s Delight was implemented in GCC (so it doesn’t appear in gcc-8, but will be available in gcc-9). Some discussion can be found in GCC PR 82853, and you can see the optimization in action on Compiler Explorer.

  7. degski says:

    All well and good, iff one doesn’t need the quotient [unless I missed something].

  8. Theo says:

    Please, have you also tried it with a 64-bit “n”, as it would need 192-bit intermediate computations (if I understood correctly)? Can you perhaps share some timings as well, in addition to the 32-bit case?

    PS: Intel’s Cannon Lake is supposed to have 10-18 cycles of latency for integer division, which could make for an interesting comparison!

    1. Please, have you also tried it with a 64-bit “n”, as it would need 192-bit intermediate computations (if I understood correctly)? Can you perhaps share some timings as well, in addition to the 32-bit case?

      You do not need 192-bit intermediate computations, 128-bit is enough. However, the engineering gets more complicated if you are to deal with overflows properly. We are inviting people to pursue the research. We do not claim to have written the last word on this. If you are interested in pursuing the research and want to chat, do get in touch with us.
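      To make the overflow handling concrete, here is a rough sketch of a 64-bit variant, assuming GCC/Clang’s __uint128_t (an illustration with made-up names, not the library’s code). The only delicate step is taking the top 64 bits of the 192-bit product lowbits * d without a 256-bit type, by splitting lowbits into two 64-bit halves:

      #include <stdint.h>

      // M = ceil(2^128 / d) for a constant divisor d > 1
      static inline __uint128_t compute_M_u64(uint64_t d) {
        return ~(__uint128_t)0 / d + 1;
      }

      // bits [128, 192) of lowbits * d, using only 128-bit arithmetic
      static inline uint64_t mul128_hi64(__uint128_t lowbits, uint64_t d) {
        __uint128_t bottom = (__uint128_t)(uint64_t)lowbits * d;   // low half times d
        __uint128_t top = (lowbits >> 64) * d;                     // high half times d
        return (uint64_t)((top + (bottom >> 64)) >> 64);           // sum cannot overflow 128 bits
      }

      // n % d for 64-bit n, in the same style as the 32-bit version
      static inline uint64_t fastmod_u64(uint64_t n, __uint128_t M, uint64_t d) {
        __uint128_t lowbits = M * n;   // wraps modulo 2^128, keeping the fractional bits of n / d
        return mul128_hi64(lowbits, d);
      }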

      Intel’s Cannon Lake is supposed to have 10-18 cycles of latency for integer division, which could make for an interesting comparison!

      I have a Cannon Lake processor right here, and here are the results of the fizzbuzz benchmark:

      $ ./fizzbuzz
      count35(N, &count3, &count5)                :  3.690 cycles per input word (best)  3.698 cycles per input word (avg)
      1666666700 1000000000
      fastcount35(N, &count3, &count5)            :  2.082 cycles per input word (best)  2.085 cycles per input word (avg)
      1666666700 1000000000
      

      (This is GCC 8.1.)

      Of course, these fizzbuzz tests do not use the division instruction. The division instruction would be slower. However, it would be less terrible than on Skylake processors.
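      For context, the two functions count how many of the inputs are divisible by 3 and by 5. A rough sketch of the idea behind the fastcount35 variant (a reconstruction for illustration, not the actual benchmark code):

      #include <stdint.h>

      // count the multiples of 3 and of 5 among n = 0, 1, ..., N-1 using the
      // one-multiplication divisibility test, with M = ceil(2^64 / d)
      void fastcount35(uint32_t N, uint64_t *count3, uint64_t *count5) {
        const uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
        const uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;
        uint64_t c3 = 0, c5 = 0;
        for (uint32_t n = 0; n < N; n++) {
          c3 += (n * M3) < M3;
          c5 += (n * M5) < M5;
        }
        *count3 = c3;
        *count5 = c5;
      }

      The count35 baseline presumably computes the same counts with n % 3 == 0 and n % 5 == 0 and lets the compiler expand the remainders.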

      1. Daniel Baker says:

        I’ve added support for 64-bit unsigned modulo in a fork here (https://github.com/dnbaker/fastmod/tree/fastmod64). You’re right that 192- (or 256-) bit integers aren’t needed; all you need to do is make sure you handle overflows correctly.

        I’m not saying that this is optimal code, but it does appear to pass the expanded unit tests.

        Of course, division and support for signed integers would be desirable as well, but for my purposes, a fast 64-bit mod is all I need.

        1. Interesting. I will have to have a look at it.

        2. Todd Lehman says:

          Daniel Baker: Thanks for adapting this!

          By the way, your function to compute M:

          FASTMOD_API __uint128_t computeM_u64(uint64_t d) {
            __uint128_t M = UINT64_C(0xFFFFFFFFFFFFFFFF);
            M <<= 64;
            M |= UINT64_C(0xFFFFFFFFFFFFFFFF);
            M /= d;
            M += 1;
            return M;
          }

          could, I believe, simply be written like this instead:

          FASTMOD_API __uint128_t computeM_u64(uint64_t d) {
            return (((__uint128_t)0 - 1) / d) + 1;
          }

          or equivalently:

          FASTMOD_API __uint128_t computeM_u64(uint64_t d) {
            return (-(__uint128_t)1 / d) + 1;
          }

          or even:

          FASTMOD_API __uint128_t computeM_u64(uint64_t d) {
            return (~(__uint128_t)0 / d) + 1;
          }

          That is to say, it’s not necessary to construct and composite the two 64-bit portions in the case of all 1-bits.

          1. I did not write these functions (not in the current form). I think that the contributor who rewrote them meant to make them more transparent.

            Your point is well taken, however.

  9. Theo says:

    Thank you for the reply!

  10. Travis Downs says:

    It looks like this could be quite useful in the cases where it applies! For example, this libdivide issue could finally be updated (looks like someone got to it since I first ran across it).

    The paper could be clearer, though, that it only applies (for real ALU sizes) to an N/2-bit division on an N-bit machine. Usually you want an N-bit division on an N-bit machine, and that’s what the competing approaches offer.

    This means that on a (typical) 32-bit machine only 16-bit divisors are supported, and on 64-bit only 32-bit divisors are supported. On machines like most x86, where 32-bit and 64-bit multiplications have the same cost, this can still be useful when you are dealing with narrower-than-machine-word things, as your uint32_t benchmarks show.

    This need for double-width multiplications isn’t just incidental: the main speedup comes from avoiding the rounding corrections required in the quotient approaches; by doubling the width of the operations you avoid the need for rounding.

  11. Thanks for the good words.

    The mathematics can be presented as an extension of GM’s approach, and it is just as general. A commenter on this blog post has reported an application of the mathematical result to 64-bit unsigned division on 64-bit processors.

    An additional insight is that you can avoid the extra work related to avoiding overflows by just using wider registers.

    I have more ideas regarding other applications of the mathematical results (that have nothing to do with 32-bit/64-bit registers)… but I don’t have time right now to pursue it so it will have to wait.

    1. Travis Downs says:

      The mathematics can be presented as an extension of GM’s approach, and
      it is just as general.

      The basic mathematics is general, yes — I think everyone agrees that the remainder can be calculated as frac(n / d) * d and in the abstract that’s just as general as n - trunc(n / d) * d.

      However, the earlier work (and this one) primarily focus on the efficient implementation of these methods on real hardware. That’s the interesting part, if I’m not mistaken?

      In that domain, the general target is to implement W-bit division on W-bit machines, and a large amount of the complexity in analysis and much of the runtime cost comes as a result of the difficulty of doing that.

      So if you are going to implement only W-bit operations assuming that your machine can do 2W-bit arithmetic, I would certainly consider that less general – and comparing W-bit approaches against 2W-bit approaches doesn’t shed much light on the performance of the underlying techniques since it mixes the two effects together. In fact, I claim that the majority of the improvement in the assembly you show is due to the use of the wider operations and more precise reciprocal, rather than the more “direct” calculation method.

      A commenter on this blog post has reported an application of the
      mathematical result to 64-bit unsigned division on 64-bit processors.

      I didn’t look at it, but if it were true – and it was as efficient as the narrow-input case you illustrate – then my point is basically moot (although it of course still applies to the paper itself, which doesn’t mention this).

      1. That’s the interesting part, if I’m not mistaken?

        I cannot dictate what is important or interesting to you. But maybe you agree that more than one thing can be interesting at once.

        As far as we know, nobody worked out the mathematics for the direct computation of the remainder. Authors simply ignored the issue. Completing the mathematical framework was important, I think. A direct application of this “new math” is the divisibility test which I think is really nice.

        In fact, I claim that the majority of the improvement in the assembly you show is due to the use of the wider operations and more precise reciprocal, rather than the more “direct” calculation method.

        It might be. It is certainly worth exploring. We published our benchmarking code (see link above) and we include in these benchmarks an approach that computes the remainder with a fast “wide” division to get the quotient first. It was often faster than other alternatives, but also generally slower than the direct approach. That does not contradict your “majority” claim… but few people are interested in the second-best approach when the best approach is no more expensive. Because it was unexciting, it is not reported in the published paper even if it appears in our logs and public code.

        1. Travis Downs says:

          I cannot dictate what is important or interesting to you. But maybe
          you agree that more than one thing can be interesting at once.

          As far as we know, nobody worked out the mathematics for the direct
          computation of the remainder. Authors simply ignored the issue.
          Completing the mathematical framework was important, I think.

          Yes, I did not mean to imply that the mathematics weren’t important. What I mean is that at a high level what makes this whole thing interesting (to me, to you, to almost everyone, I think) in the first place is that it is potentially more efficient than any existing approach. Once that is established, then the math becomes very important indeed to establish the correctness and to understand the limits.

          A direct
          application of this “new math” is the divisibility test which I think
          is really nice.

          It is a good point. I haven’t considered the divisibility test aspect of this or evaluated competing approaches.

          but few people are interested in the second-best approach when the best approach is no more expensive.

          That seems strange to me; I would expect the second-best approach (i.e., the best existing approach) to be the most important baseline, so you can report an improvement over the best existing approach.

          Or do you mean that this second-best approach is also new? Assuming it’s just a 32-bit division using a 64-bit reciprocal, I don’t think that is right: it is well-known that using a large enough reciprocal removes all rounding errors and hence speeds things up – IIRC the needed number of bits is W + log(d), just like the new direct remainder calculation. Almost every treatment of the topic explains that if you have an X-bit reciprocal the result is exact, and then goes on to handle the harder case where W < X and so you need extra steps to get the exact result, because people do really want W-bit math on W-bit machines.

          Yeah, compilers (and libdivide?) don’t seem to take advantage of this special case, but that’s not news: there are a lot of “narrow word” tricks they don’t implement (and some they do, they seem to have been added over time as special cases).

          So for me, the observation with the biggest impact is that compilers and libdivide should special case 32-bit division on 64-bit platforms (where 64-bit multiplication is as fast as 32-bit at least, and the handling of 64-bit constants isn’t too onerous) by using a finer-precision reciprocal approximation that produces an exact result. This applies to division as well, not just remainder, so it has wide applicability. The most extensive empirical justification I know of for this improvement is hidden in a paper talking about “faster remainders via direct computation” though!
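          To illustrate the special case being advocated here (a sketch assuming a compiler with __uint128_t, not any particular library’s code): with a full 64-bit reciprocal, a 32-bit quotient by a constant d > 1 needs only one multiplication and one shift, with no correction step.

          #include <stdint.h>

          // M = ceil(2^64 / d); with this much precision the quotient estimate below
          // is exact for every 32-bit n, so no fix-up is needed (d > 1 assumed)
          static inline uint64_t compute_M_u32(uint32_t d) {
            return UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;
          }

          static inline uint32_t fastdiv_u32(uint32_t n, uint64_t M) {
            return (uint32_t)(((__uint128_t)M * n) >> 64);
          }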

          Compilers and libdivide should of course also consider using this new faster remainder algo when only remainder is needed and the inputs are narrower than the word size (that said, I think you can extend your algorithm to work generally for W-bit divisions on W-bit hardware, using the same types of rounding tricks as division uses – that would certainly give a boost to likelihood of integration, I think). They should also use the divisibility check (although again I didn’t really evaluate this part).

          1. The most extensive empirical justification I know of for this
            improvement is hidden in a paper talking about “faster remainders via
            direct computation” though!

            That’s fair criticism.

      2. I have written a follow-up blog post: More fun with fast remainders when the divisor is a constant where I try to elaborate a bit more on the issues.

  12. Johannes says:

    This is really great! Do you happen to know of similar research for “mathematical modulo”? That is, where the modulo function always returns a non-negative integer, so instead of

    (−𝑛) mod 𝑑 = −(𝑛 mod 𝑑)

    we have

    (−𝑛) mod 𝑑 = d − (𝑛 mod 𝑑)

    I’m also curious if there are any newer results on floating point remainder (for my application specifically, a floating point numerator and integer divisor).

    1. We have not looked at the "mathematical modulo", but it is probably not too difficult.

      Floating point remainders are probably another story entirely.

  13. Johannes says:

    Great. If I ever get time, I’ll take a look at that with your paper as inspiration. Current implementations seem very slow, either using branching or two modulo operations.
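    For illustration, the two common (slower) patterns referred to here, assuming a 32-bit signed n and a positive divisor d no larger than INT32_MAX (plain C, not tied to any particular library):

    #include <stdint.h>

    // always-non-negative ("mathematical") modulo via a branch
    static inline uint32_t mod_branch(int32_t n, uint32_t d) {
      int32_t r = n % (int32_t)d;
      return (uint32_t)(r < 0 ? r + (int32_t)d : r);
    }

    // the same thing via two modulo operations (widened to avoid overflow)
    static inline uint32_t mod_twice(int32_t n, uint32_t d) {
      int64_t dd = (int64_t)d;
      return (uint32_t)((n % dd + dd) % dd);
    }

    For example, mod_branch(-7, 3) is 2, whereas C’s -7 % 3 is -1.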

  14. Brian Kessler says:

    For comparison with the Granlund and Montgomery divisibility check, I just put in a couple of changes to implement that check in the Go compiler following the treatment in Hacker’s Delight section 10-17.

    The general method can compute unsigned divisibility with one multiplication and a compare for odd divisors, plus an additional rotate for even divisors (a sketch follows below). That matches your method when the divisor is odd, but obviously has the extra rotate instruction when the divisor is even. It also has the benefit of using register-sized operations, so it can work for 64-bit divisibility checks on 64-bit platforms as well. For signed divisibility, the method is similar but requires an additional add.

    Unsigned divisibility checks: https://github.com/golang/go/commit/a28a9427688846f971017924bcf6e8cb623ffba4

    Signed divisibility checks: https://github.com/golang/go/commit/4d9dd3580624df413d65d83e467fcd6ad4a0168b
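    For reference, here is a sketch of the Hacker’s Delight style check described above (an illustration of the technique, not the Go compiler’s generated code): write d = d_odd * 2^k, multiply by the modular inverse of d_odd, rotate right by k, and compare against a precomputed threshold.

    #include <stdbool.h>
    #include <stdint.h>

    // multiplicative inverse of an odd d modulo 2^32 by Newton's iteration
    // (each step doubles the number of correct low bits)
    static inline uint32_t modular_inverse_u32(uint32_t d) {
      uint32_t x = d;       // correct to 3 bits: d * d == 1 (mod 8) for odd d
      x *= 2 - d * x;       // 6 bits
      x *= 2 - d * x;       // 12 bits
      x *= 2 - d * x;       // 24 bits
      x *= 2 - d * x;       // 48 bits, more than enough
      return x;
    }

    static inline uint32_t rotr32(uint32_t x, unsigned k) {
      return (x >> k) | (x << ((32 - k) & 31));
    }

    // d = d_odd * 2^k with d_odd odd; a compiler would evaluate the inverse,
    // the shift count and the threshold UINT32_MAX / d at compile time
    static bool gm_is_divisible(uint32_t n, uint32_t d) {
      unsigned k = (unsigned)__builtin_ctz(d);   // assumes GCC/Clang
      uint32_t q = n * modular_inverse_u32(d >> k);
      return rotr32(q, k) <= UINT32_MAX / d;
    }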

  15. Brian Kessler says:

    As a follow-up to the Granlund-Montgomery-Warren divisibility check, it looks like recently released gcc9.1 now includes the Hacker’s Delight section 10-17 version of the check at -O2 and will vectorize it at -O3.

    Would be interesting to re-run the benchmark above on gcc9.1. Also, looking at the generated code for gcc8.1, it looks like the big advantage of the LKK method is that gcc is able to combine both the 3 and 5 checks into a single multiplication. The compiler can’t vectorize either version because count3 and count5 are stored on every loop iteration rather than only at the end of the loop:

    https://godbolt.org/z/DivJrk

    vs.

    https://godbolt.org/z/nb6hMg

    It also looks like LLVM has an open change from Aug 2, 2018 to include the Granlund-Montgomery-Warren divisibility check optimization:

    https://reviews.llvm.org/D50222

  16. RATRI GALUH says:

    What is the purpose of the formula n/d = n * (2^N/d) / 2^N?

  17. Paul Mattione says:

    This work is great, thank you. I have a question about fastdiv_s32. It looks like it’s not supported for MSC_VER because there’s no intrinsic for uint64_t * int64_t, right?

    However, can’t we do the math with the unsigned version and fix the sign at the end? We can first cast the numerator to int64_t before computing its absolute value, so that abs(INT_MIN) doesn’t overflow.

    Or is there something I’m missing?
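    For concreteness, a sketch of that idea (an illustration of the suggestion, not the library’s code; on MSVC the final high multiply could presumably use the unsigned _umul128 intrinsic instead of __uint128_t):

    #include <stdint.h>

    // n % d for signed n and a constant divisor d in [2, INT32_MAX], via the
    // unsigned computation; M = ceil(2^64 / d) as elsewhere. Widening to int64_t
    // before negating means abs(INT32_MIN) cannot overflow.
    static inline int32_t fastmod_via_unsigned(int32_t n, uint64_t M, int32_t d) {
      uint32_t pos = (uint32_t)((n < 0) ? -(int64_t)n : (int64_t)n);
      uint64_t lowbits = M * pos;
      int32_t r = (int32_t)(((__uint128_t)lowbits * (uint32_t)d) >> 64);
      return (n < 0) ? -r : r;   // match C's %: the result takes the sign of n
    }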

    1. Could you take this discussion to the code repo., maybe issue a PR?

  18. Tomasz Grysztar says:

    In 2017 I informally published (on my assembly-programming forum) a mathematical proof of a similar family of methods for division and/or computation of the remainder: https://board.flatassembler.net/topic.php?p=199784#199784

    I wrote there that it is possible to “get the remainder R by taking high bits of LD”, which is analogous to the “fastmod” method shown here (what was “LD” in my notation is “lowbits * d” here).
    I was not aware that this might be considered novel, otherwise I would have pursued the research further. I still might, as I believe there is still more to be discovered in the area.

    Later I also derived a similar divisibility test through a short and simple reasoning based on 2-adic reciprocals: https://board.flatassembler.net/topic.php?p=211618#211618

    It differs from the one presented as “is_divisible” here, since it uses a different set of constants, but it works on the same principle of just a single multiplication and then a comparison.

    1. Tomasz Grysztar says:

      Correction: my divisibility test seems to be the same as described by Granlund and Montgomery, which is mentioned and improved on here. Only my derivation was a bit different (the language of 2-adic numbers makes it a little simpler).

  19. Thank you for the post and paper. I thought you might be interested to know I just submitted a change to Go’s garbage collector to make use of the tighter precision bounds from your paper. In the benchmarks I’ve run so far, it improved Go program performance by about 1.5%.

    The commit is https://github.com/golang/go/commit/4662029. You can see the old and new runtime code in mbitmap.go. We were already replacing division with multiplication, but also used 2 variable shifts to avoid overflow. We also had several branches to try to avoid the multiplication. But it turns out that a single, unconditional 32-bit multiplication works for our needs and is considerably faster.

    1. I am pleased, thanks for getting in touch. If you’d like to elaborate a bit, I’d love to publish a guest post.

  20. Nick says:

    I am wondering how this is different from Barrett reduction?

    1. The Barrett reduction relies on a branch.

      1. And it works by effectively ‘estimating’ the quotient.
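        For comparison, a minimal sketch of a Barrett-style reduction for a 32-bit n and a constant divisor d (a simplified illustration, not any particular library’s code):

        #include <stdint.h>

        // Barrett-style reduction: estimate the quotient with the truncated
        // reciprocal floor(2^32 / d), then correct it with a branch
        static inline uint32_t barrett_mod_u32(uint32_t n, uint32_t d) {
          uint64_t m = (UINT64_C(1) << 32) / d;    // precomputable for a constant d
          uint32_t q = (uint32_t)((m * n) >> 32);  // quotient estimate, may be low by one
          uint32_t r = n - q * d;
          if (r >= d) r -= d;                      // the correction branch
          return r;
        }

        The direct method discussed in the post removes both the estimate-then-correct step and the branch by working with the fractional bits instead of the quotient.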

  21. A. Faz says:

    When you suggest F = 2*N, this is similar to using Barrett reduction, which is the common approach in cryptography for reducing double-sized operands.