Daniel Lemire's blog

, 8 min read

More fun with fast remainders when the divisor is a constant

12 thoughts on “More fun with fast remainders when the divisor is a constant”

  1. Theo says:

    Interesting!

    Btw., GCC 8.1 seems to generate:

    compilermod32(unsigned int):
    mov eax, edi
    mov edx, -1171354717
    mul edx
    mov eax, edx
    shr eax, 4
    imul eax, eax, 22
    sub edi, eax
    mov eax, edi
    ret

    whereas GCC trunk generates one less instruction:

    compilermod32(unsigned int):
    mov eax, edi
    mov edx, 3123612579
    imul rax, rdx
    shr rax, 36
    imul eax, eax, 22
    sub edi, eax
    mov eax, edi
    ret

    PS: And please, I certainly don’t mean to tell you what to do, but since last time you mentioned you have Cannon Lake nearby and I don’t, I think it would be interesting to compare this on it, too.

    PPS: One another common case which would benefit from fast remainders is modular multiplication, i.e. a*b%c (all 64-bit unsigned ints, c is “runtime” constant) and the way to currently approach it is to use Montgomery multiplication. Just saying…

    1. Yes further investigations are warranted.

      On Cannonlake the results are similar (broadly speaking) but the division instruction is much faster.

      1. Theo says:

        Thanks! Do you happen to have any numbers?

    2. foobar says:

      Just noting that up to four simple register-register moves per clock cycle can be “free” on Core microarchitectures in the sense they don’t use an execution unit. I’m not certain how much difference these variants make in practice…

      1. Travis Downs says:

        Although it doesn’t use an execution unit, it still uses one of the 4 renamer “slots” which are what bottlenecks all recent Intel chips to at most 4 fused-domain ops per cycle. Since recent CPUs are rich in execution units, it is not uncommon that this limit is a bottleneck rather than EUs. Eliminated moves also take up a slot in the ROB, but not in the scheduler.

        So I really hesitate to call them free. Very unscientifically I would say that they are something like half the cost of the cheapest real instructions (things like basic integer math).

        1. foobar says:

          Good to know that, I haven’t thought of renamer slots can be such a bottleneck.

          BTW, the better GCC generated compilermod32 is still pretty bad code. It could be trivially rewritten as:

          compilermod32(unsigned int):
          mov rax, 3123612579
          imul rax, rdi
          shr rax, 36
          imul eax, eax, -22
          add eax, edi
          ret

          1. foobar says:

            Argh, I forgot that plain 32-bit move on amd64 zeroes upper bits of the 64-bit register. Thus the first move can’t be eliminated this way while maintaining correct semantics. The second, though, can.

          2. Travis Downs says:

            Yeah this limit of 4 is pretty fundamental and why people still say that Intel chips are 4-wide (since all the other parts of the pipeline are wider these days: 8-wide retirement, 8 EUs, 6 uops from uop cache, or 5 from legacy decoder, etc).

            It’s even baked into Intel’s top-down analysis stuff, which is based on the idea of 4 slots per cycle, and assigning each slot to “effectively used” (they call “retiring”) or empty for some reason in a large tree of possible reasons. I.e., if you solve every possible bottleneck in a top-down analysis, you’ll end up at 4 IPC (4 FUPC, fused-uops-per-cycle, but grant me a bit of looseness in the terminology here).

            Even nops count against this limit, even though they too do not execute.

  2. Yakov says:

    I believe the original code can do uint64_t mod uint32_t, without any changes, except for fastmod_u32’s signature:)

    1. Yakov says:

      Wrong of course. I thought that because uint128_t can hold the result of uint64_t x uint64_t, but what matters here is the precision.
      The most I could get out of testing was you can do a reliable uintA % uintB as long as A + B <= 64. E.g., uint48 % uint16, or uint52 % uint12.
      (In uint52 I mean a uint64 with top 12 bits being 0)

  3. mcortese says:

    Wrong. The function called directmod64() returns the quotient, not the reminder.

    1. It does return the remainder.