Interesting! Btw., GCC 8.1 seems to generate:
compilermod32(unsigned int):
mov eax, edi
mov edx, -1171354717
mul edx
mov eax, edx
shr eax, 4
imul eax, eax, 22
sub edi, eax
mov eax, edi
ret
whereas GCC trunk generates one less instruction:
compilermod32(unsigned int):
mov eax, edi
mov edx, 3123612579
imul rax, rdx
shr rax, 36
imul eax, eax, 22
sub edi, eax
mov eax, edi
ret
PS: And please, I certainly don’t mean to tell you what to do, but since last time you mentioned you have Cannon Lake nearby and I don’t, I think it would be interesting to compare this on it, too.
PPS: Another common case which would benefit from fast remainders is modular multiplication, i.e. a*b%c (all 64-bit unsigned ints, c is a “runtime” constant); the current way to approach it is Montgomery multiplication. Just saying…
Yes, further investigations are warranted. On Cannon Lake the results are similar (broadly speaking), but the division instruction is much faster.
Theo says:
Thanks! Do you happen to have any numbers?
foobar says:
Just noting that up to four simple register-register moves per clock cycle can be “free” on Core microarchitectures in the sense they don’t use an execution unit. I’m not certain how much difference these variants make in practice…
Travis Downs says:
Although it doesn’t use an execution unit, it still uses one of the 4 renamer “slots” which are what bottlenecks all recent Intel chips to at most 4 fused-domain ops per cycle. Since recent CPUs are rich in execution units, it is not uncommon that this limit is a bottleneck rather than EUs. Eliminated moves also take up a slot in the ROB, but not in the scheduler.
So I really hesitate to call them free. Very unscientifically I would say that they are something like half the cost of the cheapest real instructions (things like basic integer math).
foobar says:
Good to know, I hadn’t thought renamer slots could be such a bottleneck.
BTW, the better GCC generated compilermod32 is still pretty bad code. It could be trivially rewritten as:
compilermod32(unsigned int):
mov rax, 3123612579
imul rax, rdi
shr rax, 36
imul eax, eax, -22
add eax, edi
ret
foobar says:
Argh, I forgot that plain 32-bit move on amd64 zeroes upper bits of the 64-bit register. Thus the first move can’t be eliminated this way while maintaining correct semantics. The second, though, can.
Travis Downs says:
Yeah this limit of 4 is pretty fundamental and why people still say that Intel chips are 4-wide (since all the other parts of the pipeline are wider these days: 8-wide retirement, 8 EUs, 6 uops from uop cache, or 5 from legacy decoder, etc).
It’s even baked into Intel’s top-down analysis methodology, which is based on the idea of 4 slots per cycle, assigning each slot either to “effectively used” (they call it “retiring”) or to wasted for some reason drawn from a large tree of possible reasons. I.e., if you solve every possible bottleneck in a top-down analysis, you’ll end up at 4 IPC (4 FUPC, fused-uops-per-cycle, but grant me a bit of looseness in the terminology here).
Even nops count against this limit, even though they too do not execute.
Yakov says:
I believe the original code can do uint64_t mod uint32_t, without any changes, except for fastmod_u32’s signature :)
Yakov says:
Wrong of course. I thought that because uint128_t can hold the result of uint64_t x uint64_t, but what matters here is the precision.
The most I could get out of testing was you can do a reliable uintA % uintB as long as A + B <= 64. E.g., uint48 % uint16, or uint52 % uint12.
(In uint52 I mean a uint64 with top 12 bits being 0)
mcortese says:
Wrong. The function called directmod64() returns the quotient, not the remainder.
It does return the remainder.