Interesting! Btw., GCC 8.1 seems to generate:
compilermod32(unsigned int):
mov eax, edi
mov edx, -1171354717
mul edx
mov eax, edx
shr eax, 4
imul eax, eax, 22
sub edi, eax
mov eax, edi
ret
whereas GCC trunk generates one less instruction:
compilermod32(unsigned int):
mov eax, edi
mov edx, 3123612579
imul rax, rdx
shr rax, 36
imul eax, eax, 22
sub edi, eax
mov eax, edi
ret
PS: And please, I certainly don’t mean to tell you what to do, but since last time you mentioned you have Cannon Lake nearby and I don’t, I think it would be interesting to compare this on it, too.
PPS: Another common case which would benefit from fast remainders is modular multiplication, i.e. a*b%c (all 64-bit unsigned ints, c is a “runtime” constant); the current way to approach it is Montgomery multiplication. Just saying…
Yes, further investigations are warranted. On Cannon Lake the results are similar (broadly speaking), but the division instruction is much faster.
Theo says:
Thanks! Do you happen to have any numbers?
foobar says:
Just noting that up to four simple register-register moves per clock cycle can be “free” on Core microarchitectures in the sense they don’t use an execution unit. I’m not certain how much difference these variants make in practice…
Travis Downs says:
Although it doesn’t use an execution unit, it still uses one of the 4 renamer “slots” which are what bottlenecks all recent Intel chips to at most 4 fused-domain ops per cycle. Since recent CPUs are rich in execution units, it is not uncommon that this limit is a bottleneck rather than EUs. Eliminated moves also take up a slot in the ROB, but not in the scheduler.
So I really hesitate to call them free. Very unscientifically I would say that they are something like half the cost of the cheapest real instructions (things like basic integer math).
foobar says:
Good to know, I hadn’t thought renamer slots could be such a bottleneck.
BTW, the better GCC generated compilermod32 is still pretty bad code. It could be trivially rewritten as:
compilermod32(unsigned int):
mov rax, 3123612579
imul rax, rdi
shr rax, 36
imul eax, eax, -22
add eax, edi
ret
foobar says:
Argh, I forgot that plain 32-bit move on amd64 zeroes upper bits of the 64-bit register. Thus the first move can’t be eliminated this way while maintaining correct semantics. The second, though, can.
Travis Downs says:
Yeah this limit of 4 is pretty fundamental and why people still say that Intel chips are 4-wide (since all the other parts of the pipeline are wider these days: 8-wide retirement, 8 EUs, 6 uops from uop cache, or 5 from legacy decoder, etc).
It’s even baked into Intel’s top-down analysis methodology, which is based on the idea of 4 slots per cycle, assigning each slot either to “effectively used” (they call it “retiring”) or to wasted for some reason drawn from a large tree of possible reasons. I.e., if you solve every possible bottleneck in a top-down analysis, you’ll end up at 4 IPC (4 FUPC, fused-uops-per-cycle, but grant me a bit of looseness in the terminology here).
Even nops count against this limit, even though they too do not execute.
Yakov says:
I believe the original code can do uint64_t mod uint32_t, without any changes, except for fastmod_u32’s signature :)
Yakov says:
Wrong of course. I thought that because uint128_t can hold the result of uint64_t x uint64_t, but what matters here is the precision.
The most I could get out of testing was you can do a reliable uintA % uintB as long as A + B <= 64. E.g., uint48 % uint16, or uint52 % uint12.
(In uint52 I mean a uint64 with top 12 bits being 0)
mcortese says:
Wrong. The function called directmod64() returns the quotient, not the remainder.
It does return the remainder.