The awk programming language is defined to represent all numbers in (double-precision) floating point. This may have been a more practical decision than I originally expected.
It would be interesting to see how the awka compiler performs versus a native C integer and/or floating point application that is heavy with division. From what I’ve read here, awka could likely beat an a.out in some circumstances.
KWillets says:
While looking into this, I found there is an FIDIV instruction that auto-converts the divisor from int32 to double, but I don’t see Godbolt choosing it for any compiler. It seems to have the same latency as FDIV, but instead I see compilers picking CVTSI2SD followed by FDIV.
Maybe I’m missing something, but FIDIV seems like it should have lower reciprocal latency than the latter two instructions.
Some old note that Agner Fog wrote:
FDIV takes 19, 33, or 39 clock cycles for 24-, 53-, and 64-bit precision respectively. FIDIV takes 3 clocks more.
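For reference, here is a minimal snippet one could feed to Godbolt to see this (my own example, not taken from the post). On x86-64 targets, compilers typically lower it to CVTSI2SD for each conversion followed by a scalar SSE divide (DIVSD), rather than the x87 FDIV/FIDIV forms:

/* Minimal illustration: integer operands divided in double precision.
   On x86-64, compilers generally emit cvtsi2sd + divsd (scalar SSE)
   rather than the legacy x87 fdiv/fidiv instructions. */
double divide_as_double(int numerator, int divisor) {
    return (double)numerator / (double)divisor;
}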
Kendall Willets says:
I thought I saw some architectures in Agner’s tables where FDIV and FIDIV have the same latency, but I didn’t have time to dig into it.
Bryan McNett says:
It occurs to me that the SSE2 instruction DIVPS performs two 64-bit float divides in parallel, and its wider variants would do four or eight in parallel (with AVX and AVX-512, respectively). There is no SSE instruction for integer division.
It’s not always possible to do so many divides in parallel, but it should be possible to adapt your test to use SSE intrinsics and see a corresponding speedup when doing uint32_t division as float64 using these SIMD instructions.
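A minimal sketch of that idea (my own code, shown only as an illustration; it assumes the values fit in the signed int32 range, since the SSE2 conversion CVTDQ2PD is signed, and that they are exactly representable as doubles):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Divide two pairs of 32-bit integers at once by going through double:
   CVTDQ2PD widens two packed int32 values to two doubles, and DIVPD then
   performs both divisions in a single instruction. */
static inline void div2_via_double(const int32_t num[2], const int32_t den[2],
                                   double out[2]) {
    __m128i n = _mm_loadl_epi64((const __m128i *)num);  /* low 64 bits: two int32 */
    __m128i d = _mm_loadl_epi64((const __m128i *)den);
    __m128d q = _mm_div_pd(_mm_cvtepi32_pd(n), _mm_cvtepi32_pd(d)); /* DIVPD */
    _mm_storeu_pd(out, q);
}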
Bryan McNett says:
whoops, DIVPD not DIVPS 😐
Denis says:
Daniel, thank you for the great article!
Do you maybe have some good links on how integer division is implemented in hardware nowadays? I'm mainly interested in the algorithm and the high-level hardware design.
I have some idea of how additions and multiplications are implemented with hardware circuits, but I have no idea whatsoever how division is implemented at the hardware level…
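As a point of reference, hardware dividers are typically built on digit-recurrence methods such as restoring division and SRT, or on Newton-Raphson/Goldschmidt iterations, which is part of why division needs so many more cycles than multiplication. A minimal software sketch of the restoring (shift-and-subtract) recurrence, as an illustration only and not a description of any actual divider:

#include <stdint.h>

/* Bit-serial restoring division: one quotient bit per step, produced by a
   shift and a trial subtraction. Real dividers refine this idea (e.g. SRT
   retires several bits per cycle), but the recurrence is the same.
   Assumes d != 0. */
static uint32_t restoring_div(uint32_t n, uint32_t d, uint32_t *rem) {
    uint64_t r = 0;   /* partial remainder */
    uint32_t q = 0;   /* quotient */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down the next dividend bit */
        if (r >= d) {                   /* trial subtraction succeeded */
            r -= d;
            q |= 1u << i;
        }
    }
    *rem = (uint32_t)r;
    return q;
}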
Bryan McNett says:
Have we looked at the assembly generated from the test program? I am starting to suspect that the compiler auto-vectorized the loop using SSE instructions. If the compiler used SSE2 to auto-vectorize, then it would have emitted the DIVPD instruction, as I mentioned earlier. This instruction has 2x parallelism, and is expected to have roughly double the performance of a non-parallel DIV instruction.
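One way to check (a sketch of the kind of loop in question, not the exact benchmark): compile something like the following with full optimization, e.g. gcc -O3 -S, and look for divpd/vdivpd in the generated assembly.

#include <stddef.h>
#include <stdint.h>

/* If the compiler vectorizes this loop, the double divisions should show up
   as divpd (SSE2) or vdivpd (AVX) instead of one scalar divsd per element. */
void divide_all(const uint32_t *in, double *out, size_t n, uint32_t divisor) {
    double d = (double)divisor;
    for (size_t i = 0; i < n; i++) {
        out[i] = (double)in[i] / d;
    }
}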
Daniel Lemire says:
I posted my code online, so one can verify whether I made a mistake in my analysis:
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2017/11/16/divide.c
Andrew Paterson says:
It’s worth bearing in mind, and this may be related to what Kendall Willets saw, that converting an integer to a floating-point number is not a simple operation. I haven’t benchmarked this since the days when I had a 386, so take that with a grain of salt…
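To illustrate why it is not simple (my own sketch, unrelated to any benchmark): converting a uint32_t to a double amounts to placing the integer in the mantissa of a value with a fixed exponent and then renormalizing, which is what the classic "magic constant" trick makes explicit.

#include <stdint.h>
#include <string.h>

/* Illustration of what an int-to-double conversion has to do:
   0x4330000000000000 is the bit pattern of 2^52, so OR-ing a 32-bit value
   into its mantissa yields exactly 2^52 + x; subtracting 2^52 renormalizes
   the result, leaving x as a double. */
static double u32_to_double(uint32_t x) {
    uint64_t bits = 0x4330000000000000ULL | (uint64_t)x;
    double magic;
    memcpy(&magic, &bits, sizeof magic);  /* reinterpret the bits as a double */
    return magic - 4503599627370496.0;    /* subtract 2^52 */
}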