Converting binary floating-point numbers to integers
You are given a floating-point number, e.g., a double type in Java or C++. You would like to convert it to an integer type… but only if the conversion is exact. In other words, you want to do the conversion and then check that no information was lost.
In C++, you could implement such a function with just a few lines of code:
#include <cstdint>

bool to_int64_simple(double x, int64_t *out) {
  // truncate toward zero
  int64_t tmp = int64_t(x);
  *out = tmp;
  // the conversion was exact if converting back compares equal
  return tmp == x;
}
The code is short in C++, and it compiles down to a few lines of assembly. On x64, you might get the following:
cvttsd2si rax, xmm0      ; convert the double to a 64-bit integer, truncating
pxor xmm1, xmm1          ; zero xmm1 (breaks a false dependency)
mov edx, 0               ; prepare the 'false' return value
cvtsi2sd xmm1, rax       ; convert the integer back to a double
mov QWORD PTR [rdi], rax ; store the integer through the out pointer
ucomisd xmm1, xmm0       ; compare the round trip with the input
setnp al                 ; al = 1 unless the comparison was unordered (NaN)
cmovne eax, edx          ; return 0 if the two values differ
Technically, the code may rely on undefined behaviour: when the double lies outside the range of an int64_t, or is a NaN, the conversion itself is undefined in C++.
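If the undefined behaviour worries you, you can keep the simple approach and guard the conversion with a range check. Here is a minimal sketch (the function name and the bounds are my own); note that 2^63 is exactly representable as a double while 2^63 - 1 is not, hence the asymmetric comparison:
#include <cstdint>

bool to_int64_guarded(double x, int64_t *out) {
  // reject anything outside [-2^63, 2^63); NaN fails both comparisons
  if (!(x >= -9223372036854775808.0 && x < 9223372036854775808.0)) {
    return false;
  }
  int64_t tmp = int64_t(x); // now well defined
  *out = tmp;
  return tmp == x;
}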
You could do it in a different manner. Instead of relying on high-level conversion instructions, you could copy your binary floating-point number into a 64-bit word and use your knowledge of the IEEE binary64 format to extract the mantissa and the exponent.
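For illustration, here is a minimal sketch of that bit-copy step (the helper is my own). Using memcpy keeps the type punning well defined in C++, and the field positions follow the standard binary64 layout: 1 sign bit, 11 exponent bits, 52 fraction bits.
#include <cstdint>
#include <cstdio>
#include <cstring>

void print_fields(double x) {
  uint64_t bits;
  std::memcpy(&bits, &x, sizeof(bits)); // copy the bits of the double
  uint64_t fraction = bits & 0xfffffffffffffULL;    // low 52 bits
  unsigned exponent = unsigned(bits >> 52) & 0x7ff; // next 11 bits (biased)
  unsigned sign = unsigned(bits >> 63);             // top bit
  std::printf("sign=%u exponent=%u fraction=%llu\n", sign, exponent,
              (unsigned long long)fraction);
}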
It is much more code. It also involves pesky branches. I came up with the following routine:
// A bit-field view of an IEEE binary64 number. The exact layout of
// bit-fields is implementation-defined, but this works with mainstream
// compilers on little-endian systems.
typedef union {
  struct {
    uint64_t fraction : 52; // significand bits (without the implicit leading 1)
    uint32_t exp_bias : 11; // biased exponent
    uint32_t sign : 1;      // sign bit
  } fields;
  double value;
} binary64_float;
bool to_int64(double x, int64_t *out) {
  binary64_float number = {.value = x};
  // by how much the 53-bit significand must be shifted right
  const int shift = number.fields.exp_bias - 1023 - 52;
  if (shift > 0) {
    // magnitude of at least 2^53 (this also covers infinities and NaNs):
    // reject, conservatively, since a few such values would still fit
    return false;
  }
  if (shift <= -64) {
    // the significand would be shifted out entirely: only zero converts
    if (x == 0) {
      *out = 0;
      return true;
    }
    return false;
  }
  // restore the implicit leading bit of the significand
  uint64_t integer = number.fields.fraction | 0x10000000000000;
  // if any bit about to be shifted out is set, the value has a fractional
  // part; the shift < 0 guard avoids an undefined shift by 64
  if (shift < 0 && (integer << (64 + shift))) {
    return false;
  }
  integer >>= -shift;
  *out = (number.fields.sign) ? -integer : integer;
  return true;
}
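Before benchmarking, a quick sanity check is in order. The test values below are my own; the code relies on to_int64 as defined above:
#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
  int64_t v;
  assert(to_int64(0.0, &v) && v == 0);   // zero converts exactly
  assert(to_int64(-3.0, &v) && v == -3); // small negative integer
  assert(to_int64(4503599627370495.0, &v) && v == 4503599627370495); // 2^52 - 1
  assert(!to_int64(0.5, &v));            // fractional part: rejected
  assert(!to_int64(1e300, &v));          // far too large: rejected
  std::printf("all tests passed\n");
  return 0;
}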
How does it compare with the simple and naive routine?
I wrote a simple benchmark in which I iterate over many floating-point numbers in sequence and attempt the conversion. I effectively measure the throughput and report the number of nanoseconds per function call. My benchmark does not stress the branch predictor. Unsurprisingly, the naive and simple approach can be faster.
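The benchmark loop looks roughly like the following sketch (simplified; the array size and timing details are my own, not the exact harness):
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> data(1000000);
  for (std::size_t i = 0; i < data.size(); i++) {
    data[i] = double(i); // all-integer inputs: branches stay predictable
  }
  int64_t sink = 0;
  auto start = std::chrono::steady_clock::now();
  for (double x : data) {
    int64_t v;
    if (to_int64(x, &v)) { sink += v; } // or to_int64_simple
  }
  auto end = std::chrono::steady_clock::now();
  double ns = std::chrono::duration<double, std::nano>(end - start).count();
  std::printf("%.2f ns/call (sink = %lld)\n", ns / data.size(), (long long)sink);
  return 0;
}
Here are the numbers I measured: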
| system | long version | simple version |
|---|---|---|
| ARM M1 + LLVM 12 | 0.95 ns/call | 1.1 ns/call |
| Intel 7700K + LLVM 13 | 1.6 ns/call | 1.2 ns/call |
I find it reassuring that the simple approach is so fast. I was concerned at first that it would be burdensome.
It is possible that my long-form routine could be improved. However, it seems unlikely that it could ever be much more efficient than the simple version.