Something like this: https://godbolt.org/z/svo8hjq9K may be preferable to the volatile.
Specifically
/* Make the compiler believe that something it cannot see is reading
this value either from a register or directly from its location in
memory. */
#define extern_write(p) asm volatile("" : "+rm"(p) : : )
bool rounds_to_nearest() noexcept {
  float a = 1.0f + 1e-38f;
  float b = 1.0f - 1e-38f;
  // fake a write to a and b so the compiler
  // cannot elide the compare
  extern_write(a);
  extern_write(b);
  return a == b;
}
A downside is that it is not portable. So a full solution, supporting all C++ compilers, will be more complex.
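For illustration, one shape such a shim could take is sketched below. This is only a sketch: the asm barrier is specific to GCC/Clang, the volatile fallback for other compilers and the opaque / rounds_to_nearest_portable names are assumptions, and the barrier is applied to the input so the additions cannot be constant-folded.

#include <limits>

// GCC/Clang: pretend that something unseen reads and writes the value.
// Other compilers: fall back to a volatile round trip.
#if defined(__GNUC__) || defined(__clang__)
#define opaque(p) asm volatile("" : "+rm"(p))
#else
#define opaque(p) do { volatile float tmp_ = (p); (p) = tmp_; } while (0)
#endif

bool rounds_to_nearest_portable() noexcept {
  float tiny = std::numeric_limits<float>::min();
  opaque(tiny);  // the optimizer can no longer assume tiny's value
  return (tiny + 1.0f) == (1.0f - tiny);
}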
Maarten Bosmans says:
This will not work, as the compiler will happily precalculate the sum and difference and load those constants at run time. This is exactly what your godbolt link shows clang doing.
There is also something funny going on with storing and loading the constants multiple times, so it seems the compiler will generate better code with volatile.
anon coward says:
Hey! Good catch, my code is not correct at all.
Here’s a better version in which I actually take a look at the output and I’m pretty sure it’s doing the right thing!
https://godbolt.org/z/rbqjTP5EM
This one uses a less clever but more correct macro to pretend that some instruction read and wrote a general-purpose register, breaking the compiler’s ability to fold constants. The code gen in this version is much better, and much more correct, than the previous broken one.
#define munge(p) asm ("# munge(" #p "[%0])@" : "+r"(p) : : )
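The exact code behind the link isn’t reproduced here, but it is presumably along these lines, with the barrier applied before the arithmetic so the sum and difference cannot be pre-computed (a sketch, not the commenter’s actual code):

bool rounds_to_nearest() noexcept {
  float tiny = 1e-38f;
  munge(tiny);  // compiler can no longer treat tiny as the literal 1e-38f
  return (tiny + 1.0f) == (1.0f - tiny);
}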
This version results in bouncing our values off of a register instead of L1 (as volatile more or less has to do), and performance does seem to be better on both my M1 Arm machine and a rather ancient Ivy Bridge machine I have.
On x86:
$ ./a.out
v 932007
f 341029
v 698101
f 340962
v 698105
f 340962
v 705098
f 326151
done!
But this is all of somewhat dubious usefulness. I only mention it because I’ve switched from using volatile to hide information from the compiler to using macros like this, but I usually inspect the generated assembly much more carefully!
There is a 25% chance that this test will return true when stochastic rounding is being used.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8905452/
A couple comments, in no particular order:
– Loading fmin to a temporary saves a load, at least on GCC.
– With flags such as -ffast-math / -funsafe-math-optimizations, the volatile trick still doesn’t guarantee that the compiler won’t invalidate your code because it sees 1.0f on both sides of the comparison. See the following failed implementation of a rounds_down variant for an example (though not in your code): https://godbolt.org/z/ef7W7b4c7
Just for fun, here are quick implementations of the other 3 “official” rounding modes in the C standard (that is, not “implementation-defined”):
https://godbolt.org/z/x441KdGs1
I’m interested to see a faster variant of fegetround(), supporting all 4 modes, and comparisons against the current standard library version.
By definition, fast-math breaks floating-point standard compliance, so I expect that you are unlikely to both care about standard compliance and allow fast-math.
Michael Walcott says:
Yes, I would imagine using a temporary should improve things for most compilers, because otherwise the compiler is forced to load it twice to preserve the accesses to the volatile variable.
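As a sketch of that suggestion, assuming the article’s version reads a volatile float fmin on both sides of the comparison:

#include <limits>

bool rounds_to_nearest() noexcept {
  static volatile float fmin = std::numeric_limits<float>::min();
  float f = fmin;  // one volatile load instead of two
  return (f + 1.0f) == (1.0f - f);
}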
I get why you want to use “std::numeric_limits<float>::epsilon() / 2” and that is certainly a number that makes more sense in certain respects. But it needs to be slightly smaller than that, since with nearest rounding (1.0f - (std::numeric_limits<float>::epsilon() / 2)) does not equal 1.0f. This is because epsilon is the difference between 1.0 and the next representable number GREATER than 1.0; the gap going toward zero happens to be smaller than that. Example of your rounds_to_nearest failing because of this: https://godbolt.org/z/86e6778xe
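A quick check of that asymmetry (nothing here changes the rounding mode; it only shows that 1.0f - epsilon/2 is exactly representable while 1.0f + epsilon/2 rounds back to 1.0f under the default round-to-nearest):

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
  float half_eps = std::numeric_limits<float>::epsilon() / 2;  // 2^-24 for float
  // The spacing of floats just below 1.0 is epsilon/2, so 1.0f - half_eps is exact:
  std::printf("%d\n", (1.0f - half_eps) == std::nextafter(1.0f, 0.0f));  // prints 1
  // Under round-to-nearest (ties to even) the sum collapses to 1.0f but the
  // difference does not, so a test built on epsilon/2 reports the wrong mode:
  std::printf("%d\n", (1.0f + half_eps) == 1.0f);  // prints 1
  std::printf("%d\n", (1.0f - half_eps) == 1.0f);  // prints 0
  return 0;
}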
That is an excellent point; as an alternative to adjusting epsilon (which would need to vary based on sign), selecting a base value precisely representable in the range (1,2) suffices.
As an example, selecting 1.5f instead of 1.0f gives the desired behavior. See corrected implementations here: https://godbolt.org/z/Mr8enbMWo
Thank you for highlighting the value of code review.
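For reference, a 1.5f-based variant might look something like the following (a sketch only; the godbolt link above has the actual corrected implementations):

#include <limits>

bool rounds_to_nearest() noexcept {
  // 1.5f lies strictly inside (1, 2), where the float spacing is the same on
  // both sides of the value, so under round-to-nearest both the sum and the
  // difference collapse back to exactly 1.5f, while directed rounding modes
  // push them to different neighbours.
  static volatile float fmin = std::numeric_limits<float>::min();
  return (fmin + 1.5f) == (1.5f - fmin);
}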
It might be worth noting that 1e-100 can *not* be represented exactly, and neither can the sum of 1 and 1e-100 (this doesn’t matter for the mechanics of the rounding-mode estimation, but it confused me).
Thank you. I fixed my example to make it easier to follow.