1. Cos(atan(x)) has a nice identity as 1/sqrt(x^2+1), while sin is x/sqrt(x^2+1). What about implementing it that way? In this way you have a branch less solution which can be vectorized (although your scenario suggests there’s only one vector).
2. it seems to me that you don’t normalize the input vector. is the input circle in the unit disk? What happens if both x and y are lower than cos(3pi/8) then the vector will be zero, and you have the dead zone dependent on your rouding. If you assume the unit disk, can some of the branches be filded together in that scenario?
In this way you have a branch less solution which can be vectorized
My version can be compiled to branchless code and is vectorizable.
it seems to me that you don’t normalize the input vector.
We are assuming here that the direction vector has been normalized (it is a ‘unit’ direction vector).
If you assume the unit disk, can some of the branches be filded together in that scenario?
You can rewrite the the logic to have fewer apparent branches, but all my branches are subject to becoming mere selection (condition moves).
I certainly do not claim that my code is the most efficient possible. In fact, I am sure you can do better !!!
Pierre B.says:
The original code could avoid both sin and cos by computing as value between 0 and 7 and a switch with hard-coded values. Your code would still be faster vy avoiding atan2, but by a lesser factor.
Lorin K.says:
Remove the if, instead use a ternary operator and then we get branchless code when using clang.
Funny that Clang only successfully goes branchless when using ternary operator instead of if.
Why do you need four comparisons in the first snippet? I think two are enough.
I do not need four, I use four. You are correct that two are sufficient. However, we want to entice the compiler to produce branchless code.
Branchless code may not be faster in absolute terms, but it is has consistent (fixed) performance.
Tim Parkersays:
Putting your ‘branch’ results into a two element vector, and casting the (single evaluation) of the comparison to bool() as an index, is also a technique used to more forcibly result in branch-less coding, with a little less reliance on the compiler. Almost certainly not a issue with this small, localised, example but may be useful when you truly want to remove branches.
At least under GCC, the compiled result may be that you move the two values to the stack and then load back the desired value from the stack. It is significantly more inefficient that a conditional move.
Samuel Leesays:
Heads up that in the latest version of your code/the post you seem to have lost the sin(PI/8) constant.
A had a couple of thoughts on this just getting around to tidying.
1) Handling the making the doubles positive for comparison before making putting the sign back can be achieved more efficiently with the use of fabs and copysign: https://godbolt.org/z/aebx9b48a
This should clearly work and give a modest uplift.
2) You actually don’t need to be using floating point at all for this, and possibly can get much better speed using uint64_ts to perform the logic. You can even avoid using flags 🙂 : https://godbolt.org/z/Yxq3o7hnc
I have not tested this code at all, but I think it should work (assuming you have well-formed inputs – obviously doesn’t handle NaNs for example).
Samuel Leesays:
Ah, just read the new code more closely, and understand the change to remove the sin(PI/8) constant – nice! The same could apply to my suggestions
1. Cos(atan(x)) has a nice identity as 1/sqrt(x^2+1), while sin is x/sqrt(x^2+1). What about implementing it that way? In this way you have a branch less solution which can be vectorized (although your scenario suggests there’s only one vector).
2. it seems to me that you don’t normalize the input vector. is the input circle in the unit disk? What happens if both x and y are lower than cos(3pi/8) then the vector will be zero, and you have the dead zone dependent on your rouding. If you assume the unit disk, can some of the branches be filded together in that scenario?
In this way you have a branch less solution which can be vectorized
My version can be compiled to branchless code and is vectorizable.
it seems to me that you don’t normalize the input vector.
We are assuming here that the direction vector has been normalized (it is a ‘unit’ direction vector).
If you assume the unit disk, can some of the branches be filded together in that scenario?
You can rewrite the the logic to have fewer apparent branches, but all my branches are subject to becoming mere selection (condition moves).
I certainly do not claim that my code is the most efficient possible. In fact, I am sure you can do better !!!
The original code could avoid both sin and cos by computing as value between 0 and 7 and a switch with hard-coded values. Your code would still be faster vy avoiding atan2, but by a lesser factor.
Remove the if, instead use a ternary operator and then we get branchless code when using clang.
Funny that Clang only successfully goes branchless when using ternary operator instead of if.
https://godbolt.org/z/so5hhqvMo
Correct.
I am sorry, I didn’t get it. Why do you need four comparisons in the first snippet? I think two are enough.
Why do you need four comparisons in the first snippet? I think two are enough.
I do not need four, I use four. You are correct that two are sufficient. However, we want to entice the compiler to produce branchless code.
Branchless code may not be faster in absolute terms, but it is has consistent (fixed) performance.
Putting your ‘branch’ results into a two element vector, and casting the (single evaluation) of the comparison to bool() as an index, is also a technique used to more forcibly result in branch-less coding, with a little less reliance on the compiler. Almost certainly not a issue with this small, localised, example but may be useful when you truly want to remove branches.
At least under GCC, the compiled result may be that you move the two values to the stack and then load back the desired value from the stack. It is significantly more inefficient that a conditional move.
Heads up that in the latest version of your code/the post you seem to have lost the sin(PI/8) constant.
A had a couple of thoughts on this just getting around to tidying.
1) Handling the making the doubles positive for comparison before making putting the sign back can be achieved more efficiently with the use of fabs and copysign:
https://godbolt.org/z/aebx9b48a
This should clearly work and give a modest uplift.
2) You actually don’t need to be using floating point at all for this, and possibly can get much better speed using uint64_ts to perform the logic. You can even avoid using flags 🙂 :
https://godbolt.org/z/Yxq3o7hnc
I have not tested this code at all, but I think it should work (assuming you have well-formed inputs – obviously doesn’t handle NaNs for example).
Ah, just read the new code more closely, and understand the change to remove the sin(PI/8) constant – nice! The same could apply to my suggestions
Congratulations, you have the fastest approach. It is under 1 ns on my laptop! I estimate it is about 2.5 cycles per conversion…
I agree that you can almost surely do it without using floats but it could be going too far down the rabbit hole.