Can confirm the same on OS X’s clang-600.0.54.
Aha: https://llvm.org/bugs/show_bug.cgi?id=1488
It’s been in there for a while apparently.
Note, however, that this optimization in LLVM is highly specialized: it pattern-matches against the exact implementation you gave and turns it into a popcnt. There is no magic going on, and any number of semantics-preserving rewrites of the C code will cause LLVM to miss the optimization, because the pattern is no longer recognized.
The one that caught me off guard while trying to get some O(n) loops to benchmark was the automatic conversion of sum(i^n) to polynom(n): https://godbolt.org/g/p6RxBh
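In case it helps to see it without following the link, here is a minimal sketch of the kind of loop that gets replaced by a closed-form expression (the function name sum_below is just for illustration, and the exact output depends on the compiler version, so it is worth checking on godbolt.org):

#include <stdint.h>

/* A naive O(n) summation loop.  Clang's induction-variable analysis
   typically replaces it with a closed-form expression, roughly
   n * (n - 1) / 2, so timing it as an O(n) benchmark is misleading. */
static uint64_t sum_below(uint64_t n) {
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; i++) s += i;
    return s;
}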
I probably wouldn’t understand an in-depth explanation, but I would still like to know – at least on a superficial level – how clang does this. It seems to me that an extremely sophisticated algorithm would be needed to achieve this.
What if you change the code slightly? Like change x != 0 to x != 1 or something.
It does that by transforming the AST into a normal form and then performing pattern matching on it, trying to find common subsections of code it can optimize. It can work wonderfully, but all these patterns are handcrafted, and to be really useful they often require the programmer to know how to write the code in a way that makes it easy for the compiler to recognize them.
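To make that concrete, here is a semantically equivalent bit count written with a shift-and-test loop instead of clearing the lowest set bit (the name popcount_shift is just for illustration). Whether a given clang or gcc version still turns this variant into a popcnt is not guaranteed; it is easy to check on godbolt.org:

#include <stdint.h>

/* Computes the same result as the n &= n - 1 version, but the loop shape
   is different, so the idiom recognizer has to match a different pattern
   and may simply not have a rule for it. */
static unsigned popcount_shift(uint64_t x) {
    unsigned count = 0;
    while (x) {
        count += (unsigned)(x & 1);
        x >>= 1;
    }
    return count;
}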
You might find this section of the LLVM code interesting:
https://github.com/llvm-mirror/llvm/blob/f36485f7ac2a8d72ad0e0f2134c17fd365272285/lib/Transforms/Scalar/LoopIdiomRecognize.cpp#L960
All the idiom recognition stuff is pretty interesting.
Compilers look for certain patterns in the logic of what the code does.
A simpler example of this is rotate instructions, where the common idiom is x>>3 | x<<(32-3).
Compilers have recognized that as a 32-bit rotate instruction for a long time.
(There are caveats, though, especially for variable shift counts: http://stackoverflow.com/questions/776508/best-practices-for-circular-shift-rotate-operations-in-c)
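The workaround discussed in that Stack Overflow thread is to mask the shift count so the expression has no undefined behavior even for a count of zero; a sketch of the usual form (the name rotl32 is mine):

#include <stdint.h>

/* Rotate left by n bits with no undefined behavior: masking keeps both
   shift counts in the range 0..31, and recent gcc/clang versions typically
   still recognize the pattern and emit a single rotate instruction. */
static inline uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}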
You might be interested in seeing what’s happening with synthesizing superoptimizers, exploiting the big advances that have been made in SAT solvers.
http://blog.regehr.org/archives/1219
Does it do the same for any other standard popcount algorithms?
Current versions of clang and gcc will also optimize this (the K&R implementation for counting bits) to popcount (I tested with the Godbolt compiler explorer):
#include <stddef.h>
#include <stdint.h>

/* Kernighan's bit count: each iteration clears the lowest set bit of n. */
static inline size_t _count_bits(uint64_t n) {
    size_t count;
    for (count = 0; n; count++) n &= n - 1;
    return count;
}
Another example is this code, which performs a bitwise left shift with wrap (unlike the << operator), and is optimized to either a rotl or rotr instruction depending on the shift size.
/* Rotate-left idiom; clang and gcc compile it to a single rotate instruction.
   Note: k should be in 1..63, since x >> (64 - 0) would be a full-width shift, which is undefined behavior. */
static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}
I really wish there were a solid cross-platform library for bitwise operations like these which resolve to native instructions where possible, and fall back to decent software implementations on architectures where those instructions aren’t present. Maybe there is and I don’t know what it’s called, or there is a bigger picture I’m missing.
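In the meantime, a thin wrapper around the compiler builtins with a plain C fallback gets most of the way there. A sketch, assuming GCC or clang (the name portable_popcount64 is made up for the example; other operations would get similar wrappers):

#include <stdint.h>

/* Population count: use the GCC/clang builtin where available (it lowers
   to a popcnt instruction when compiled for a target that has one),
   otherwise fall back to the same n &= n - 1 loop shown above. */
static inline unsigned portable_popcount64(uint64_t n) {
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcountll(n);
#else
    unsigned count = 0;
    for (; n; count++) n &= n - 1;
    return count;
#endif
}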
@Anonymous, the code is here:
https://github.com/llvm-mirror/llvm/blob/ae889d36724efb174e8cc05d26e655c4c4ab8867/lib/Transforms/Scalar/LoopIdiomRecognize.cpp#L1077-L1123
x &= x - 1 is the first example in Warren’s book “Hacker’s Delight”.
I think one measure of optimization quality is producing the same code for all programs that do the same thing:
http://shape-of-code.coding-guidelines.com/2011/07/24/compiler-benchmarking-for-the-21st-century/
That’s not magic when you have a basic understanding of Hamming weight.
C is a general-purpose language. You can use it as you like.
Disable all compiler optimization and then look at the asm code.
Some cool compiler optimization stuff going on with Spark recently:
https://blog.acolyer.org/2016/05/23/efficiently-compiling-efficient-query-plans-for-modern-hardware/
And how many CPU cycles does it take for popcnt to complete? The compiler might save on program size, but you’re ultimately dependent on microcode for non-trivial instructions. That’s not magic either.
popcnt has a throughput of 1 instruction every cycle on recent x64 processors.
> What does that mean? It means that C is a high-level language.
No. This means that C has high-level compilers, if anything.
They happen to recognize the pattern (or patterns) in the code outlined as special cases for optimization, and so manage to optimize it into a single assembly instruction.
C is still as low level as it was.