I spent some time messing with the compiler’s code generation for OnesCount64, and because I am sometimes a flake, I accidentally generated a variant in which it had the conditional branch, but didn’t have the actual popcount instruction.
My microbenchmark was of a loop that unions two 1024-item slices of uint64, and possibly popcounts them. Time on my laptop came out to:
450ns: union the slices, no popcount
520ns: union the slices, popcount instruction with no branch
600ns: union the slices, always-untaken branch, no popcount (oops)
810ns: go’s normal code (branch plus popcount)
What’s interesting to me is that the popcount instructions add 70ns to execution without the branch, but 210ns to execution with the branch for some reason. I don’t know why, but the net result is that with those branches, the function goes from 450->810ns if I do the popcounts (probably too expensive and unacceptable, better to do the popcounts in a separate pass after multiple runs of this). But if it were 450->520, I’d probably do the popcounts in every pass and drop all the additional complexity…
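For reference, here is a minimal sketch of the kind of microbenchmark described (my own reconstruction; the function and variable names and the slice contents are made up, not taken from the original code):

package union

import (
    "math/bits"
    "testing"
)

// unionPopcount ORs src into dst and counts the set bits of the result,
// mirroring the loop described above.
func unionPopcount(dst, src []uint64) int {
    total := 0
    for i := range dst {
        dst[i] |= src[i]
        total += bits.OnesCount64(dst[i])
    }
    return total
}

var sink int

func BenchmarkUnionPopcount(b *testing.B) {
    dst := make([]uint64, 1024)
    src := make([]uint64, 1024)
    for i := range src {
        src[i] = uint64(i) * 0x9E3779B97F4A7C15
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sink = unionPopcount(dst, src)
    }
}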
GC tip (to be gc1.15 soon) already seems to improve the situation:
https://godbolt.org/z/ijn8bP
due to https://github.com/golang/go/commit/fff7509d472778cae5e652dbe2479929c666c24f
The generated assembly is better, since the value is loaded into a register once… but Go still checks the register value each and every time… it is depressing. I realize that these are hard problems, but most other optimizing compilers would do better.
paul says:
I gather your next steps will be a Part 2 blog post outlining the details of your PR to address these complaints?
These are difficult problems unlikely to be solved with a PR or two. You need a concerted effort.
Regarding the PR: Josh is a great hacker. But I am sure he would agree that we can and should do better. Yet he can’t solve it all.
I agree. 🙂
https://github.com/golang/go/issues/17566 is an active issue about improving the in-lining cost model that has been open for 4 years. It actually references a previous blog post of yours from 2017: https://lemire.me/blog/2017/09/05/go-does-not-inline-functions-when-it-should/ A brief summary of that issue: one reason in-lining is poor is that it uses a crude heuristic applied to the AST rather than to the SSA form of the code, where more information is available. There is also a lot of debate around the trade-off between increased code size and improved performance with in-lining.
For the popcount issue, even with the recognition that the feature detection for popcount is a runtime constant, it is still a hard problem to decide on the trade-off between code size and performance. For optimum performance you would want to transform code like:
for (hot loop):
    if (Likely) hasPopCount:
        usePopCountInst;
    else (Unlikely):
        useFallBack;
to
if (Likely) hasPopCount:
    for (hot loop):
        usePopCountInst;
else (Unlikely):
    for (hot loop):
        useFallBack;
That is, hoist the check out of the loop, duplicate the hot loop, and maybe even move the unlikely hot loop into a cold section of code to avoid polluting the instruction cache. If you decide to go down this path, there is now a decision about how high to hoist the feature check (only very small loops? all the way up to the whole function?), which affects the generated code size. I’m not aware of other approaches for a single binary to efficiently support different instructions at run time, are you?
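To make the transformation concrete, here is a rough Go-source sketch of the hoisted form (my own illustration: hasPopCount and fallbackPopcount are stand-ins, and a real compiler would perform this on its intermediate representation rather than in source):

package main

import "math/bits"

// Stand-in for a one-time runtime CPU-feature check.
var hasPopCount = true

// Pure-Go fallback used when the POPCNT instruction is unavailable.
func fallbackPopcount(x uint64) int {
    n := 0
    for ; x != 0; x &= x - 1 {
        n++
    }
    return n
}

// onesCountAll hoists the feature test out of the loop and duplicates the
// loop body, so neither loop carries a per-element branch.
func onesCountAll(xs []uint64) int {
    total := 0
    if hasPopCount { // checked once, not once per element
        for _, x := range xs {
            total += bits.OnesCount64(x) // stand-in for a bare POPCNT
        }
    } else {
        for _, x := range xs {
            total += fallbackPopcount(x)
        }
    }
    return total
}

func main() {
    println(onesCountAll([]uint64{1, 3, 7})) // prints 6
}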
Also, the go implementation of Stream-VByte that you mention is actually a pure Go implementation. I have a version that implements the SSE3 version from your paper at https://github.com/bmkessler/streamvbyte. Go does make it fairly easy to implement assembly optimizations, although the function call overhead from not being able to in-line functions written in assembly means you can’t just use short snippets of assembly and still get good performance.
even with the recognition that the feature detection for popcount is a runtime constant, it is still a hard problem to decide on the trade-off between code size and performance.
You make good points.
Just so we are clear, I realize that these problems are hard.
I’m not aware of other approaches for a single binary to efficiently support different instructions at run time, are you?
In theory, you could use a fat binary approach where you have multiple versions of the code, compiled for multiple platforms, and then you select the right one at runtime. That is what simdjson does. As far as I can tell, it is never done automatically.
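In Go, one can hand-roll something in that spirit today: detect the CPU feature once and bind a function variable at startup, so the hot path pays only an indirect call rather than a per-iteration feature branch. A minimal sketch, assuming golang.org/x/sys/cpu for the feature flag; the two implementations here are placeholders (a real fast path would be specialised code or assembly):

package main

import (
    "fmt"
    "math/bits"

    "golang.org/x/sys/cpu"
)

// Portable fallback, always available.
func popcountSliceGeneric(xs []uint64) (n int) {
    for _, x := range xs {
        for v := x; v != 0; v &= v - 1 {
            n++
        }
    }
    return n
}

// Placeholder "fast" path; in a real library this would be a routine
// compiled or hand-written to use POPCNT (or wider SIMD) directly.
func popcountSliceFast(xs []uint64) (n int) {
    for _, x := range xs {
        n += bits.OnesCount64(x)
    }
    return n
}

// The implementation is selected once, at program start.
var popcountSlice = func() func([]uint64) int {
    if cpu.X86.HasPOPCNT {
        return popcountSliceFast
    }
    return popcountSliceGeneric
}()

func main() {
    fmt.Println(popcountSlice([]uint64{1, 3, 7})) // prints 6
}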
Also, the go implementation of Stream-VByte that you mention is actually a pure go implementation. I have a version that implements the SSE3 version from your paper at https://github.com/bmkessler/streamvbyte
I corrected the link.
Go does make it fairly easy to implement assembly optimizations.
Yes. With the caveat that you need to do it for sizeable functions.
That is precisely why I still think that Go is underrated. Despite all my criticism, I believe it got many things right.
Mangat Rai Modi says:
Following are some other frustrations I have with the Go compiler:
Verbose lambda syntax.
No tail-recursion optimization.
Non-exhaustive switch. This one might be tricky with a duck-typed type system, but some low-hanging fruit can be picked.
I don’t agree with Go’s core belief in fast compile times. I believe adding a few seconds to the compile phase is worth it to make the language safer and less verbose.
David Henderson says:
A good program is explicit in what it intends to do. An important quality. A competent programmer knows which functions are pure and may choose to move those with constant arguments to an initialisation step or to pre-compute constants.
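A small illustration of that kind of manual pre-computation (my own example, not from the comment): a pure computation over constant inputs can be evaluated once at package initialisation instead of inside hot code:

package main

import "fmt"

// popcountTable is computed once at package initialisation; entry i holds
// the number of set bits in the byte value i.
var popcountTable = func() (t [256]uint8) {
    for i := 1; i < 256; i++ {
        t[i] = t[i/2] + uint8(i&1)
    }
    return t
}()

func main() {
    fmt.Println(popcountTable[0xFF]) // prints 8
}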
RAD says:
Perhaps this is an unintentional design trade off. Go does not make a strong distinction between debug and release builds. Code optimizations belong in the release build and logically introduce some degree of overhead that slows compilation in the frequent debug builds.
There is some truth in what you suggest, I am sure.
Frank Wessels says:
With more and more CPUs with many cores coming onto the market (ARM/AWS Graviton2 with 64 cores, AMD EPYC with 48 cores), in an ideal world it would be possible both to keep the crucial-for-productivity short build times and to devote more processing resources to smarter optimizations.
I’d be happy to pay the price of a cup of coffee per day for this (versus running on a large or even medium instance since it doesn’t matter much anyways for development).
Christopher Chang says:
Excellent blog post.
As a rule of thumb, I expect the Go software I write to be ~2/3 as fast as what I could write in C. I’ve used cgo and assembly to do better, but neither is currently good enough for me to want to use them in personal projects; I’m still more productive writing everything in C/C++ whenever it’s worth the effort to do better than 2/3-speed. But this isn’t due to some fundamental limitation of the language, it’s almost entirely driven by compiler weaknesses like the ones you describe here.
ylluminate says:
Keep your eye on V then (Go done right with the speed of C and safety of Rust, etc.):
https://vlang.io/
https://github.com/vlang/v
Discord is great: https://discord.gg/vlang
Found your article through a comment from the creator of V on Discord: https://discord.gg/vlang (see this specifically)
Nice observations!
A comment on reading comments:
When I click on “Comments” at the end of the email, I go to the start of the article instead of the comments. I then have to scroll through the already-read article to get to the comments.
A small thing, but I find it annoying.
Using Brave on OS X 10.13.6.
Twirrim says:
What are the drawbacks of in-lining code too aggressively?
Binary size seems like the most obvious thing to me, but are there performance risks from it?
The two big drawbacks are bigger binaries and slower compilation; the effects there are very noticeable. And surveys of gophers consistently show more concern about binary size and compilation speed than performance.
Binary size can actually improve due to increased and/or smarter inlining (although it often does not). Compilation speed never does. Disabling inlining speeds up compilation by 10%. The challenge is not just to increase inlining, which is easy, but to use our (implicit) inlining budget better, which is not easy.
In my experiments, performance degradation due to increased inlining is rare but possible, at least on amd64. One extreme example occurs with helper functions in machine-generated code. The compiler itself uses many of these, and increasing inlining can cause them to blow up, leading to noticeable performance regressions.
Of course, a reasonable solution would be for the compiler to accept inlining hints.
That’s a complicated topic. 🙂 See https://github.com/golang/go/issues/21536 for a flavor of the discussion around it.
I was aware of this thread, and I had read your reply some time ago.
Not that it matters, really–we’re just two people chatting in the comments of a blog–but I’m personally more open to annotations and hints now than I was when I first replied on that issue. The no-knobs philosophy is predicated on having the toolchain eventually be good enough that the knobs don’t buy you much.
I’ve come to believe that there are a few places where toolchain progress is extremely difficult due to trade-offs vs:
compiler performance (e.g. powerful BCE)
binary size (e.g. inlining)
usability (e.g. FGO/PGO)
available compiler programmer hours (lots of things)
difficult API design (e.g. vector instructions without assembly)
stability (e.g. if we change inlining heuristics and 90% of programs benefit and 30% regress we will spend an untold amount of time and energy dealing with those regressions)
In some such cases, I’m increasingly open to providing a pressure release valve such as inlining annotations or unsafe.Unreachable (https://github.com/golang/go/issues/30582). However, I am not the decision maker; not even close.
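For context, the only inlining knob the gc toolchain currently exposes is the //go:noinline directive, which suppresses inlining of the function it precedes; there is no corresponding hint to force inlining. A trivial example of my own:

package main

// add is kept out-of-line even though it is small enough to inline;
// the directive below is mostly used in benchmarks and compiler tests.
//
//go:noinline
func add(a, b int) int {
    return a + b
}

func main() {
    println(add(1, 2)) // prints 3; the call is a real call, never inlined
}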
Knobs notwithstanding, I don’t think the dream of better inlining in Go is dead, despite the widespread despair. If we move inlining later in the compiler, we will be in a position to do much better. And it will be easier to improve the heuristics. And it’ll provide enough disruption that we can sneak in a bunch of other improvements under the same umbrella.
As long as I’m writing a novel, I might mention that one source of despair is that to a first approximation no one is even working on inlining. I don’t know why the Go team has not prioritized that; I have very little insight into their decisions. I can say that I am not because making meaningful progress requires more time more consistently than I can dedicate as a just-for-fun volunteer. I will note, though, that “no one is working on it” is only true until it is not. 🙂
This is exactly my complaint about Go. I posted about it two months ago: https://www.reddit.com/r/golang/comments/fp9mrk/how_much_performance_headroom_is_left_longer/
They ruthlessly optimize for compile time, which values the programmer’s time over the users’ time. Since there might be thousands or millions of users, and a program might run millions of times, this never made any sense to me.
I want to let applications compile over a weekend for maximum optimization if it’s something that will run thousands or millions of times, or run for thousands of hours. It would be great to have an alternative compiler for Go, a true and modern optimizing compiler. I’ve thought about the business prospects for one, a proprietary compiler and toolchain – seems like there’s a market for one, and the Google compiler just isn’t very good. It doesn’t leverage modern CPUs, and it doesn’t seem to support Whole Program Optimization / LTO, or Profile Guided Optimization. LTO is less relevant for Go than C/C++, but there are still some opportunities.