There’s a decent amount of overhead to using CGO. It’s definitely not meant for small, quick functions. You can actually see all the intermediate C wrapper code that’s generated by passing the -work flag to “go build”.
I created a gist for a simple CGO invocation and copied all the files from the work directory:
https://gist.github.com/benbjohnson/f33106b86036bcc72ba7
I have had a lot of success with calling into CGO functions that perform all their work without context switching. Go structs also seem to work well when passed to C (as long as they’re not GC’d) and I’ve used some crazy mixes of Go+C+LLVM and it all works fine.
Tony Wilson says:
An alternative for small functions is to use the Go C compiler or even the assembler. Once you get past the slightly quirky coding, it’s quite ‘nice’. Dave Cheney did a simple intro blog on the subject if I remember correctly. Also the pkg source has a few implementations. Look for .goc files and their .go callers.
@Tony

I’d be very interested if anyone could improve Will’s bitset class. Can you do better than I did?
Tony Wilson says:
I’ll give it a Go – just don’t hold your breath; I’m busy for the next few days.
Aaron Blohowiak says:
Going from 12usec to 9usec would be a 33% improvement. However, I found the time of my whole program to be dominated by GC.
While Will’s library offers InPlace functions, using them with a reused buffer requires copying, which led to a slower overall program.
Finally, by not supporting dynamic bitset growth, I was able to avoid all the work that Willf’s library performs when doing Set() and Test(). This trades code flexibility for performance, but has not been a problem in my use-cases.
Munch says:
The only reason a C++ compiler would outperform Go in this micro-benchmark, and it is a micro-benchmark, is that the built-in compiles to a single CPU instruction on contemporary CPUs. Go doesn’t have built-ins like that. Also, because your C++ code relies on built-ins, it’s actually inlining and unrolling the loop, which Go’s optimizer isn’t doing on its own yet.
I haven’t tested it, but I’m willing to bet that you’ll get faster if you inline the popcount_2 function, and you’ll get faster still if you unroll the loop manually. The former removes the relatively expensive subroutine call overhead, and the latter amortizes the looping overhead.
Munch says:
Actually, I just tested myself — and surprisingly, it makes only a small difference. I was able to reduce my time/op to 11.8us, which is an improvement, but not terribly significant. It’s also very finicky, and noise from multitasking swamps the improvement.
Scratch that idea. 🙂
Munch says:
Sorry; baseline on my machine was 13.3us/op. Missed that important info. 🙂
@aaron

“Finally, by not supporting dynamic bitset growth, I was able to avoid all the work that Willf’s library performs when doing Set() and Test(). This trades code flexibility for performance, but has not been a problem in my use-cases.”
I am not sure you are gaining anything at all by trading away this flexibility. I would like to see a benchmark.
The thing is… the software has to check whether the index is in range. What your code probably does is just fail when it is out of range. Will expands the underlying array.
It is possible that your version is faster, but there is no reason, in principle, that it should be measurably faster.
@Munch

“Go doesn’t have built-ins like that.”

And it is a shame. It would take the language authors a few hours to implement such a built-in, and it could instantly speed up many algorithms.
Tony Wilson says:
At last… My tuppence worth can be found at github.com/tHinqa/bitset
Tony Wilson says:
Added the forgotten clz fallback code and tests. Not sure he clz 0 also undefined for the LZCNT cpu instruction but it shouldn’t be.
Tony Wilson says:
he=is Damn autocorrect
Tony Wilson says:
if Damn human error
Tony Wilson says:
Just found that the CPUID code is missing a MOVL $1, AX at the beginning. Will fix and comment again – sorry about that
Tony Wilson says:
Fixed
Tony Wilson says:
POPCNT instruction tested using software.intel.com/en-us/articles/intel-software-development-emulator, which I came across catching up on my freshmeat/freecode feed. Nice tool; it lets you emulate any Intel CPU. Slow, though: 365 micros for POPCNT, 57 for hamming. Avoid 6.2 on Windows; it’s broken. Get the 2013-11 version.
Ok. I think that’s my QED to leave the rest to people with real machines. It’s been interesting and I’m considering a package of intrinsics for Go now.
C++ is a strong incumbent, but I think Go will gradually get there in terms of raw performance.
What is more interesting and surprising is that Go has ended up eating share from Java and even Python.
Torsten Bronger says:
“In practice, Go has performance […] far inferior to C++ and Java.” – this is way too general in my opinion. In fact, language shootouts normally have two distinct cluster points for performance, and Go is in the “statically-typed” bunch.
There are indeed many cases when Go *is* slow. But people may refer to your article and this sentence, and it is misleading, even with its context.
@Torsten

This sentence was written in 2014. We are in 2017. Go has made progress relative to Java and C++. It is no longer correct to expect that Go will run much slower than Java. Go still has serious limitations compared to Java, but a lot of work has been done.
I revised the sentence to weaken it.