I think that the rationale is part of the GNU GCC documentation, it states that -O2 “performs nearly all supported optimizations that do not involve a space-speed tradeoff”. That is, the -O3 flag may grow the binary size to achieve better speed. And, of course, -O3 may fail to deliver better speed because of the trade-offs involved.
I think that if -O3 delivered consistently better performance, it would be the new default at release time, but I don’t think it works this way.
Tommasosays:
Disclaimer: this is on clang, so maybe it doesn’t matter.
Anyway, looking at the code I noticed that the memory is set to 1 only once and then re-zeroed in all subsequent benchmarks… so perhaps only the first benchmark (std::fill) is doing something.
So I swapped around the memset bench and the std::fill bench and this came out:
page count: 32, volume:0.125 MB
memset(p,0,n) : 9.2219 GB/s
std::fill(p, p + n, '\0') : 39.1376 GB/s
std::fill(p, p + n, 0) : 39.3015 GB/s
memset(p,0,n) : 39.3268 GB/s
std::fill(p, p + n, '\0') : 39.5562 GB/s
std::fill(p, p + n, 0) : 39.6719 GB/s
memset(p,0,n) : 39.4922 GB/s
std::fill(p, p + n, '\0') : 39.8012 GB/s
std::fill(p, p + n, 0) : 39.8272 GB/s
and suddenly, memset is the bad one!
In fact the first bench is the slowest almost every time on my machine.
This is on MacOS 16 + clang… maybe this happens because once a page is zeroed, it’s remapped to the read-only zero-page and subsequent 0-writes don’t do anything?
I changed the code to give each benchmark its own memory area, and that way I get about the same speed for all 3 methods, +/- some (pretty big) random fluctuations on all 3:
char* src = new char[i*3];
memset(src, 1, i * 3);
for (int z = 0; z < 3; z++) {
char* p = src;
bench2(p, i);
p += i;
bench1(p, i);
p += i;
bench(p, i);
std::cout << std::endl;
}
delete[] src;
… although the very first bench*() call (no matter which) is still at 9 gb/s. No idea why 🙂
I agree with what you wrote but it agrees with my post.
Maxim Egorushkinsays:
This is a bug in GNU C++ library std::fill/std::fill_n. Using the argument of the exact correct type for the fill value fixes the bug and makes it use memset.
On one other hand you have memset, which you need to specify the correct size in bytes, despite it taking an int fill value (specifying wrong size for memset is a common bug in stackoverflow questions).
Yes, you can get around the performance issue, as Travis explains. Whether it is a bug is debatable: it has been around forever and it is standard compliant.
But the larger point is that nothing at all in the C++ standard guarantees that you will get the best possible performance with std::fill. This is just one example of many where using the idiomatic approach will fail you.
Please check my previous posts where I show that just about the worse possible way to allocate and initialize memory is via “new()”, at least in GNU GCC.
My argument is not that you should forgo the idiomatic way, or drop C++ in favor of C. My argument is laid out in my post. (Last paragraph.)
Maxim Egorushkinsays:
Well, you are intentionally measuring performance of code hitting the library bug and yet you are not using -O3 optimization, despite targeting for top performance. It is hard to call such an approach scientifically sound, IMHO.
The first link in the blog post is to Travis’ blog post, where he describes the issue in details. I have the following line “He also explains how to fix the C++ so that it is as fast as the memset function.”
Do you disagree with the point I am making? The point is the following: “writing your code in idiomatic C++ is not sufficient to guarantee good performance”.
If you disagree, I have many more arguments. In fact, I have literally hundreds of blog posts on this topic over 15 years.
This happens all the time. It is actually kind of difficult to get bare metal performance out of C++ systematically. It is comparatively quite easy to write idiomatic C++.
Maxim Egorushkinsays:
You are trying to write idiomatic code, but you made an intentional mistake in it by using a wrong type, which resulted in the sub-optimal code path. And to add insult to injury you pessimize it with -O2 instead of -O3.
So, you have idiomatically looking code with a subtle bug in it and that’s what you benchmark.
Please be clear, Maxim, about your disagreement. Here is my statement: “writing your code in idiomatic C++ is not sufficient to guarantee good performance”.
Do you agree or do you disagree?
Maxim Egorushkinsays:
“Writing your code in idiomatic C++ is not sufficient to guarantee good performance”. – I agree, just as well as I agree with “writing your code in non-idiomatic C++ is not sufficient to guarantee good performance either”.
In both cases the code must be bug free, however subtle the bugs are.
May be you are trying to say that passing literal 0 into memset doesn’t create a subtle performance bug, unlike with std::fill, and that can conditionally be true. People compiling with -O3 (like me) wouldn’t experience that bug either.
I value information from most of your posts, nevertheless.
Note compilers would become much better quickly if people take the time to report optimization issues rather than just complaining about them in various blogs…
Note compilers would become much better quickly if people take the
time to report optimization issues rather than just complaining about
them in various blogs…
I am not exactly sure how it works regarding GNU GCC. For example, I know how to double the speed of the random shuffle and I sent a patch. I’m looking forward to see whether it makes it.
Note that the point of my post was not to complain about the compilers.
Wilcosays:
Call it what you like! It is interesting to point out that various C++ constructs can have very different performance characteristics across compilers and even different versions of the same compiler.
However these are examples of optimization inefficiencies which should not happen and are easily fixable in compilers – if only we hear about them!
GCC has a bugzilla where you can report bugs or optimization issues. Contributing patches to GCC requires a copyright assignment, and to get your patches accepted you may need to post them repeatedly if people are busy and they don’t get any attention…
Contributing patches to GCC requires a copyright assignment
Yes, I did the copyright assignment last year. It has been accepted.
Travis Downssays:
My experience across a few compiler and similar projects is that most reported issues are either ignored, or initially discussed then subsequently ignored.
Even compiler developers admit that the best way to get something fixed is to complain about it in a visible way (not saying Daniel is doing this), at which point it gets fixed.
Wilcosays:
Well the best way to get things done in an open source project is to be extremely proactive, so ask questions, file bugs, post patches etc. There are always reports about trivial things which don’t matter much for performance.
A common example is emitting a redundant move in a small function (usually due to argument passing, register allocation/scheduling interactions etc). 100% optimal code is as impossible as perfect register allocation or scheduling.
Compiler experts are busy, and prefer to spend their time on things that are achievable and make a measurable improvement!
Francesco Biscanisays:
I think it is at least arguable that using
std::fill(p, p + n, 0);
where p is a char pointer is not really idiomatic C++, because std::fill is defined in terms of assignments and thus is pretty clear that (formally) a type conversion is going to take place. Still, it’s clearly desirable for an implementation to detect and optimise this usage as well.
(This example reminds me a bit of how people regularly misuse std::iota, std::accumulate, etc. by using the wrong type for the initializer)
At -O3 level GCC insert a call to memset. I checked at godbolt that this optimization is available since GCC 4.8.
True, but my guess is that most code does not get compiled with -O3.
Interesting. Any particular reason -O3 is not used? My first guess would be compile times for large projects?
I think that the rationale is part of the GNU GCC documentation, it states that -O2 “performs nearly all supported optimizations that do not involve a space-speed tradeoff”. That is, the -O3 flag may grow the binary size to achieve better speed. And, of course, -O3 may fail to deliver better speed because of the trade-offs involved.
I think that if -O3 delivered consistently better performance, it would be the new default at release time, but I don’t think it works this way.
Disclaimer: this is on clang, so maybe it doesn’t matter.
Anyway, looking at the code I noticed that the memory is set to 1 only once and then re-zeroed in all subsequent benchmarks… so perhaps only the first benchmark (std::fill) is doing something.
So I swapped around the memset bench and the std::fill bench and this came out:
page count: 32, volume:0.125 MB
memset(p,0,n) : 9.2219 GB/s
std::fill(p, p + n, '\0') : 39.1376 GB/s
std::fill(p, p + n, 0) : 39.3015 GB/s
memset(p,0,n) : 39.3268 GB/s
std::fill(p, p + n, '\0') : 39.5562 GB/s
std::fill(p, p + n, 0) : 39.6719 GB/s
memset(p,0,n) : 39.4922 GB/s
std::fill(p, p + n, '\0') : 39.8012 GB/s
std::fill(p, p + n, 0) : 39.8272 GB/s
and suddenly, memset is the bad one!
In fact the first bench is the slowest almost every time on my machine.
This is on MacOS 16 + clang… maybe this happens because once a page is zeroed, it’s remapped to the read-only zero-page and subsequent 0-writes don’t do anything?
I changed the code to give each benchmark its own memory area, and that way I get about the same speed for all 3 methods, +/- some (pretty big) random fluctuations on all 3:
char* src = new char[i*3];
memset(src, 1, i * 3);
for (int z = 0; z < 3; z++) {
char* p = src;
bench2(p, i);
p += i;
bench1(p, i);
p += i;
bench(p, i);
std::cout << std::endl;
}
delete[] src;
… although the very first bench*() call (no matter which) is still at 9 gb/s. No idea why 🙂
I run the benchmark three times for a reason! 🙂
The first bench will run slowly simply because of the pagefaults to allocate and zero the memory.
I don’t think your analysis is correct, you missed the main point of the original post which is using the right type for the fill value.
Here is
fill
with correct type and -O2 optimization compiling down tomemset
on godbolt.How is my analysis wrong?
I agree with what you wrote but it agrees with my post.
This is a bug in GNU C++ library
std::fill
/std::fill_n
. Using the argument of the exact correct type for the fill value fixes the bug and makes it usememset
.On one other hand you have
memset
, which you need to specify the correct size in bytes, despite it taking anint
fill value (specifying wrong size formemset
is a common bug in stackoverflow questions).Yes, you can get around the performance issue, as Travis explains. Whether it is a bug is debatable: it has been around forever and it is standard compliant.
But the larger point is that nothing at all in the C++ standard guarantees that you will get the best possible performance with std::fill. This is just one example of many where using the idiomatic approach will fail you.
Please check my previous posts where I show that just about the worse possible way to allocate and initialize memory is via “new()”, at least in GNU GCC.
My argument is not that you should forgo the idiomatic way, or drop C++ in favor of C. My argument is laid out in my post. (Last paragraph.)
Well, you are intentionally measuring performance of code hitting the library bug and yet you are not using
-O3
optimization, despite targeting for top performance. It is hard to call such an approach scientifically sound, IMHO.The first link in the blog post is to Travis’ blog post, where he describes the issue in details. I have the following line “He also explains how to fix the C++ so that it is as fast as the memset function.”
Do you disagree with the point I am making? The point is the following: “writing your code in idiomatic C++ is not sufficient to guarantee good performance”.
If you disagree, I have many more arguments. In fact, I have literally hundreds of blog posts on this topic over 15 years.
This happens all the time. It is actually kind of difficult to get bare metal performance out of C++ systematically. It is comparatively quite easy to write idiomatic C++.
You are trying to write idiomatic code, but you made an intentional mistake in it by using a wrong type, which resulted in the sub-optimal code path. And to add insult to injury you pessimize it with
-O2
instead of-O3
.So, you have idiomatically looking code with a subtle bug in it and that’s what you benchmark.
Please be clear, Maxim, about your disagreement. Here is my statement: “writing your code in idiomatic C++ is not sufficient to guarantee good performance”.
Do you agree or do you disagree?
“Writing your code in idiomatic C++ is not sufficient to guarantee good performance”. – I agree, just as well as I agree with “writing your code in non-idiomatic C++ is not sufficient to guarantee good performance either”.
In both cases the code must be bug free, however subtle the bugs are.
May be you are trying to say that passing literal
0
intomemset
doesn’t create a subtle performance bug, unlike withstd::fill
, and that can conditionally be true. People compiling with-O3
(like me) wouldn’t experience that bug either.I value information from most of your posts, nevertheless.
Thanks for the good words.
This issue with std::fill has already been fixed in latest GCC: https://godbolt.org/z/ctoECs
Note compilers would become much better quickly if people take the time to report optimization issues rather than just complaining about them in various blogs…
I am not exactly sure how it works regarding GNU GCC. For example, I know how to double the speed of the random shuffle and I sent a patch. I’m looking forward to see whether it makes it.
Note that the point of my post was not to complain about the compilers.
Call it what you like! It is interesting to point out that various C++ constructs can have very different performance characteristics across compilers and even different versions of the same compiler.
However these are examples of optimization inefficiencies which should not happen and are easily fixable in compilers – if only we hear about them!
GCC has a bugzilla where you can report bugs or optimization issues. Contributing patches to GCC requires a copyright assignment, and to get your patches accepted you may need to post them repeatedly if people are busy and they don’t get any attention…
Contributing patches to GCC requires a copyright assignment
Yes, I did the copyright assignment last year. It has been accepted.
My experience across a few compiler and similar projects is that most reported issues are either ignored, or initially discussed then subsequently ignored.
Even compiler developers admit that the best way to get something fixed is to complain about it in a visible way (not saying Daniel is doing this), at which point it gets fixed.
Well the best way to get things done in an open source project is to be extremely proactive, so ask questions, file bugs, post patches etc. There are always reports about trivial things which don’t matter much for performance.
A common example is emitting a redundant move in a small function (usually due to argument passing, register allocation/scheduling interactions etc). 100% optimal code is as impossible as perfect register allocation or scheduling.
Compiler experts are busy, and prefer to spend their time on things that are achievable and make a measurable improvement!
I think it is at least arguable that using
std::fill(p, p + n, 0);
where p is a char pointer is not really idiomatic C++, because std::fill is defined in terms of assignments and thus is pretty clear that (formally) a type conversion is going to take place. Still, it’s clearly desirable for an implementation to detect and optimise this usage as well.
(This example reminds me a bit of how people regularly misuse std::iota, std::accumulate, etc. by using the wrong type for the initializer)