I am certainly not an expert in C++. However, if I remember correctly, std::endl is a lot slower than using \n. Of course, you may need to use std::endl. I wonder how the benchmark changes when using \n?
Anonymous Coward says:
This isn’t exactly news. The C++-specific printing facilities are known to be less efficient than plain old println(), and have been known to be slower for decades.
Jonas Minnberg says:
Remove the std::endl and put the \n in the string like the C version, and it should go faster…
mariusz says:
Exactly.
Alf Peder Steinbach says:
Well that’s a false meme, associative thinking. `endl` just causes a call of `flush`. At some point before end of `main` the stream is flushed anyway, so, net win = one function call and check.
Schrodinbug says:
No… I think that’s fake news… I’ve heard a lot of people say that std::endl is a newline with a flush, but that is either not exactly true or at least implementation-defined.
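For reference, the standard does pin this down: std::endl is specified to write '\n' and then call flush(). A minimal sketch showing that the two spellings produce the same bytes (on a string stream the flush is a no-op):

```cpp
#include <ostream>
#include <sstream>
#include <string>

// std::endl is specified as: write '\n', then call flush(). On an
// ostringstream the flush is a no-op, so the buffered bytes are identical.
std::string with_endl() {
    std::ostringstream os;
    os << "hello world" << std::endl;
    return os.str();
}

std::string with_newline() {
    std::ostringstream os;
    os << "hello world\n";  // same bytes, no flush
    return os.str();
}
```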
John Keubler says:
I try to stick with C and use macros if needed to enhance the language. There is something to be said for sticking with simplicity. C++ is too complicated and bloated. OOP is OK, but I much prefer functional programming using just functions.
Mirko says:
This code is a really bad comparison.
This gives the idea that C++ is bloated and slower (it is not, actually it is faster in real code than C).
And then you have people like this coding in the stone age justified with memes.
When problems are large or complex, the OO C++ features simplify your code to a very large extent.
Functions are fine, but associating them with the proper data is cumbersome in C, simple and scalable in C++, based on Classes, their extensions or generalization, their relationships, and their instances. Abstraction is the reason why C++ was created, and it delivers that, hence the power and simplicity of its code.
Real “Functional Programming” isn’t supported by languages as basic as C. Consider exploring languages that are built for Functional Programming, they would give you more power in a world you already like.
Alex Chen says:
Isn’t this, to some extent, testing the streaming I/O part of the STL in C++, instead of the language itself? For what it’s worth, std::cout and std::endl probably do more (like flushing the buffer) than printf under the hood, which could potentially account for the 1 ms increase in execution time.
Chris says:
It is a well established fact that C++ does not provide a zero overhead abstraction unfortunately.
Note that many features of C++ in fact do provide (+-) zero overhead abstractions.
I have a concern about your conclusion here — not that it’s necessarily wrong, but that this test is incomplete. Specifically, this test does nothing to differentiate between execution time and function call time.
If we’re looking at 1 ms overhead every time you print to console, I’ll grant that’s significant. But if we’re looking at 1 ms per execution? I can’t rightfully agree with your conclusion that this is significant. Yes, granted, we’re talking about a 200% increase in the execution time for Hello World, but in 2022, I cannot think of a real-world situation where anyone would be executing hello-world equivalent software with such frequency that it creates a cpu bottleneck. Not even in the embedded space.
I haven’t tested it yet (I might), but my guess is the performance difference you’re seeing takes place in loading the module, and if you were to print to console 10,000 or 100,000 times per execution, you’d still be looking at about a 1 ms difference per execution. I’m basing this guess on the fact that we’re seeing such a significant performance increase in the statically linked c++ version and the knowledge that in a Linux environment, there’s some decent chance that stdio.h is preloaded in memory while iostream is not.
Obviously, my hunches are not data, and more testing is required before we draw any conclusions here.
The other question I have is whether you’re running hyperfine with the -N flag. Without it, on processes this short, it’s kicking the following warning at me:
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Which seems potentially relevant.
I might be back later with followup results.
Pa says:
Endl is slower than “\n”; you should try it again to see if it makes any difference.
Charles says:
Try removing stdlib in both programs and return 0 instead. Also use \n in the C++ program instead of endl. I would be interested in seeing the results of that.
Jakob Kenda says:
There is a difference in your C++ code as opposed to C code, and that is the std::endl statement, which flushes stdout. There is no flushing in the C code. For the code to be equivalent, the C++ statement should be
std::cout << "hello world\n";
I’ve done some followup testing. It appears that my concerns with the methodology were unfounded, but I have since seen some other critique of your methodology that I have not explored.
In your updated C++ code (multi_hello.cpp), you should also replace std::endl with “\n” as previously suggested here. I suspect this may have a much larger impact on the results due to flushing after each print for 30000 iterations.
Interested in seeing updated results!
Mark Rohrbacher says:
Hello Mr. Lemire,
IMHO, the comparison of those two snippets isn’t very fair, as the C++ code does a bit more than the C code:
To make the two programs more comparable, you should either replace the C++ streaming with
std::cout << "hello world\n";
or add a
fflush(stdout);
to the C program.
In my tests, both hellocppstatic and hellocppfullstatic were faster than helloc, with both of these changes, hellocpp was slower. However, as my machine wasn't completely idle, these results may be inaccurate.
But let's go a step ahead:
If you omit the printf / flush / cout streaming, just leaving the "return EXIT_SUCCESS" (and the includes), the C++ program will most probably be slower. This is because of the static initialization of std::ios_base (std::ios_base::Init::Init() gets called on program startup as soon as <iostream> gets included).
It’d be interesting to see the results after removing this include, as the object code of the hello.c and hello.cpp should be totally equal.
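The mechanism described above can be sketched with the same idiom the library uses; the names below (InitSketch, stream_setup_ran) are illustrative stand-ins, not real library symbols:

```cpp
// Sketch of the std::ios_base::Init idiom: including <iostream> injects a
// static object into each translation unit; its constructor runs before
// main() to set up the standard streams, and its destructor flushes them
// at exit. InitSketch and stream_setup_ran are illustrative names only.
static bool stream_setup_ran = false;

struct InitSketch {
    InitSketch() { stream_setup_ran = true; }  // stands in for stream construction
    ~InitSketch() { /* the real Init flushes std::cout here */ }
};

static InitSketch ios_init_sketch;  // runs during static initialization
```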
Best regards
– Mark
Matti Laa says:
“This is because of the static initialization of std::ios_base (std::ios_base::Init::Init() gets called on program startup as soon as <iostream> gets included)”
This. Static initialization and destruction happen if the iostream header is merely included, even if it is never used. Using stdio.h and printf instead of iostream gives you exactly the same assembly output in both languages. Latest GCC release output:
.LC0:
        .string "Hello world"
main:
        sub     rsp, 8
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        call    printf
        xor     eax, eax
        add     rsp, 8
        ret
But yeah, overall I think this is a good example that all the features C++ offers over C are not free. You have to understand your libraries, your own code (of course ;) and sometimes even how the compiler works, if optimizing CPU usage is your first priority.
Tim Parker says:
This is such a beautiful example of measuring something while understanding almost nothing about what the measurements mean that I shall be using it as an example for our new starters on the pitfalls of premature optimisation and the importance of meaningful test structures and data.
Dave says:
This is based on biased info from decades ago.
99.99999% of C++ programs used for professional applications in this world do not use standard out (or err) to convey runtime status.
C++ apps are easier to develop than C apps and have richer features, so I’m not sure what you are driving at.
Oh, C++ apps are oftentimes deployed in embedded (or server) environments… where there is definitely no I/O to a terminal.
The blog post is specifically about “hello world”.
If you mean to refer to large programs, then I agree, but it happens often enough that we have to run small commands that only do a few microseconds of work.
Tim Parker says:
You’re not measuring what you think you are.
Justin M. LaPre says:
Did you try passing -fno-exceptions and -fno-rtti? That may impact your numbers as well.
Evan Teran says:
The C++ program is doing more work than the C program.
You should avoid using `std::endl` unless you specifically intend to flush the buffers explicitly. There’s nothing wrong with using a simple newline character.
But also, IO streams are known to be measurably slower than printf. Especially since it has hidden global constructors and destructors.
std::format is the new, modern way to write formatted strings.
So, it’s not really that “hello world is slower in C++”; it’s that the methods you’ve chosen to perform the task in C++ are by nature slower (but offer better type safety and internationalization capabilities).
For the simple task of printing “hello world”, honestly you should just use puts.
<< std::endl inserts a newline AND flushes the stdout buffer, which I don't believe printf() does.
It would be interesting to see the comparison without << std::endl; since flushing the buffer is a relatively costly operation, that should give you a better apples-to-apples comparison. I'm no expert though.
This is not an accurate comparison: endl also includes a flush, which is not necessary here and adds unnecessary time. You could just as easily have used “\n” in the C++ version the same way you did in the C version.
yueshan says:
cout does lots of things you should know about.
Jeff Bailey says:
iostreams are not a minor bit of infrastructure.
If you want to compare program startup time, use printf in the C++ version as well.
You should be able to look at the assembly output to make a good comparison. That’s a better view of what’s happening and why.
Richard Cervinka says:
There is no difference between std::endl and ‘\n’ because std::cout is flushed at the end of the application.
zahir says:
IMHO it is all about linking with libstdc++. In the first version of the code I only replaced the std::cout… line with the printf line from the C version (without changing includes or linking directives), and the results for C++ did not change on my computer.
I ran a perf record/report on that version and, unlike with C, at least 30% of the time was being lost on locale functionality. My guess is that not linking to libstdc++ removes the underlying C++ locale functionality from printf.
Measurements were on my 10 year old machine.
I wonder what will change if we link with/to clang/libc++ though.
If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.
I’m glad neither I, nor my children, attended the University of Quebec if this is how professors spend their time. You conclude that:
“.. if these numbers are to be believed, there may a significant penalty due to textbook C++ code for tiny program executions, under Linux.”
then, in a later comment response, state:
“The blog post is specifically about “hello world”.”
If it’s the latter, then the former conclusion is invalid. You cannot infer that tiny programs under Linux will perform slower, using C++ rather than C, on the basis of a one line example where the method used is different.
There are multiple comments addressing the specifics of the differences, and reasons for them, but, if I were you, I’d take this blog post down as it makes you look foolish.
Niclas says:
I have a lot of respect for your work, so this blog post is quite baffling and saddening. What exactly are you getting at or aiming for?
”there may a significant penalty due to textbook C++ code for tiny program, under Linux.”
BS, and you’re comparing apples to oranges. Read up on what cout actually does. Is your printf thread safe? (You can turn off sync_with_stdio for the standard streams if you want that monster to be faster.) std::printf is also maybe worth mentioning.
Mirko says:
This code doesn’t show C++ being slower than C.
Rather, this is “iostream with stdio sync on printing two strings” being slower than “printf for the trivial case of a string”. No news here.
Drue says:
Everyone else has already mentioned how flawed this is.
But a better test would be to compare two computationally intensive algorithms or generics, written properly in each language.
Students using C++ streams in programming contests are drilled to begin main with
[code]
ios_base::sync_with_stdio(false);
cin.tie(nullptr);
cout.tie(nullptr);
[/code]
. . . in order to achieve printf/scanf performance.
Thanks wqw! I was aware of sync_with_stdio but I’ve never seen tie before.
It’s always a pleasure to learn something I could use someday 🙂
tetsuoii says:
What you have discovered is just the tip of the craptastic bloatberg that is every other language not C.
Jack Mazierski says:
If all you write is hello world then all you need is C.
Only that we are not in 1992. C is quite useless for user mode apps nowadays and no one creates console apps except Linux freaks that have nothing else to write.
This is a micro-benchmark that illustrates a simple point. I do not believe Daniel is going after any massive generalizations.
Oh, and all the comments about flushing the I/O buffer… a moment of thought should have told you the examples were equivalent. While it has been a couple of decades since I dug into runtime libraries, pretty sure every runtime must flush buffers on program exit.
Put differently…
Did you see the output?
Then the runtime library flushed output buffers on exit.
Yes, loading dynamic libraries is more expensive. Often this does not matter, but sometimes it can be significant. There is or should be a savings in memory used (across multiple programs using the same libraries), and this can sometimes be significant.
The savings from shared dynamic libraries was critical in the Win16 era, and for some time after. On present many-gigabyte machines, rather less so. (In this century, I have tended to use static libraries more often than dynamic.)
The C printf() and stdio library was honed decades ago on much leaner machines, and (as you might expect) is lean and efficient. If you dig back into the USENET archives, you can find a period (late-1980s / early 90s?) where there was a bit of a public competition to see who could come up with the leanest stdio library. That code ended up in compiler runtime libraries, and I strongly suspect survives to the present (and offers examples of hyper-optimization).
The C++ standard streams library arrived on fatter machines, and never received such attention (in part as you can use C stdio).
Daniel’s experiment matches well with history.
Tim Parker says:
“This is a micro-benchmark that illustrates a simple point. I do not believe Daniel is going after any massive generalizations.”
With respect, the claim was made in the article that “.. if these numbers are to be believed, there may a significant penalty due to textbook C++ code for tiny program executions, under Linux.”
Disregarding the strict meaning of ‘may’ – which would make the whole statement a semantic null – this is (IMO) quite a massive over-generalization. It is a micro-benchmark, and a poorly considered and written one at that, and there is effectively no meaningful generalization possible from it – as has been pointed out by many in these comments. That the author subsequently states that this was specifically about “hello world” is not properly reflected in the main text, even now.
It also seems to expose a deep lack of knowledge not only of what the programs are doing and what the objects and functions are designed to do – with their benefits and deficits – but also of C++.
I’ve seen renderers and whole micro-kernels constexpr’d – which is harder to do in C – with enormous performance benefits, but that’s not the point, nor is it necessarily a reason to choose one language over the other. They were particular implementations, for particular purposes, but they do demonstrate aspects of a language that could be useful in many situations, and which should not be over-generalized from. This is the most egregious issue for me: the apparent attempt to classify language suitability on a frankly meaningless code snippet which is hardly an example of any useful real-world program. This is something we try to stamp out even in the newest of starters, and from a professor of computer science it seems quite ridiculous to me. YMMV obviously.
Stephen Tran says:
Can you do a test to show that for small programs similar to “Hello World” but not necessarily the same C++ runs as fast as C if not faster? This would settle the issue. Wouldn’t it?
Tim Parker says:
At an extreme, you could try something like this https://onecompiler.com/cpp/3wdmzd9js
(or try Googling for ‘constexpr fibonacci’)
Re-working that as C should give an indication of what can be done, but – like the article – it’s really missing the point, and I could probably equally well make a C++ version that is far worse **.
One of the main reasons that individual, micro-benchmarks like this aren’t useful for answering questions like “is language X faster than language Y ?” is that the question is completely meaningless.
What we can do is ask: for my particular problem space *and* my typical data sets / operating conditions, what would a good choice of language and strategy be? If, for example, you were designing an ultra-high speed / low latency peer-to-peer message passing system, you probably wouldn’t choose Python. However, if you wanted to implement a simple peer-to-peer client-server application, then Python, with its interpreted nature and rich library support, would make such a thing relatively trivial. It’s exactly these sorts of evaluations that should be driven into programmers, and first year computer science students in Quebec and elsewhere, from day one.
Using massively simplified and atypical noddy code fragments – especially when naively implemented – is not really helpful or instructive, and mainly serves to teach people bad coding practice and poor performance analysis techniques, IMO.
** These are poor Fibonacci number generators, so don’t use them in any performance sensitive regime; they’re just an example 🙂
rhpvorderman says:
“The C printf() and stdio library was honed decades ago on much leaner machines”
It is still being honed. memchr, for instance, uses SSE2 instructions on x86-64 machines. These instructions became available only long after both C and C++ gained widespread adoption. memchr beats std::find: https://gms.tf/stdfind-and-memchr-optimizations.html
Glibc is much more optimized than libstdc++ simply because it is much smaller, and therefore developers can devote more time to optimization.
The truth is that abstractions come at a cost of complexity and size which makes it harder to optimize. “Zero-cost abstractions” may be true in a few cases, but there will always be cases that are too hard or time-consuming to look into. It is a simple matter of tradeoffs.
Catron says:
You didn’t say which compiler you used. I assume it was gcc. In gcc, “printf” is one of the built-in functions. This means there is no library involved at all (neither dynamic nor static). It’s practically part of the language, and the #include is just for syntax reasons.
I didn’t read the internals, but I assume that gcc doesn’t call a classical printf at all but optimises it at the compiler level, e.g. does the formatting at compile time and uses the ‘write’ syscall directly.
This program will run faster than any C program written using the standard qsort library function:
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>
using namespace ::std;
constexpr int ipow(int base, int exp)
{
    if (exp == 0) {
        return 1;
    } else if (exp < 0) {
        return 0;
    } else if (exp % 2) {
        return base * ipow(base, exp - 1);
    } else {
        return ipow(base * base, exp / 2);
    }
}
int main()
{
    vector<int> foo(ipow(2, 30));
    random_device rd; // Will be used to obtain a seed for the random number engine
    mt19937 gen(rd()); // Standard mersenne_twister_engine seeded with rd()
    uniform_int_distribution<int> dis;
    generate(foo.begin(), foo.end(),
             [&gen, &dis]() {
                 return dis(gen);
             });
    cerr << "Sorting.\n";
    sort(foo.begin(), foo.end());
    return is_sorted(foo.begin(), foo.end()) ? 0 : 1;
}
It is true that C++ has many advantages over C as far as algorithmic implementation goes.
Your program allocates gigabytes of memory and sorts it. If you reduce the task to sorting 12 numbers, the answer might be different, and that’s the motivation of my blog post.
It isn’t that the implementation of sort is better in C++, it’s that you can’t reasonably make a version of the qsort function in C that runs faster than sort in C++.
And this is because the sort function in C++ is a template, and the compiler essentially writes you a custom one for the data structure and comparison function you’re using in which the comparison and swap functions are inlined into sort and then subjected to aggressive optimizations.
Making this happen in C would require macro magic of the highest order, and even then would probably be a huge pain to use correctly.
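The contrast can be sketched side by side; both functions produce identical results, and the performance difference comes from the comparator being an opaque indirect call in one case and inlinable in the other:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort reaches the comparator through an opaque function pointer, so every
// comparison is an indirect call; std::sort instantiates a template, so the
// comparator can be inlined straight into the sorting loop.
int cmp_int(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

std::vector<int> sort_c_style(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
    return v;
}

std::vector<int> sort_cpp_style(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```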
Your “Hello world” case reads like a general criticism of C++, when I strongly suspect that C++ is faster than C in most cases because of things like I just mentioned. So, it seems like a criticism that’s narrowly tailored to make a point that I don’t think is particularly accurate.
After an extended play with this, I would say that the library/flushing issues that most commenters raised aren’t anything to worry about. All the significant differences in timings seem to be due to the dynamic linker.
C code that dynamically links to libc takes ~240µs, which goes down to ~150µs when statically linked. A fully dynamic C++ build takes ~800µs, while a fully static C++ build is only ~190µs. Across all of these, the difference between printing one “hello world” vs 1000 is only ~20µs.
Getting good timings was the hardest thing here! Code/analysis are in:
Sorry to post more than one comment. I would go back and edit my original if I could….
The reason that C++ is taking longer here is that the runtime environment of C++ is more complex. Not a LOT more complex, but it is more complex. C++ has global constructors and destructors that need to be executed on program startup and shutdown. Additionally, the compiler needs to track which global destructors need to be called because (in the case, for example, of local ‘static’ variables inside functions) which ones need to be called can only be determined at runtime. This requires a global data structure that’s initialized on program startup and scanned on program shutdown.
Additionally, there will be some overhead required to set up the exception handler of last resort.
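The “determined at runtime” bookkeeping shows up most clearly with function-local statics; a small illustrative sketch (the names are mine, not library symbols):

```cpp
// A function-local static is constructed on first call; under the Itanium
// C++ ABI its destructor is registered at that moment (via __cxa_atexit),
// which is the runtime bookkeeping described above. Names are illustrative.
static int constructed_count = 0;

struct Tracked {
    Tracked() { ++constructed_count; }
    ~Tracked() {}  // registered with the runtime only when the ctor runs
};

Tracked& instance() {
    static Tracked t;  // guarded, lazily constructed on first call
    return t;
}
```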
I have a hello world written in C++ that will execute faster than C, but it requires passing lots of compiler options to turn off the compiler’s setup of the C++ runtime environment. It would be possible to duplicate this program in C, but it would be challenging, especially with the quality of error handling it’s possible to achieve using my library:
I guess my response would be “duh”. With C you have very little object code and a single call to the printf function in a statically loaded library. With C++ you have substantial startup overhead loading the iostream library and all the modules it depends on. Address allocation takes time and it’s going to prepare for all the possible dynamic libraries you might load as well.
Menotyou says:
The comments are better than the article, IMO.
Schrodinbug says:
Iostreams are known to have performance issues, so this isn’t earth-shattering news. I’m glad you mention std::format. libfmt, which it evolved from, has a function called fmt::print… I guarantee fmt::print(“hello world\n”); will not just be as fast as printf, but faster, especially if there’s a lot of formatting to be done. This is because it can do some of the formatting work at compile time. And it’s typesafe, so no having to worry about and remember the gazillion printf variations. It’s freaking amazing. The print function didn’t make it into C++20, but I believe it’s being pushed for in C++23.
Cal Gray says:
The title should be “C++ streams are slower than printf”, which is a known fact, as streams favor versatility over performance. Streams are significantly different from print functions since formatting is stored as state within the stream object, which takes time to construct and destruct once per program lifetime.
sl2 says:
How about a test where you simply output the exact same code compiled in C and C++? This would account for the apples-to-oranges comparison.
Also, store the time before and after in µs, or a GetTickCount() before and after the call, and output this data at the end. This would account for the startup lib / runtime difference.
Amin Yahyaabadi says:
I suggest you use libfmt instead. It is safer and faster. Also, note that the newer C++ standard has replaced iostreams with better alternatives. If you are micro-optimizing, you should consider these details.
Roman Avtukhoff says:
Because used std::
Marcos says:
It would be more accurate if you printed several thousand lines in a loop. The execution time of printing a single line could easily be confused with loading and startup time. Then, there is also the flushing. You want to make sure that you are flushing the same number of times.
Lastly, given the object oriented nature of C++, it would make sense to turn on optimisations.
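A sketch of that methodology: writing into a string stream isolates the repeated-write cost from terminal I/O and startup time (30000 mirrors the post’s multi_hello example; the count here is arbitrary):

```cpp
#include <sstream>
#include <string>

// Emit n copies of the line into one in-memory buffer; per-line cost then
// dominates one-time startup cost when n is large (e.g. 30000, as in the
// post's multi_hello example).
std::string many_hellos(int n) {
    std::ostringstream os;
    for (int i = 0; i < n; ++i) {
        os << "hello world\n";  // no flush per line
    }
    return os.str();
}
```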
The programs are compiled with optimization (-O2).
Tim Parker says:
As long as you’re not trying to measure the performance the languages can offer under GCC, that might be adequate. If you want to try to replicate what typical release production code would do, then it probably isn’t (partially depending on the functionality being used).
Jimmy Ellison says:
You could throw in some dirty inline assembly lines involving kernel syscall to improve performance.
Raffaello Bertini says:
Doing a benchmark of such small times can be tricky, if only because of the various caches.
It is not a good test either, as a 1 ms difference (in one run?) while doing no data processing at all has no significance, nor is it a reason to use language A rather than B.
No one in the world would be interested in investing to save 1 ms in a program that does nothing, because it solves no problems (but actually it creates one 😂).
If you want a real comparison of some small routines (the one used for this test isn’t, but it could be), and if you can’t profile them, the way to go is to look at the generated assembler code.
Then, on that basis, one can do an analysis, test, benchmark, and write some conclusions.
Besides, focusing on a real problem would help make this kind of C/C++ comparison better.
Raffaello Bertini says:
One small improvement on this test could be to take out the “net weight” by computing the “tare”:
Run both programs with an empty main.
Then run hello world and compute the net execution time of the hello world.
The tricky part here is whether the include statements should be present or not.
Besides, the expectation should be that the two empty programs take the same time to run;
otherwise it implies that hello world itself isn’t faster or slower, but that there is some bootstrap overhead.
Anyway.
I Enjoyed the post.
Thanks
Jonathan says:
Incorrect. This isn’t valid performance analysis. In fact, using GCC 11.2 with -Ofast, std::cout is 1.1x faster. At -O3, they are the same.
As the post has been significantly re-written, the emphasis of the slow-down altered, and a number of the criticisms folded into the text as original text, it might be nice to address more of that in replies/updates to the comments and/or acknowledge it in the new article text as appropriate.
This has been done in a couple of cases, but not all, and this puts the revised text at odds with the historical comments.
The issue of relevance and suitability of the micro-benchmark as-is is not really dealt with either (e.g. if the absolute time was important you would profile, adjust, iterate – if it’s not important, it’s not important), but that’s another matter.
Thanks. Unfortunately people repeatedly proposed alternative explanations, without running benchmarks themselves nor accounting for the critical point that my post makes: the speed dramatically increased after statically linking. I have added a paragraph to acknowledge these additions as you suggested. I do thank the various readers for their proposals but I am not going to answer point-by-point dozens of closely related comments.
Regarding the relevance of the benchmark, I have explained and re-explained it at length. For long running processes, the issue has always been irrelevant, but if you have short running processes (executing in about a millisecond or less), then you may be spending most of your time loading the standard library. You may not care, of course… but it is useful to be aware. There are solutions such as static linking, but there are tradeoffs.
What this shows (and really all this shows) is that the C++ library is a lot bigger than the C library. If you are not using the capability it provides, it is costing you performance.
On the other hand, if you have significant work on a complex problem, you can have better performance because the library and the language provides facilities that would be difficult and expensive to write in C.
Also, using printf instead of std::cout does not seem to help C++.
It does, if you remove the #include <iostream> header.
C is not functional at all. C is procedural.
I can show you C++ programs that run rings around their C counterparts.
Please will you mail me some samples at [email protected]
When problems are large or complex, the OO C++ features simplify your code to a very large extent.
Functions are fine, but associating them with the proper data is cumbersome in C, simple and scalable in C++, based on Classes, their extensions or generalization, their relationships, and their instances. Abstraction is the reason why C++ was created, and it delivers that, hence the power and simplicity of its code.
Real “Functional Programming” isn’t supported by languages as basic as C. Consider exploring languages that are built for Functional Programming, they would give you more power in a world you already like.
Isn’t this, to some extent, testing the streaming IO part of the STL in C++, instead of the language itself? For what it’s worth, std::cout and std::endl probably does more (like flushing the cache) than printf under the hood, which could potentially account for the 1ms increase in execution time.
It is a well established fact that C++ does not provide a zero overhead abstraction unfortunately.
Note that many features of C++ in fact do provide (more or less) zero-overhead abstractions.
I think a fair comparison would be to do like so:
#include <cstdlib>
#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false);
    std::cout << "hello world\n";
    return EXIT_SUCCESS;
}
Can you try the benchmark with this C++ implementation?
I have a concern about your conclusion here — not that it’s necessarily wrong, but that this test is incomplete. Specifically, this test does nothing to differentiate between execution time and function call time.
If we’re looking at 1 ms overhead every time you print to console, I’ll grant that’s significant. But if we’re looking at 1 ms per execution? I can’t rightfully agree with your conclusion that this is significant. Yes, granted, we’re talking about a 200% increase in the execution time for Hello World, but in 2022, I cannot think of a real-world situation where anyone would be executing hello-world equivalent software with such frequency that it creates a cpu bottleneck. Not even in the embedded space.
I haven’t tested it yet (I might), but my guess is the performance difference you’re seeing takes place in loading the module, and if you were to print to console 10,000 or 100,000 times per execution, you’d still be looking at about a 1 ms difference per execution. I’m basing this guess on the fact that we’re seeing such a significant performance increase in the statically linked c++ version and the knowledge that in a Linux environment, there’s some decent chance that stdio.h is preloaded in memory while iostream is not.
Obviously, my hunches are not data, and more testing is required before we draw any conclusions here.
The other question I have is whether you’re running hyperfine with the -N flag. Without it, on processes this short, it’s kicking the following warning at me:
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Which seems potentially relevant.
I might be back later with followup results.
std::endl is slower than “\n”; you should try it again to see if it makes any difference.
Try removing stdlib in both programs. Return 0 instead. Also use \n in the cpp program instead of endl. Would be interested in seeing the results of that
There is a difference in your C++ code as opposed to C code, and that is the std::endl statement, which flushes stdout. There is no flushing in the C code. For the code to be equivalent, the C++ statement should be
std::cout << "hello world\n";
Perhaps differences in lib and not lib loading.
https://twitter.com/oschonrock/status/1557092072540307456
I’m not a professional C or C++ dev but I still remember a few basics from the time I studied physics at my local university (we had C/C++ lectures).
Both endl and cout have side effects. You compare two pieces of code that don’t do the same thing. You should not expect them to run equally fast.
There are ways to reduce the side effects like NOT using endl or using ios_base::sync_with_stdio(false).
https://godbolt.org/ helps a lot if you want to know more details.
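As a small sketch of the sync_with_stdio suggestion above (the wrapper function name is just illustrative): the call returns the *previous* synchronization state, so it can be checked.

```cpp
#include <iostream>

// Decouple the C++ standard streams from C stdio so std::cout can
// buffer independently. Returns the previous sync state (true on the
// first call, since streams start out synchronized).
bool disable_stdio_sync() {
    return std::ios_base::sync_with_stdio(false);
}
```

Note that this should be called before any I/O is performed; the effect of calling it later is implementation-defined.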
I’ve done some followup testing. It appears that my concerns with the methodology were unfounded, but I have since seen some other critique of your methodology that I have not explored.
You can see my changes to your code and references to the additional critique on my github (https://github.com/cassieesposito/Code-used-on-Daniel-Lemire-s-blog-2022-08-09)
In your updated C++ code (multi_hello.cpp), you should also replace std::endl with “\n” as previously suggested here. I suspect this may have a much larger impact on the results due to flushing after each print for 30000 iterations.
Interested in seeing updated results!
Hello Mr. Lemire,
IMHO, the comparison of those two snippets isn’t very fair, as the C++ code does a bit more than the C code:
Streaming std::endl does not only stream a ‘\n’, it also flushes the stream (https://en.cppreference.com/w/cpp/io/manip/endl).
To make the two programs more comparable, you should either replace the C++ streaming with
std::cout << "hello world\n";
or add a
fflush(stdout);
to the C program.
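A minimal sketch of the second option (the function name is just illustrative; writing to a caller-supplied FILE* keeps it testable):

```cpp
#include <cstdio>
#include <cstring>

// Write the greeting and flush explicitly, mirroring the extra work
// that streaming std::endl performs on top of the newline.
void hello_with_flush(std::FILE* out) {
    std::fputs("hello world\n", out);
    std::fflush(out);
}
```

Calling it with stdout reproduces the flushing behaviour of the C++ version.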
In my tests, both hellocppstatic and hellocppfullstatic were faster than helloc, with both of these changes, hellocpp was slower. However, as my machine wasn't completely idle, these results may be inaccurate.
But let's go a step ahead:
If you omit the printf / flush / cout streaming, just leaving the "return EXIT_SUCCESS" (and the includes), the C++ program will most probably be slower. This is because of the static initialization of std::ios_base (std::ios_base::Init::Init() gets called on program startup as soon as <iostream> gets included).
It’d be interesting to see the results after removing this include, as the object code of the hello.c and hello.cpp should be totally equal.
Best regards
– Mark
“This is because of the static initialization of std::ios_base (std::ios_base::Init::Init() gets called on program startup as soon as <iostream> gets included)”
This. Static initialization and destruction happen if the iostream header is merely included, even if it is never used. Using stdio.h and printf instead of iostream gives you exactly the same assembly output in both languages. Latest GCC release output:
.LC0:
.string "Hello world"
main:
sub rsp, 8
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
But yeah, overall I think this is a good example that all the features in C++ over C are not here for free. You have to understand how using your libraries, your code (of course;) and sometimes even how compilers work, if optimizing CPU usage is your priority one.
This is such a beautiful example of measuring something and yet understanding almost nothing about what the measurements mean. I shall be using this as an example for our new starters on the pitfalls of premature optimisation and the importance of meaningful test structures and data.
This is based on biased info from decades ago.
99.99999% of C++ programs used for professional applications in this world do not use standard out (or err) to convey runtime status.
C++ apps are easier to develop than C apps and have richer features, so I’m not sure what you are driving at.
Oh, C++ apps are oftentimes deployed in embedded (or server) environments…. where there is definitely no I/O to a terminal.
I suspect this article was written by a troll.
The blog post is specifically about “hello world”.
If you mean to refer to large programs, then I agree, but it happens often enough that we have to run small commands that only do a few microseconds of work.
You’re not measuring what you think you are.
Did you try passing -fno-exceptions and -fno-rtti? That may impact your numbers as well.
The C++ program is doing more work than the C program.
You should avoid using `std::endl` unless you specifically intend to flush the buffers explicitly. There’s nothing wrong with using a simple newline character.
But also, IO streams are known to be measurably slower than printf. Especially since it has hidden global constructors and destructors.
std::format is the new modern way to write formatted strings.
So, it’s not really that “hello world is slower in C++”; it’s that the methods you’ve chosen to perform the task in C++ are by nature slower (but offer better type safety and internationalization capabilities).
For the simple task of printing “hello world”, honestly you should just use puts.
GCC does remove printf() and inserts puts() https://gcc.godbolt.org/z/dcx4Tz4WK
That is why it is so fast.
<< std::endl inserts a newline AND flushes the stdout buffer, which I don't believe printf() does.
It would be interesting to see the comparison without << std::endl, since flushing the buffer is a relatively costly operation, it should give you a better apples to apples comparison. I'm no expert though.
give it a try without std::endl
https://youtu.be/GMqQOEZYVJQ
This is not accurate: std::endl also includes a flush, which is not necessary here, and adds unnecessary time. You could just as easily have used “\n” in the C++ version, the same way you did in the C version.
cout does lots of things you should know about.
iostreams are not a minor bit of infrastructure.
If you want to compare program startup time, use printf in the C++ version as well.
You should be able to look at the assembly output to make a good comparison. That’s a better view of what’s happening and why
There is no difference between std::endl and ‘\n’ because std::cout is flushed at the end of the application.
IMHO it is all about linking with libstdc++. In the first version of the code I only replaced the std::cout… line with the printf line from the C version (without changing includes or linking directives) and the results for C++ did not change on my computer.
I ran a perf record/report on that version and unlike C, at least 30% time was being lost on locale functionality. My guess is not linking to libstdc++ removes underlying C++ locale functionality from printf.
Measurements were on my 10 year old machine.
I wonder what will change if we link with/to clang/libc++ though.
Hello Lemire,
in C++, operations are synchronized to the standard C streams after each input/output.
According to cppreference (https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio), sync_with_stdio may reduce the penalty:
If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.
std::ios::sync_with_stdio(false);
std::cout << "hello world" << std::endl;
When used correctly (specifically, with `std::ios_base::sync_with_stdio(false);`), cout is in fact much faster than printf:
https://stackoverflow.com/questions/31524568/cout-speed-when-synchronization-is-off
I’m glad neither I, nor my children, attended the University of Quebec if this is how professors spend their time. You conclude that:
“.. if these numbers are to be believed, there may a significant penalty due to textbook C++ code for tiny program executions, under Linux.”
then, in a later comment response, state:
“The blog post is specifically about “hello world”.”
If it’s the latter, then the former conclusion is invalid. You cannot infer that tiny programs under Linux will perform slower, using C++ rather than C, on the basis of a one line example where the method used is different.
There are multiple comments addressing the specifics of the differences, and reasons for them, but, if I were you, I’d take this blog post down as it makes you look foolish.
I have a lot of respect for your work, so this blog post is quite baffling & saddening. What exactly are you getting at, or aiming for?
”there may a significant penalty due to textbook C++ code for tiny program, under Linux.”
-BS, & you’re comparing apples to oranges. Read up on what cout actually does. Is your printf thread safe? (You can turn off sync_with_stdio for the std streams if you want that monster to be faster.) std::printf is also maybe worth mentioning.
This code doesn’t show C++ being slower than C.
Rather, this is “iostream with stdio sync on printing two strings” being slower than “printf for the trivial case of a string”. No news here.
Everyone else has already mentioned how flawed this is.
But a better test would be to compare two computationally intensive algorithms or generics, written properly in each language.
These days you should use `std::format` (https://en.cppreference.com/w/cpp/utility/format/format).
Its optimization and compile-time logic should beat printf.
Students using C++ streams in programming contests are hammered to preface main with
[code]
ios_base::sync_with_stdio(false);
cin.tie(nullptr);
cout.tie(nullptr);
[/code]
. . . in order to achieve printf/scanf performance.
https://stackoverflow.com/questions/31162367/significance-of-ios-basesync-with-stdiofalse-cin-tienull
Thanks wqw! I was aware of sync_with_stdio but I’ve never seen tie before.
It’s always a pleasure to learn something I could use someday 🙂
What you have discovered is just the tip of the craptastic bloatberg that is every other language not C.
If all you write is hello world then all you need is C.
Only that we are not in 1992. C is quite useless for user mode apps nowadays and no one creates console apps except Linux freaks that have nothing else to write.
This is a micro-benchmark that illustrates a simple point. I do not believe Daniel is going after any massive generalizations.
Oh, and all the comments about flushing the I/O buffer… a moment of thought should have told you the examples were equivalent. While it has been a couple of decades since I dug into runtime libraries, I’m pretty sure every runtime must flush buffers on program exit.
Put differently…
Did you see the output?
Then the runtime library flushed output buffers on exit.
Yes, loading dynamic libraries is more expensive. Often this does not matter, but sometimes it can be significant. There is or should be a savings in memory used (across multiple programs using the same libraries), and this can sometimes be significant.
The savings from shared dynamic libraries was critical in the Win16 era, and for some time after. In present many-gigabyte machines, rather less so. (In this century, have tended to use static libraries more often than dynamic.)
The C printf() and stdio library was honed decades ago on much leaner machines, and (as you might expect) is lean and efficient. If you dig back into the USENET archives, you can find a period (late-1980s / early 90s?) where there was a bit of a public competition to see who could come up with the leanest stdio library. That code ended up in compiler runtime libraries, and I strongly suspect survives to the present (and offers examples of hyper-optimization).
The C++ standard streams library arrived on fatter machines, and never received such attention (in part as you can use C stdio).
Daniel’s experiment matches well with history.
“This is a micro-benchmark that illustrates a simple point. I do not believe Daniel is going after any massive generalizations.”
With respect, the claim was made in the article that “.. if these numbers are to be believed, there may a significant penalty due to textbook C++ code for tiny program executions, under Linux.”
Disregarding the strict meaning of ‘may’ – which would make the whole statement a semantic null – this is (IMO) quite a massive over-generalization. It is a micro-benchmark, and a poorly considered and written one at that, and there is effectively no meaningful generalization possible from it, as has been pointed out by many in these comments. That the author subsequently states that this was specifically about “hello world” is not properly reflected in the main text, even now.
It also seems to expose a deep lack of knowledge, not only of what the programs are doing and what the objects and functions are designed for – along with their benefits and deficits – but also of C++.
I’ve seen renderers and whole micro-kernels constexpr’d – which is harder to do in C – with enormous performance benefits, but that’s not the point, nor is it necessarily a reason to choose one language over the other. They were particular implementations, for particular purposes, but they do demonstrate aspects of a language that could be useful in many situations, and which should not be over-generalized from. This is the most egregious issue for me: the apparent attempt to classify language suitability on a frankly meaningless code snippet which is hardly an example of any useful real-world program. This is something we try to stamp out in even the newest of starters, and from a professor of computer science it seems quite ridiculous to me. YMMV obviously.
Can you do a test to show that for small programs similar to “Hello World” but not necessarily the same C++ runs as fast as C if not faster? This would settle the issue. Wouldn’t it?
At an extreme, you could try something like this
https://onecompiler.com/cpp/3wdmzd9js
(or try Googling for ‘constexpr fibonacci’)
Re-working that as C should give an indication of what can be done, but – like the article – it’s really missing the point, and I could probably equally well make a C++ version that is far worse **.
One of the main reasons that individual, micro-benchmarks like this aren’t useful for answering questions like “is language X faster than language Y ?” is that the question is completely meaningless.
What we can do is ask: for my particular problem space *and* my typical data sets / operating conditions, what would a good choice of language and strategy be? If, for example, you were designing an ultra-high-speed / low-latency peer-to-peer message-passing system, you probably wouldn’t choose Python. However, if you wanted to implement a simple peer-to-peer client-server application, then Python, with its interpreted nature and rich library support, would make such a thing relatively trivial. It’s exactly these sorts of evaluations that should be driven into programmers, and first-year computer science students in Quebec and elsewhere, from day one.
Using massively simplified, and atypical, noddy code fragments -especially when naively implemented – is not really helpful or instructive, and mainly serves to teach people bad coding practice and poor performance analysis techniques IMO.
** These are poor Fibonacci number generators, so don’t use them in any performance-sensitive regime; they’re just an example 🙂
It is still being honed. memchr, for instance, uses SSE2 instructions on x86-64 machines. These instructions became available only long after both C and C++ gained widespread adoption. memchr beats std::find:
https://gms.tf/stdfind-and-memchr-optimizations.html
Glibc is much more optimized than libstdc++ simply because it is much smaller, and therefore developers can devote more time to optimization.
The truth is that abstractions come at a cost of complexity and size which makes it harder to optimize. “Zero-cost abstractions” may be true in a few cases, but there will always be cases that are too hard or time-consuming to look into. It is a simple matter of tradeoffs.
You didn’t provide what compiler you used. I assume it was gcc. In gcc “printf” is one of the built-in functions. This means there is no library involved at all (neither dynamic nor static). It’s practically part of the language and the #include is just for syntax reasons.
I didn’t read the internals, but I assume that gcc doesn’t call a classical printf at all but optimises it at the compiler level, e.g. doing the formatting at compile time and using the ‘write’ syscall directly.
I use a straight Ubuntu 22 and the Makefile is provided (see links), so yes: gcc.
You make a good point regarding printf.
This program will run faster than any C program written using the standard qsort library function:
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>
using namespace ::std;
constexpr int ipow(int base, int exp)
{
if (exp == 0) {
return 1;
} else if (exp < 0) {
return 0;
} else if (exp % 2) {
return base * ipow(base, exp - 1);
} else {
return ipow(base * base, exp / 2);
}
}
int main()
{
vector<int> foo(ipow(2,30));
random_device rd; //Will be used to obtain a seed for the random number engine
mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd()
uniform_int_distribution<int> dis;
generate(foo.begin(), foo.end(),
[&gen, &dis]() {
return dis(gen);
});
cerr << "Sorting.\n";
sort(foo.begin(), foo.end());
return is_sorted(foo.begin(), foo.end()) ? 0 : 1;
}
It is true that C++ has many advantages over C as far as algorithmic implementation goes.
Your program allocates gigabytes of memory and sorts it. If you reduce the task to sorting 12 numbers, the answer might be different, and that’s the motivation of my blog post.
It isn’t that the implementation of sort is better in C++, it’s that you can’t reasonably make a version of the qsort function in C that runs faster than sort in C++.
And this is because the sort function in C++ is a template, and the compiler essentially writes you a custom one for the data structure and comparison function you’re using in which the comparison and swap functions are inlined into sort and then subjected to aggressive optimizations.
Making this happen in C would require macro magic of the highest order, and even then would probably be a huge pain to use correctly.
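A minimal side-by-side sketch of that point (the function names are just illustrative): qsort sees the comparator only as an opaque function pointer taking void*, while std::sort is instantiated with the comparator's type visible, so the comparison can be inlined.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C-style comparator: called through a function pointer with void*
// arguments, which qsort cannot inline through its opaque interface.
static int cmp_int(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

std::vector<int> sort_c_style(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
    return v;
}

std::vector<int> sort_cpp_style(std::vector<int> v) {
    // The lambda's type is part of this std::sort instantiation, so the
    // comparison is typically inlined into the generated sort code.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
    return v;
}
```

Both produce the same result; the difference is in what the optimizer can see.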
Your “Hello world” case reads like a general criticism of C++, when I strongly suspect that C++ is faster than C in most cases because of things like I just mentioned. So, it seems like a criticism that’s narrowly tailored to make a point that I don’t think is particularly accurate.
Eric, you are interpreting my post to say something I do not say (C++ is slow).
I have written and co-written several high performance projects in C++. E.g., please see https://github.com/simdjson/simdjson
After an extended play with this, I would say that the library/flushing issues that most commenters raised aren’t anything to worry about. All the significant differences in timings seem to be due to the dynamic linker.
C code that dynamically links to libc takes ~240µs, which goes down to ~150µs when statically linked. A fully dynamic C++ build takes ~800µs, while a fully static C++ build is only ~190µs. Across all of these, the difference between printing one “hello world” vs 1000 is only ~20µs.
Getting good timings was the hardest thing here! Code/analysis are in:
https://github.com/smason/lemire-hello
Sorry to post more than one comment. I would go back and edit my original if I could….
The reason that C++ is taking longer here is that the runtime environment of C++ is more complex. Not a LOT more complex, but it is more complex. C++ has global constructors and destructors that need to be executed on program startup and shutdown. Additionally, the compiler needs to track which global destructors need to be called because (in the case, for example, of local ‘static’ variables inside functions) which ones need to be called can only be determined at runtime. This requires a global data structure that’s initialized on program startup and scanned on program shutdown.
Additionally, there will be some overhead required to set up the exception handler of last resort.
I have a hello world written in C++ that will execute faster than C, but it requires passing lots of compiler options to turn off the compiler’s setup of the C++ runtime environment. It would be possible to duplicate this program in C, but it would be challenging, especially with the quality of error handling it’s possible to achieve using my library:
My library: https://osdn.net/projects/posixpp/scm/hg/posixpp
(Github mirror): https://osdn.net/projects/posixpp/scm/hg/posixpp
Link to hello world program written using my library: https://osdn.net/projects/posixpp/scm/hg/posixpp/blobs/tip/examples/helloworld.cpp
I guess my response would be “duh”. With C you have very little object code and a single call to the printf function in a statically loaded library. With C++ you have substantial startup overhead loading the iostream library and all the modules it depends on. Address allocation takes time and it’s going to prepare for all the possible dynamic libraries you might load as well.
Comments are Better than the article IMO
Iostreams are known to have performance issues so this isn’t earth shattering news. I’m glad you mention std::format. libfmt which that evolved from has a function called fmt::print…I guarantee fmt::print(“hello world\n”); will not just be as fast as printf, but faster. Especially if there’s a lot of formatting to be done. This is because it can do some of the formatting work at compile time. And it’s typesafe, so no having to worry and remember the gazillion printf variations. It’s freaking amazing. The print function didn’t make it to c++20, but I believe it’s being pushed for in c++23.
The title should be “C++ streams are slower than printf” which is a known fact as streams favor versatility over performance. Streams are significantly different to print functions since formatting is stored as state within the stream object and takes time to construct and destruct once for the program lifetime.
How about a test where you simply output the exact same code compiled in C and in C++? This would account for the apples-to-oranges comparison.
Also, store the time (in µs, or via a GetTickCount()) before and after the call, and output this data at the end. This would account for the startup lib/runtime difference.
I suggest you use libfmt instead: it is safer and faster. Also, note that newer C++ standards have added better alternatives to iostreams. If you are micro-optimizing, you should consider these details.
Because you used std::
It would be more accurate if you printed several thousand lines in a loop. The execution time of printing a single line could easily be confused with loading and startup time. Then, there is also the flushing. You want to make sure that you are flushing the same number of times.
Lastly, given the object oriented nature of C++, it would make sense to turn on optimisations.
The programs are compiled with optimization (-O2).
As long as you’re not trying to measure the performance the languages can offer under GCC, that might be adequate. If you’re wanting to try to replicate what typical release production code would do, then that’s probably not (partially depending on the functionality being used).
You could throw in some dirty inline assembly lines involving kernel syscall to improve performance.
Doing a benchmark of such small time can be tricky, even only for various caches.
It is not a good test either, as a 1 ms difference (in one run?) while doing no processing of data at all has no significance, nor is it a reason to use language A rather than B.
No one in the world would be interested in investing to save 1 ms in a program that is doing nothing, because it solves no problem (but actually it creates one 😂).
If you want a real comparison of some small routines (the one used for this test isn’t one, though it could be), and you can’t profile them, the way to go is to look at the generated assembly code.
Then, based on that, you can do an analysis, a test, a benchmark, and write some conclusions.
Besides focusing on a real problem, this kind of C/C++ comparison could be improved:
one small improvement on this test would be to take out the “net weight” by computing the “tare”:
run both programs with an empty main,
then with hello world, and compute the net execution time of the hello world part.
The tricky part here is whether the include statement should be present or not.
Besides, the expectation is that the two empty programs take the same time to run;
otherwise it implies that hello world itself isn’t faster or slower, but that there is some bootstrap overhead.
Anyway.
I Enjoyed the post.
Thanks
Incorrect. This isn’t valid performance analysis. In fact, using GCC 11.2 with -Ofast, std::cout is 1.1x faster. At -O3, they are the same.
https://quick-bench.com/q/lGltfiZ439DZGuBm1yc_GEy2TYQ
As the post has been significantly re-written, the emphasis of the slow-down altered, and a number of the criticisms folded into the text as original text, it might be nice to address more of that in replies/updates to the comments and/or acknowledge it in the new article text as appropriate.
This has been done in a couple of cases, but not all, and this puts the revised text at odds with the historical comments.
The issue of relevance and suitability of the micro-benchmark as-is is not really dealt with either (e.g. if the absolute time was important you would profile, adjust, iterate – if it’s not important, it’s not important), but that’s another matter.
Thanks. Unfortunately people repeatedly proposed alternative explanations, without running benchmarks themselves nor accounting for the critical point that my post makes: the speed dramatically increased after statically linking. I have added a paragraph to acknowledge these additions as you suggested. I do thank the various readers for their proposals but I am not going to answer point-by-point dozens of closely related comments.
Regarding the relevance of the benchmark, I have explained and re-explained it at length. For long running processes, the issue has always been irrelevant, but if you have short running processes (executing in about a millisecond or less), then you may be spending most of your time loading the standard library. You may not care, of course… but it is useful to be aware. There are solutions such as static linking, but there are tradeoffs.
What this shows (and really all this shows) is that the C++ library is a lot bigger than the C library. If you are not using the capability it provides, it is costing you performance.
On the other hand, if you have significant work on a complex problem, you can have better performance because the library and the language provides facilities that would be difficult and expensive to write in C.