Interestingly, if I switch from using std::cin to using fread(3) on stdin, I get speeds closer to 2.6 GB/s on my Intel MacBook Pro running Catalina. Using std::cin is extremely slow. Using read(2) instead of fread is a tad faster.
I would also recommend using write to maximize write throughput in case that’s the new bottleneck (the overhead of iostream varies per platform but is almost always observably bad…)
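For readers who want to try the read(2) approach described above, here is a minimal sketch. The function names `drain_fd` and `gigabytes_per_second` and the 64 KiB default buffer are assumptions made for illustration; they are not taken from the blog post's source code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <unistd.h>  // read(2)

// Drain a file descriptor with read(2) and return the number of bytes seen.
// buffer_size is an arbitrary choice for this sketch, not a tuned value.
std::uint64_t drain_fd(int fd, std::size_t buffer_size = 1 << 16) {
    assert(buffer_size > 0);
    std::vector<char> buffer(buffer_size);
    std::uint64_t total = 0;
    for (;;) {
        ssize_t n = ::read(fd, buffer.data(), buffer.size());
        if (n > 0) {
            total += static_cast<std::uint64_t>(n);
        } else if (n == 0) {
            break;  // end of file: the writer closed its end
        } else if (errno != EINTR) {
            break;  // real error: stop and report what was read so far
        }
    }
    return total;
}

// Convert a byte count and an elapsed time in seconds to GB/s (1 GB = 1e9 bytes).
double gigabytes_per_second(std::uint64_t bytes, double seconds) {
    return seconds > 0 ? static_cast<double>(bytes) / seconds / 1e9 : 0.0;
}
```

To measure a pipe, one could call `drain_fd(STDIN_FILENO)` between two `std::chrono::steady_clock::now()` readings and pass the byte count and elapsed seconds to `gigabytes_per_second`.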
Mike Hurwitz says:
I ran your tests and was able to average ~3GBps using cpispeed, though only 0.02GBps using pipespeed. The previous poster’s comment seems appropriate.
I threw together a quick test in Go (my language of choice) to see what kind of throughput I could get. With 4MB buffers I was seeing ~3.9GBps without cleaning my environment at all (Chrome running, etc.).
Just for fun, I also put pv between the emitter and collectors in both your tests and mine. I chose pv because it’s a very common C-based tool that handles pipes. I saw a measurable but fairly slight drop in both benchmarks with pv in the middle. I guess that shows that pv is using one of the more efficient APIs rather than std::cin.
yes. using system api is much faster! i did some experiments a while ago with javascript and you can achieve these same speeds too: https://just.billywhizz.io/blog/on-javascript-performance-02/. the problem here is a lot of the time is being taken up by syscalls and the context switching into the kernel.
i think it would be possible to go (much) faster if we could do something entirely in userspace with, for example, io_uring on linux? https://unixism.net/loti/
I made a Rust version. I get about 5 or 6 GB/s on Linux (Fedora 34 on a Thinkpad T15). I can get over 7 GB/s piping straight into pv though, so my reader must be the bottleneck.
Alex says:
I don’t have a Mac with dev tools at hand to verify, but some versions of the C++ standard library generate very inefficient code in debug builds.
I wonder if you will get better results by adding -O in there.
Florian Lemaitre says:
At some point, I was using pipes to transfer a raw video stream from raspividyuv to my program, but the pipe throughput was too low to be processed in realtime.
So I tried replacing the pipe with a UNIX socket (replace pipe with socketpair) and the speedup was impressive: from 200 MB/s to 700 MB/s on a Raspberry Pi 3.
Apart from the code creating the “pipe” nothing was changed, and in particular, the reading and writing code were exactly the same.
This made me wonder: why is a socket faster than a pipe?
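The swap described in this comment can be sketched as follows. The helper names `make_channel` and `round_trip` are hypothetical; the point is only that the creation call changes while the read()/write() code stays the same. The sketch does not explain the speed difference, which likely depends on the kernel and its buffer sizes.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>  // socketpair(2)
#include <unistd.h>      // pipe(2), read(2), write(2), close(2)

// Create a one-way byte channel: fds[0] is the read end, fds[1] the write end.
// With use_socket == true a UNIX stream socket pair stands in for the pipe.
bool make_channel(bool use_socket, int fds[2]) {
    if (use_socket) {
        return ::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == 0;
    }
    return ::pipe(fds) == 0;
}

// Push a short message through the channel and check that it arrives unchanged.
// The reading and writing code is identical for both kinds of channel.
bool round_trip(bool use_socket) {
    int fds[2];
    if (!make_channel(use_socket, fds)) return false;
    const char msg[] = "frame";
    char out[sizeof msg] = {};
    bool ok = ::write(fds[1], msg, sizeof msg) == static_cast<ssize_t>(sizeof msg)
           && ::read(fds[0], out, sizeof out) == static_cast<ssize_t>(sizeof out)
           && std::memcmp(msg, out, sizeof msg) == 0;
    ::close(fds[0]);
    ::close(fds[1]);
    return ok;
}
```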
Antoine says:
This is probably the C++ IO APIs showing their inadequacy. Even using Python you can probably achieve more than that (sorry, I don’t have a reproducer to submit :-)).
Element14 says:
20 years ago, when I was still in high school, I dabbled in competitive programming a bit. Back when g++ was still version 3.x, it was a common pitfall to use #include <iostream> for anything that involved heavy IO. Programs would literally run out of time just reading input.
It seems that in some implementations of iostream the issue is still here. At any rate, there is so much “magic” in the C++ standard library that using fread (or better yet just the POSIX read()) would give much more accurate results if one is trying to measure the performance of OS pipes.
Julian says:
By the way, since you’re comparing with read(2) already, I notice that using vmsplice(2) on Linux immediately triples my results.
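As a rough illustration of the vmsplice(2) idea, here is a Linux-only sketch of the writing side. The function name `send_to_pipe` and the fallback to write(2) are assumptions of this sketch, not the commenter's code. Because the kernel may reference the caller's pages rather than copy them, it is safest not to modify the buffer until the reader has consumed it.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>    // vmsplice(2): Linux-only, needs _GNU_SOURCE (g++ defines it by default)
#include <sys/uio.h>  // struct iovec
#include <unistd.h>   // write(2)

// Send len bytes to the write end of a pipe, preferring vmsplice(2).
// If vmsplice fails (for example because the kernel or sandbox lacks it),
// the remaining bytes go through plain write(2) instead.
// Returns the number of bytes sent, or -1 on error. EINTR is not retried here.
ssize_t send_to_pipe(int pipe_fd, const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::size_t done = 0;
    bool use_vmsplice = true;
    while (done < len) {
        ssize_t n;
        if (use_vmsplice) {
            struct iovec iov;
            iov.iov_base = const_cast<char*>(data + done);
            iov.iov_len = len - done;
            n = ::vmsplice(pipe_fd, &iov, 1, 0);
            if (n <= 0) {
                use_vmsplice = false;  // fall back for the rest of the transfer
                continue;
            }
        } else {
            n = ::write(pipe_fd, data + done, len - done);
            if (n <= 0) return -1;
        }
        done += static_cast<std::size_t>(n);
    }
    return static_cast<ssize_t>(done);
}
```

The reader needs no changes: it can keep calling read(2) on the other end of the pipe.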
I would be curious how the Windows pipe would compare. boost::process includes a simple pipe implementation that is cross platform for Windows and Linux.
This is a libc++ issue. On Ubuntu 21.04, when I compile with GCC 10.3, I get about 2.7-3.0 GB/s for both variants (cin and read). When I compile with Clang++ 11 using libstdc++, I get similar numbers. But when I compile with clang++ -stdlib=libc++, I get those 0.1 GB/s vs 2.5 GB/s numbers. So the problem is the quality of implementation (QoI) of libc++.
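When comparing numbers like these, it can help to have the benchmark report which standard library it was built against. A minimal sketch, relying on the version macros that libc++ (`_LIBCPP_VERSION`) and libstdc++ (`__GLIBCXX__`) define; the function name `standard_library_name` is made up for this example.

```cpp
#include <cassert>
#include <string>  // including any standard header makes the library's version macro visible

// Name of the C++ standard library this translation unit was compiled against.
std::string standard_library_name() {
#if defined(_LIBCPP_VERSION)
    return "libc++";     // e.g. clang++ -stdlib=libc++, or the default on macOS
#elif defined(__GLIBCXX__)
    return "libstdc++";  // e.g. g++, or clang++ on most Linux distributions
#else
    return "unknown";
#endif
}
```

Printing this value next to the measured throughput makes it easier to tell a library effect from a compiler effect.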
Verified. I have updated the blog post.
I love your ‘quick test in Go’.
Have you tried to apply “std::ios::sync_with_stdio(false);”? See https://stackoverflow.com/a/9026594/
I do; please see the source code on GitHub.
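For reference, the setting discussed in this exchange is usually applied along these lines. This is a sketch under assumptions: the helper names `use_unsynced_cin` and `drain_stream` and the 64 KiB buffer are illustrative and not taken from the post's GitHub code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <sstream>  // std::istringstream, handy for trying drain_stream without a pipe
#include <string>
#include <vector>

// Decouple the C++ streams from C stdio. This should run before the first
// input or output operation on the standard streams.
void use_unsynced_cin() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(nullptr);  // do not flush std::cout before every read from std::cin
}

// Read a stream to the end in fixed-size blocks and return the byte count.
std::uint64_t drain_stream(std::istream& in, std::size_t buffer_size = 1 << 16) {
    assert(buffer_size > 0);
    std::vector<char> buffer(buffer_size);
    std::uint64_t total = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        total += static_cast<std::uint64_t>(in.gcount());  // the last block may be partial
    }
    return total;
}
```

A benchmark would call `use_unsynced_cin()` first and then `drain_stream(std::cin)`. As other comments here report, this setting alone may not close the gap on every standard library.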
Interesting question, thanks!
https://gist.github.com/grahamking/a1bd00581fd15908338ee65f7937cbf1
“Plumbing” sounds so much like waste.
But pipes were even used to send messages such as orders in a factory with quite some success: https://en.wikipedia.org/wiki/Pneumatic_tube and these would commonly be placed vertically.
You probably should not send data like this in production anyway. The right way may be to use a dedicated library that lets you stream data from one application to another. Maybe this library:
https://adios2.readthedocs.io/en/latest/engines/engines.html#sst-sustainable-staging-transport