I wonder how much of a speed-up it would be to memory-map the file (mmap) and implement getline as a re-entrant function with just pointer arithmetic (string_view?).
My benchmark avoids disk access entirely.
But it does not avoid the syscalls that a pre-faulted mapped view of the file would avoid, no?
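As a rough illustration of the pointer-arithmetic getline victor is describing: assuming the whole file already sits in memory (mmap'ed or otherwise), a re-entrant line splitter can hand out string_view slices without copying anything. The names below are made up for illustration.

#include <string_view>
#include <optional>

// Hypothetical re-entrant "getline" over an in-memory buffer: returns the next
// line as a view into the buffer and advances the cursor; no allocation, no copy.
std::optional<std::string_view> next_line(std::string_view buffer, size_t& pos) {
    if (pos >= buffer.size()) return std::nullopt;           // end of input
    size_t end = buffer.find('\n', pos);
    if (end == std::string_view::npos) end = buffer.size();  // last line without '\n'
    std::string_view line = buffer.substr(pos, end - pos);
    pos = end + 1;                                            // skip past the newline
    return line;
}

Each call is O(line length) and reads the mapped pages directly, which is where the page-fault costs discussed further down come in.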
Travis Downs says:
The benchmark is entirely in userspace: it is not reading from a file at all, but building a long string in memory and passing it to getline through a std::istringstream; getline works on any stream, not just files.
Travis Downs says:
I.e., Daniel’s point is “even cutting files entirely out of the equation, getline is still limited to 2 GB/s”.
That’s right. My claim is that the getline function can be a bottleneck.
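For readers who want to reproduce the setup Travis describes, a minimal in-memory harness looks roughly like this (a sketch, not Daniel's actual benchmark code; the line contents and counts are arbitrary):

#include <sstream>
#include <string>
#include <iostream>
#include <chrono>

int main() {
    // Build a large block of synthetic "lines" entirely in memory.
    std::string data;
    for (int i = 0; i < 1000000; ++i)
        data += "some line of text that stands in for real input\n";

    std::istringstream in(data);       // no file, no disk, no syscalls
    std::string line;
    size_t bytes = 0;

    auto start = std::chrono::steady_clock::now();
    while (std::getline(in, line))
        bytes += line.size();          // "do something" with each line
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    std::cout << (bytes / secs) / 1e9 << " GB/s\n";
}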
victor Bogado da Silva Lins says:
But then the whole stringstream machinery takes hold of the process. Iostreams use virtual calls all over the place, and that is known to break the CPU pipeline.
Do you expect that getline would be faster on an ifstream?
There is no file involved.
I think a better example would help drive the point home, since if you only care about the size of the file, fseek+ftell or fstat should be O(1) instead of O(n).
(Unless you really need the size of the file excluding newlines, but that seems to have somewhat limited utility)
Implicitly, I am inviting the reader to fill in the function and actually do something with the strings; that’s the homework assignment.
-.- says:
Is getline implemented in libstdc++? If so, one would think that its performance would definitely depend on what C++ library implementation you’re using.
There could also be overheads when dealing with streams and strings, but I’m not knowledgeable enough with C++ to really know.
Travis Downs says:
getline is a C function, so it’s not implemented in libstdc++, but rather in libc or whatever MSVCRT variant is used on Windows.
Travis Downs says:
Never mind, I was mixing it up with the identically named getline(3) function in POSIX. Daniel is talking about std::getline in C++, and yes, it would be implemented in libstdc++.
Yes: I expect that the performance depends on the standard library. I use docker with a gcc container.
Max Lybbert says:
C++ streams are locale-aware, and have plenty of virtual methods (so that it’s possible to have streams that get their input from files, and others that get their input from strings, etc.), and C++ strings allocate memory when needed.
That is, there are plenty of opportunities for overhead in std::getline. Whenever I write C++ and read from a file, I honestly ask myself “do I need this input to be fast, or just convenient?”
-.- says:
Thanks for the info!
Travis Downs says:
It varies by OS, but there are some hard limits depending on how the “IO” is done (in quotes, because it might not involve any actual IO in the usual case where the pages are already cached).
If you use mmap, you need to fault in every 4 KiB page you read. Linux has “fault-around”, which maps in several pages for every fault, but you still end up limited to about 10 GB/s.
The situation is worse for read(2), which needs to make a system call every time it fills a buffer. No doubt getline() is ultimately using read(2) or something equivalent under the covers. Every system call costs ~1000 cycles or more after Spectre and Meltdown, so you need buffers large enough to amortize that cost – but large buffers only help to a point: too large and you start missing in cache. Finally, the kernel has to copy from the page cache into your process. I measured limits around 6 GB/s, although I haven’t retested since Spectre & Meltdown.
OTOH, if you can get your file mapped into 2 MiB pages, the speed limit is 100 GB/s or more for mmap(). I’m not sure whether that’s supported by any file system other than tmpfs yet.
All of those limits are significantly higher than 2 GB/s, so I guess this benchmark is mostly CPU-bound inside the getline call.
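To make the read(2)-with-a-big-buffer point concrete, here is a sketch of the amortized pattern (the 1 MB buffer size and the newline counting are arbitrary placeholders, not anything measured in this thread):

#include <fcntl.h>
#include <unistd.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // One syscall per ~1 MB keeps the per-byte syscall cost negligible,
    // while the buffer is still small enough to stay cache-friendly.
    std::vector<char> buf(1 << 20);
    unsigned long long newlines = 0;
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        for (ssize_t i = 0; i < n; ++i)
            newlines += (buf[i] == '\n');   // stand-in for real parsing
    }
    close(fd);
    printf("%llu lines\n", newlines);
}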
Wilco says:
It’s worth pointing out (again) that this does not test IO at all. Getline is just memchr + memcpy with buffer overflow checking – the benchmark spends more time in memchr than in getline.
If I do a loop with memchr to find the end of a line followed by memcpy into a fixed buffer, it is 70% faster. But with no checks whatsoever, it’s not safe or robust…
However if I increase the average string size by 10x, the C++ version becomes more than 4 times as fast and the bare-bones version is 3 times as fast. The C++ overhead is now less than 25%.
So clearly the C++ overhead is not what is killing us here; it’s the cost of doing many small, unpredictable memchr/memcpy calls. Placing the line ends exactly 80 characters apart doubles the performance of the memchr/memcpy loop.
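A sketch of the memchr + memcpy loop described above (not Wilco’s actual code; it assumes the input already sits in one buffer, and that no line reaches the size of the scratch buffer):

#include <cstring>

// Walk an in-memory buffer, copying each line into a fixed scratch buffer.
// There is no bounds check on the copy, so any line of sizeof(scratch) bytes
// or more would overflow: this is exactly the "not safe or robust" caveat.
void scan_lines(const char* data, size_t size) {
    char scratch[4096];
    const char* p = data;
    const char* end = data + size;
    while (p < end) {
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        if (!nl) nl = end;                 // final line without a newline
        size_t len = nl - p;
        memcpy(scratch, p, len);           // "getline" without the checks
        scratch[len] = '\0';
        // ... do something with scratch ...
        p = nl + 1;
    }
}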
Hung Dang says:
From my experience, there are two things that can speed up getline. First, a faster memchr: this is my simple AVX2 implementation, https://github.com/hungptit/utils/blob/master/src/memchr.hpp, which has been used in fast-wc. Second, manage all memory-related operations in userspace if possible. Below are simple benchmark results for a very large log file stored on a memory-backed disk, demonstrating that both wc and fast-wc can process lines at about 4 GB/s.
CPU
processor : 87
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
stepping : 1
microcode : 0xb00002e
cpu MHz : 2200.081
cache size : 56320 KB
wc
Linux> time wc /log/workqueue-execution-2019-06-17_00000
^C15.32user 0.22system 0:15.56elapsed 99%CPU (0avgtext+0avgdata 672maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps
Linux> time wc -l /log/workqueue-execution-2019-06-17_00000
9333291 /log/workqueue-execution-2019-06-17_00000
1.48user 4.21system 0:05.70elapsed 99%CPU (0avgtext+0avgdata 644maxresident)k
0inputs+0outputs (0major+191minor)pagefaults 0swaps
fast-wc
Linux> ls -l /log/workqueue-execution-2019-06-17_00000
-rw-r--r-- 1 hdang users 21650199150 Jun 19 15:08 /log/workqueue-execution-2019-06-17_00000
Linux> time fast-wc /log/workqueue-execution-2019-06-17_00000
Number of lines: 9333291
Max line length: 978370
Min line length: 132
File size: 21650199150
1.06user 3.82system 0:04.89elapsed 99%CPU (0avgtext+0avgdata 684maxresident)k
0inputs+0outputs (0major+188minor)pagefaults 0swaps
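For the curious, the core of an AVX2 memchr of the kind Hung links above tends to look roughly like the sketch below (a simplified illustration, not Hung’s implementation; it assumes an x86-64 target with AVX2 and a GCC/Clang compiler for __builtin_ctz):

#include <immintrin.h>
#include <cstring>

// Minimal AVX2 memchr: compare 32 bytes at a time, fall back to the scalar
// library memchr for the tail.
const void* memchr_avx2(const void* s, int c, size_t n) {
    const unsigned char* p = static_cast<const unsigned char*>(s);
    __m256i needle = _mm256_set1_epi8(static_cast<char>(c));
    while (n >= 32) {
        __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
        unsigned mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
        if (mask) return p + __builtin_ctz(mask);   // lowest set bit = first match
        p += 32;
        n -= 32;
    }
    return memchr(p, c, n);   // scalar tail
}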
Hung Dang says:
The format of my above comment is screwed. Is there any way to fix it?
Fixed.
Thanks a lot, Daniel.
Here’s the fastest version I’ve been able to come up with for handling multi-line text; after a single execution rax will contain the current bufferOffset and rdx will contain the isQuotedSequence indicator. Actual handling of the string is left as an exercise to the reader =P.
https://gist.github.com/Kittoes0124/c2288ec4daee90f549dfe73430494847
Would love to see you improve upon it!
I’m not sure what this is supposed to do; however, this code can be sped up a lot. For example, with unrolling you only really need to check the buffer end once every N characters. Similarly, you only need one compare to check all the special cases, e.g. if ((ch ^ 2) <= 0x20) will catch them all (with an unsigned ch, that test is true exactly for the bytes 0x00–0x1F and the double quote 0x22, i.e. every control character including ‘\n’ and ‘\r’, plus the quote). And when handling the special cases you’d use logical operations and conditional moves to avoid branch mispredictions. An advanced version would use SIMD instructions to do all this on N characters at a time.
Christopher Chang says:
I wrote a low-level ReadLineStream interface (https://github.com/chrchang/plink-ng/blob/master/2.0/plink2_decompress.h#L188 ) to take care of this issue, along with transparent zstd/bgzf decompression. It’s not directly usable in other programs, but it shouldn’t take that much work to make it into a normal library.
Using setvbuf(file, 0, _IOFBF, 102400); can do wonders to improve reads on a C FILE stream. Last time I measured (~20 years ago?), the knee of the curve was under 100K.
I simply have found no reason to use the C++ streams, so I cannot comment on any equivalent.
For bare-metal performance, I do read() into a very large buffer, and parse out the lines, as in:
https://github.com/pbannister/wide-finder/blob/master/ZReader.cpp
Recently used a very similar approach to parse raw radar data and dump in text form. Must be very fast to be useful to the customer, as the data can easily be tens of gigabytes per run. Tool is bottlenecked on storage performance, on the target system (eight spinning disks in RAID 0). Not code I can share, in this case.
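To make the earlier setvbuf suggestion concrete, here is a minimal sketch (the 1 MB buffer size is an arbitrary choice; setvbuf has to be called before the first read on the stream):

#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    FILE* f = fopen(argv[1], "r");
    if (!f) return 1;

    // With a null buffer pointer, the library allocates the buffer itself;
    // the stream becomes fully buffered with a 1 MB buffer.
    setvbuf(f, nullptr, _IOFBF, 1 << 20);

    char line[65536];
    unsigned long long count = 0;
    while (fgets(line, sizeof(line), f))
        ++count;
    printf("%llu lines\n", count);
    fclose(f);
}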
Maxim Egorushkin says:
You may like to call std::ios_base::sync_with_stdio(false) before reading, it often speeds things up.
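For example (a minimal sketch; note that sync_with_stdio only affects the standard streams such as std::cin, not file streams):

#include <iostream>
#include <string>

int main() {
    // Decouple the C++ standard streams from C stdio; this can noticeably
    // speed up line-by-line reads from standard input.
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);   // optional: don't flush cout before each read

    std::string line;
    unsigned long long count = 0;
    while (std::getline(std::cin, line))
        ++count;
    std::cout << count << " lines\n";
}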