Daniel Lemire's blog

How fast is getline in C++?

28 thoughts on “How fast is getline in C++?”

  1. Joey says:

    I wonder how much of a speed-up it would be to memory-map the file (mmap) and implement getline as a re-entrant function using just pointer arithmetic (string_view?).

    1. My benchmark avoids disk access entirely.

      1. CAFxX says:

        But it does not avoid the syscalls that a pre-faulted mapped view of the file would avoid, no?

        1. Travis Downs says:

          The benchmark is entirely in userspace: it is not reading from a file at all, but building a long string in memory and handing it to getline through a std::istringstream; getline works on any stream, not just files.
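
          (For concreteness, a minimal sketch of that kind of in-memory benchmark; the line content and counts here are made up:)

          #include <chrono>
          #include <iostream>
          #include <sstream>
          #include <string>

          int main() {
            // Build a long multi-line string in memory: no disk involved.
            std::string data;
            for (int i = 0; i < 1000000; i++) {
              data.append("this is a line of text of a plausible length\n");
            }
            std::istringstream in(data);
            std::string line;
            long long volume = 0;
            auto start = std::chrono::steady_clock::now();
            while (std::getline(in, line)) {
              volume += line.size();
            }
            auto end = std::chrono::steady_clock::now();
            double secs = std::chrono::duration<double>(end - start).count();
            std::cout << volume / secs / 1e9 << " GB/s" << std::endl;
            return 0;
          }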

          1. Travis Downs says:

            I.e., Daniel’s point is “even cutting files entirely out of the equation, getline is still limited to 2 GB/s”.

            1. That’s right. My claim is that the getline function can be a bottleneck.

          2. victor Bogado da Silva Lins says:

            But then the whole stringstream machinery takes hold of the process. Iostreams use virtual calls all over the place, and that is known to break the CPU pipeline.

            1. Do you expect that getline would be faster on an ifstream?

        2. There is no file involved.

  2. CAFxX says:

    I think a better example would help drive the point home, since if you only care about the size of the file, fseek+ftell or fstat should be O(1) instead of O(n).

    (Unless you really need the size of the file excluding newlines, but that seems to have somewhat limited utility)
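
    (For reference, a sketch of the O(1) route on a POSIX system; error handling is omitted:)

    #include <sys/stat.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      struct stat st;
      if (argc > 1 && stat(argv[1], &st) == 0) {
        // st_size is the file size in bytes: no need to scan the contents.
        printf("%lld bytes\n", (long long)st.st_size);
      }
      return 0;
    }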

    1. Implicitly, I am inviting the reader to fill in the function and actually do something with the strings; that’s the homework assignment.

  3. -.- says:

    Is getline implemented in libstdc++? If so, one would think that its performance would definitely depend on what C++ library implementation you’re using.

    There could also be overheads when dealing with streams and strings, but I’m not knowledgeable enough with C++ to really know.

    1. Travis Downs says:

      getline is a C function, so it’s not implemented in libstdc++, but rather in libc or whatever MSVCRT thing is used on Windows.

      1. Travis Downs says:

        Never mind, I was mixing it up with the identically named getline(3) function in POSIX. Daniel is talking about std::getline in C++, and yes, it would be implemented in libstdc++.

    2. Yes: I expect that the performance depends on the standard library. I use docker with a gcc container.

    3. Max Lybbert says:

      C++ streams are locale-aware, and have plenty of virtual methods (so that it’s possible to have streams that get their input from files, and others that get their input from strings, etc.), and C++ strings allocate memory when needed.

      That is, there are plenty of opportunities for overhead in std::getline. When I write C++, I honestly ask “do I need this input to be fast or convenient?” any time I read from a file.

      1. -.- says:

        Thanks for the info!

  4. Travis Downs says:

    It varies by OS, but there are some hard limits depending on how the “IO” is done (in quotes, because it might not involve any actual IO in the usual case where the pages are cached).

    If you use mmap, you need to fault in every 4 KB page you read. Linux has “fault-around”, which means several pages are mapped in for every fault, but you still end up limited to about 10 GB/s.

    The situation is worse for read(2), which needs to make a system call every time it fills a buffer. No doubt getline() is ultimately using read(2) or something equivalent under the covers. Every system call costs ~1000 cycles or more after Spectre and Meltdown, so you need buffers large enough to amortize the cost – but large buffers only work up to a point: too large and you start missing in cache. Finally, the kernel has to copy from the page cache into your process. I measured limits around 6 GB/s, although I haven’t tried since Spectre & Meltdown.

    OTOH, if you can get your file mapped with 2 MiB pages, the speed limit is 100 GB/s or more for mmap(). I’m not sure that’s supported by any file system other than tmpfs yet.

    All of those limits are significantly higher than 2 GB/s, so I guess this benchmark is mostly CPU-bound inside the getline call.
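
    (A sketch of the mmap route described above; MAP_POPULATE, which pre-faults the mapping, is a Linux-specific assumption, and error handling is omitted:)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main(int argc, char **argv) {
      int fd = open(argv[1], O_RDONLY);
      struct stat st;
      fstat(fd, &st);
      // Pre-fault all pages up front so the scan itself takes no page faults.
      char *p = (char *)mmap(nullptr, st.st_size, PROT_READ,
                             MAP_PRIVATE | MAP_POPULATE, fd, 0);
      size_t lines = 0;
      for (char *q = p, *end = p + st.st_size; q < end; q++) {
        q = (char *)memchr(q, '\n', end - q);
        if (q == nullptr) break;
        lines++;
      }
      printf("%zu lines\n", lines);
      munmap(p, st.st_size);
      close(fd);
      return 0;
    }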

  5. Wilco says:

    It’s worth pointing out (again) that this does not test IO at all. Getline is just memchr + memcpy with buffer overflow checking – the benchmark spends more time in memchr than in getline.

    If I do a loop with memchr to find the end of a line followed by memcpy into a fixed buffer, it is 70% faster. But with no checks whatsoever, it’s not safe or robust…

    However if I increase the average string size by 10x, the C++ version becomes more than 4 times as fast and the bare-bones version is 3 times as fast. The C++ overhead is now less than 25%.

    So clearly the C++ overhead is not what is killing us here; it’s the cost of doing many small, unpredictable memchr/memcpy calls. Placing the line ends exactly 80 characters apart doubles the performance of the memchr/memcpy loop.
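
    (A sketch of that bare-bones loop, assuming the whole input is already in memory and that no line exceeds the fixed buffer, which is exactly the missing safety check:)

    #include <cstring>

    // Count and copy lines with no safety checks: the unsafe shortcut above.
    size_t scan_lines(const char *data, size_t len, char *buf) {
      size_t lines = 0;
      const char *p = data, *end = data + len;
      while (p < end) {
        const char *nl = (const char *)memchr(p, '\n', end - p);
        if (nl == nullptr) nl = end;
        memcpy(buf, p, nl - p); // no bounds check on buf: not robust
        lines++;
        p = nl + 1;
      }
      return lines;
    }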

  6. Hung Dang says:

    From my experience, there are two things that can speed up getline: a faster memchr, and managing all memory-related operations in userspace where possible. This is my simple AVX2 implementation of memchr, which is used in fast-wc: https://github.com/hungptit/utils/blob/master/src/memchr.hpp. Below are simple benchmark results for a very large log file on a memory-backed disk, demonstrating that both wc and fast-wc can process lines at about 4 GB/s. (A sketch of the AVX2 idea follows the benchmark output.)

    CPU
    processor : 87
    vendor_id : GenuineIntel
    cpu family : 6
    model : 79
    model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
    stepping : 1
    microcode : 0xb00002e
    cpu MHz : 2200.081
    cache size : 56320 KB

    wc
    Linux> time wc /log/workqueue-execution-2019-06-17_00000
    ^C15.32user 0.22system 0:15.56elapsed 99%CPU (0avgtext+0avgdata 672maxresident)k
    0inputs+0outputs (0major+197minor)pagefaults 0swaps
    Linux> time wc -l /log/workqueue-execution-2019-06-17_00000
    9333291 /log/workqueue-execution-2019-06-17_00000
    1.48user 4.21system 0:05.70elapsed 99%CPU (0avgtext+0avgdata 644maxresident)k
    0inputs+0outputs (0major+191minor)pagefaults 0swaps

    fast-wc
    Linux> ls -l /log/workqueue-execution-2019-06-17_00000
    -rw-r--r-- 1 hdang users 21650199150 Jun 19 15:08 /log/workqueue-execution-2019-06-17_00000
    Linux> time fast-wc /log/workqueue-execution-2019-06-17_00000
    Number of lines: 9333291
    Max line length: 978370
    Min line length: 132
    File size: 21650199150
    1.06user 3.82system 0:04.89elapsed 99%CPU (0avgtext+0avgdata 684maxresident)k
    0inputs+0outputs (0major+188minor)pagefaults 0swaps
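
    (Not the linked implementation, just a minimal sketch of the AVX2 idea referenced above: compare 32 bytes at a time against the needle byte and use a movemask to locate the first match; compile with -mavx2 on GCC or Clang:)

    #include <immintrin.h>
    #include <cstddef>

    // Find the first '\n' in s[0..len), 32 bytes per iteration with AVX2;
    // the tail is handled by a scalar loop.
    const char *find_newline(const char *s, size_t len) {
      const __m256i needle = _mm256_set1_epi8('\n');
      size_t i = 0;
      for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(s + i));
        unsigned mask =
            (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
        if (mask != 0) return s + i + __builtin_ctz(mask);
      }
      for (; i < len; i++)
        if (s[i] == '\n') return s + i;
      return nullptr;
    }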

    1. Hung Dang says:

      The formatting of my comment above is screwed up. Is there any way to fix it?

      1. Fixed.

        1. Hung Dang says:

          Thanks a lot, Daniel.

  7. David Smith says:

    Here’s the fastest version I’ve been able to come up with for handling multi-line text; after a single execution rax will contain the current bufferOffset and rdx will contain the isQuotedSequence indicator. Actual handling of the string is left as an exercise to the reader =P.

    https://gist.github.com/Kittoes0124/c2288ec4daee90f549dfe73430494847

    Would love to see you improve upon it!

    1. Wilco says:

      I’m not sure what this is supposed to do; however, this code can be sped up a lot. For example, with unrolling you only really need to check for the buffer end once every N characters. Similarly, you only need one compare to check for all the special cases, e.g. if ((ch ^ 2) <= 0x20) will catch all of them. And when handling the special cases you’d use logical operations and conditional moves to avoid branch mispredictions. An advanced version would use SIMD instructions to do all this on N characters at a time.
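
      (A sketch of that single-compare trick; which characters count as special depends on the gist’s semantics, but as an unsigned test it folds the quote (0x22), newline (0x0A), carriage return (0x0D), and the other control bytes into one branch:)

      // True for all bytes below 0x20 and for '"' (0x22); false for ' ' and above.
      static inline bool maybe_special(unsigned char ch) {
        return (unsigned char)(ch ^ 2) <= 0x20;
      }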

  8. Christopher Chang says:

    I wrote a low-level ReadLineStream interface (https://github.com/chrchang/plink-ng/blob/master/2.0/plink2_decompress.h#L188 ) to take care of this issue, along with transparent zstd/bgzf decompression. It’s not directly usable in other programs, but it shouldn’t take that much work to make it into a normal library.

  9. Using setvbuf(file, 0, _IOFBF, 102400) can do wonders to improve reads on a C FILE stream. Last time I measured (~20 years ago?), the knee of the curve was under 100 KB.
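
    (A minimal sketch of that setvbuf call in context; the fixed line buffer and command-line file name are illustrative:)

    #include <cstdio>

    int main(int argc, char **argv) {
      FILE *file = fopen(argv[1], "r");
      // Fully buffer the stream with a 100 KB buffer allocated by stdio.
      setvbuf(file, 0, _IOFBF, 102400);
      char line[4096];
      size_t lines = 0;
      while (fgets(line, sizeof(line), file) != nullptr) lines++;
      printf("%zu lines\n", lines);
      fclose(file);
      return 0;
    }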

    I simply have found no reason to use the C++ streams, so I cannot comment on any equivalent.

    For bare-metal performance, I do read() into a very large buffer, and parse out the lines, as in:

    https://github.com/pbannister/wide-finder/blob/master/ZReader.cpp

    Recently I used a very similar approach to parse raw radar data and dump it in text form. It must be very fast to be useful to the customer, as the data can easily be tens of gigabytes per run. The tool is bottlenecked on storage performance on the target system (eight spinning disks in RAID 0). Not code I can share, in this case.
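
    (Not the ZReader code itself, just a minimal sketch of the pattern: refill a large buffer with read(), scan it for newlines, and carry any incomplete last line to the front; assumes no line exceeds the buffer, and error handling is omitted:)

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main(int argc, char **argv) {
      int fd = open(argv[1], O_RDONLY);
      static char buf[1 << 20]; // 1 MB buffer
      size_t carry = 0, lines = 0;
      ssize_t n;
      while ((n = read(fd, buf + carry, sizeof(buf) - carry)) > 0) {
        char *p = buf, *end = buf + carry + n;
        char *nl;
        while ((nl = (char *)memchr(p, '\n', end - p)) != nullptr) {
          lines++;      // a complete line is [p, nl)
          p = nl + 1;
        }
        carry = end - p;        // bytes of an unfinished line
        memmove(buf, p, carry); // move them to the front for the next read
      }
      printf("%zu lines\n", lines);
      close(fd);
      return 0;
    }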

  10. Maxim Egorushkin says:

    You may like to call std::ios_base::sync_with_stdio(false) before reading; it often speeds things up.
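
    (E.g., at the top of main, before any stream is touched:)

    #include <iostream>

    int main() {
      // Decouple the C++ standard streams from C stdio before any I/O.
      std::ios_base::sync_with_stdio(false);
      // ... reading code goes here ...
      return 0;
    }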