I wonder how much of a speed-up it would be to memory-map the file (mmap) and implement getline as a re-entrant function with just pointer arithmetic (string_view?).
My benchmark avoids disk access entirely.
But it does not avoid the syscalls that a pre-faulted mapped view of the file would avoid, no?
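As a rough illustration of the pointer-arithmetic getline victor is describing: assuming the whole file already sits in memory (mmap'ed or otherwise), a re-entrant line splitter can hand out string_view slices without copying anything. The names below are made up for illustration.

#include <string_view>
#include <optional>

// Hypothetical re-entrant "getline" over an in-memory buffer: returns the next
// line as a view into the buffer and advances the cursor; no allocation, no copy.
std::optional<std::string_view> next_line(std::string_view buffer, size_t& pos) {
    if (pos >= buffer.size()) return std::nullopt;           // end of input
    size_t end = buffer.find('\n', pos);
    if (end == std::string_view::npos) end = buffer.size();  // last line without '\n'
    std::string_view line = buffer.substr(pos, end - pos);
    pos = end + 1;                                            // skip past the newline
    return line;
}

Each call is O(line length) and reads the mapped pages directly, which is where the page-fault costs discussed further down come in.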
Travis Downs says:
The benchmark is entirely in userspace: it is not reading from a file at all, but building a long string in memory and passing it to getline through a std::istringstream; getline works on any stream, not just files.
Travis Downs says:
I.e., Daniel’s point is “even cutting files entirely out of the equation, getline is still limited to 2 GB/s”.
That’s right. My claim is that the getline function can be a bottleneck.
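For readers who want to reproduce the setup Travis describes, a minimal in-memory harness looks roughly like this (a sketch, not Daniel's actual benchmark code; the line contents and counts are arbitrary):

#include <sstream>
#include <string>
#include <iostream>
#include <chrono>

int main() {
    // Build a large block of synthetic "lines" entirely in memory.
    std::string data;
    for (int i = 0; i < 1000000; ++i)
        data += "some line of text that stands in for real input\n";

    std::istringstream in(data);       // no file, no disk, no syscalls
    std::string line;
    size_t bytes = 0;

    auto start = std::chrono::steady_clock::now();
    while (std::getline(in, line))
        bytes += line.size();          // "do something" with each line
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    std::cout << (bytes / secs) / 1e9 << " GB/s\n";
}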
victor Bogado da Silva Lins says:
But then the whole stringstream machinery takes hold of the process. Iostreams use virtual calls all over the place, and that is known to break the CPU pipeline.
Do you expect that getline would be faster on an ifstream?
There is no file involved.
I think a better example would help drive the point home, since if you only care about the size of the file, fseek+ftell or fstat should be O(1) instead of O(n).
(Unless you really need the size of the file excluding newlines, but that seems to have somewhat limited utility)
Implicitly, I am inviting the reader to fill in the function and actually do something with the strings; that’s the homework assignment.
-.- says:
Is getline implemented in libstdc++? If so, one would think that its performance would definitely depend on what C++ library implementation you’re using.
There could also be overheads when dealing with streams and strings, but I’m not knowledgeable enough with C++ to really know.
Travis Downs says:
getline is a C function, so it’s not implemented in libstdc++, but rather in libc or whatever MSVCRT variant is used on Windows.
Travis Downs says:
Never mind, I was mixing it up with the identically named getline(3) function in POSIX. Daniel is talking about std::getline in C++, and yes, it would be implemented in libstdc++.
Yes: I expect that the performance depends on the standard library. I use docker with a gcc container.
Max Lybbert says:
C++ streams are locale-aware, and have plenty of virtual methods (so that it’s possible to have streams that get their input from files, and others that get their input from strings, etc.), and C++ strings allocate memory when needed.
That is, there are plenty of opportunities for overhead in std::getline. Whenever I write C++ and read from a file, I honestly ask myself “do I need this input to be fast, or just convenient?”
-.- says:
Thanks for the info!
Travis Downs says:
It varies by OS, but there are some hard limits depending on how the “IO” is done (in quotes, because it might not involve any actual IO in the usual case where the pages are already cached).
If you use mmap, you need to fault in every 4 KiB page you read. Linux has “fault-around”, which maps in several pages for every fault, but you still end up limited to about 10 GB/s.
The situation is worse for read(2), which needs to make a system call every time it fills a buffer. No doubt getline() is ultimately using read(2) or something equivalent under the covers. Every system call costs ~1000 cycles or more after Spectre and Meltdown, so you need buffers large enough to amortize that cost – but large buffers only help to a point: too large and you start missing in cache. Finally, the kernel has to copy from the page cache into your process. I measured limits around 6 GB/s, although I haven’t retested since Spectre & Meltdown.
OTOH, if you can get your file mapped into 2 MiB pages, the speed limit is 100 GB/s or more for mmap(). I’m not sure whether that’s supported by any file system other than tmpfs yet.
All of those limits are significantly higher than 2 GB/s, so I guess this benchmark is mostly CPU-bound inside the getline call.
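To make the read(2)-with-a-big-buffer point concrete, here is a sketch of the amortized pattern (the 1 MB buffer size and the newline counting are arbitrary placeholders, not anything measured in this thread):

#include <fcntl.h>
#include <unistd.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // One syscall per ~1 MB keeps the per-byte syscall cost negligible,
    // while the buffer is still small enough to stay cache-friendly.
    std::vector<char> buf(1 << 20);
    unsigned long long newlines = 0;
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        for (ssize_t i = 0; i < n; ++i)
            newlines += (buf[i] == '\n');   // stand-in for real parsing
    }
    close(fd);
    printf("%llu lines\n", newlines);
}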
Wilco says:
It’s worth pointing out (again) that this does not test IO at all. Getline is just memchr + memcpy with buffer overflow checking – the benchmark spends more time in memchr than in getline.
If I do a loop with memchr to find the end of a line followed by memcpy into a fixed buffer, it is 70% faster. But with no checks whatsoever, it’s not safe or robust…
However if I increase the average string size by 10x, the C++ version becomes more than 4 times as fast and the bare-bones version is 3 times as fast. The C++ overhead is now less than 25%.
So clearly the C++ overhead is not what is killing us here; it’s the cost of doing many small, unpredictable memchr/memcpy calls. Placing the line ends exactly 80 characters apart doubles the performance of the memchr/memcpy loop.
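A sketch of the memchr + memcpy loop described above (not Wilco’s actual code; it assumes the input already sits in one buffer, and that no line reaches the size of the scratch buffer):

#include <cstring>

// Walk an in-memory buffer, copying each line into a fixed scratch buffer.
// There is no bounds check on the copy, so any line of sizeof(scratch) bytes
// or more would overflow: this is exactly the "not safe or robust" caveat.
void scan_lines(const char* data, size_t size) {
    char scratch[4096];
    const char* p = data;
    const char* end = data + size;
    while (p < end) {
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        if (!nl) nl = end;                 // final line without a newline
        size_t len = nl - p;
        memcpy(scratch, p, len);           // "getline" without the checks
        scratch[len] = '\0';
        // ... do something with scratch ...
        p = nl + 1;
    }
}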
Hung Dang says:
From my experience, there are two things that can speed up getline. First, a faster memchr: this is my simple AVX2 implementation, https://github.com/hungptit/utils/blob/master/src/memchr.hpp, which has been used in fast-wc. Second, manage all memory-related operations in userspace if possible. Below are simple benchmark results for a very large log file stored on a memory-backed disk, demonstrating that both wc and fast-wc can process lines at about 4 GB/s.
CPU
processor : 87
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
stepping : 1
microcode : 0xb00002e
cpu MHz : 2200.081
cache size : 56320 KB
wc
Linux> time wc /log/workqueue-execution-2019-06-17_00000
^C15.32user 0.22system 0:15.56elapsed 99%CPU (0avgtext+0avgdata 672maxresident)k
0inputs+0outputs (0major+197minor)pagefaults 0swaps
Linux> time wc -l /log/workqueue-execution-2019-06-17_00000
9333291 /log/workqueue-execution-2019-06-17_00000
1.48user 4.21system 0:05.70elapsed 99%CPU (0avgtext+0avgdata 644maxresident)k
0inputs+0outputs (0major+191minor)pagefaults 0swaps
fast-wc
Linux> ls -l /log/workqueue-execution-2019-06-17_00000
-rw-r--r-- 1 hdang users 21650199150 Jun 19 15:08 /log/workqueue-execution-2019-06-17_00000
Linux> time fast-wc /log/workqueue-execution-2019-06-17_00000
Number of lines: 9333291
Max line length: 978370
Min line length: 132
File size: 21650199150
1.06user 3.82system 0:04.89elapsed 99%CPU (0avgtext+0avgdata 684maxresident)k
0inputs+0outputs (0major+188minor)pagefaults 0swaps
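For the curious, the core of an AVX2 memchr of the kind Hung links above tends to look roughly like the sketch below (a simplified illustration, not Hung’s implementation; it assumes an x86-64 target with AVX2 and a GCC/Clang compiler for __builtin_ctz):

#include <immintrin.h>
#include <cstring>

// Minimal AVX2 memchr: compare 32 bytes at a time, fall back to the scalar
// library memchr for the tail.
const void* memchr_avx2(const void* s, int c, size_t n) {
    const unsigned char* p = static_cast<const unsigned char*>(s);
    __m256i needle = _mm256_set1_epi8(static_cast<char>(c));
    while (n >= 32) {
        __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
        unsigned mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
        if (mask) return p + __builtin_ctz(mask);   // lowest set bit = first match
        p += 32;
        n -= 32;
    }
    return memchr(p, c, n);   // scalar tail
}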
Hung Dang says:
The format of my above comment is screwed. Is there any way to fix it?
Fixed.
Thanks a lot, Daniel.
Here’s the fastest version I’ve been able to come up with for handling multi-line text; after a single execution rax will contain the current bufferOffset and rdx will contain the isQuotedSequence indicator. Actual handling of the string is left as an exercise to the reader =P.
https://gist.github.com/Kittoes0124/c2288ec4daee90f549dfe73430494847
Would love to see you improve upon it!
I’m not sure what this is supposed to do; however, this code can be sped up a lot. For example, with unrolling you only really need to check the buffer end once every N characters. Similarly, you only need one compare to check all the special cases, e.g. if ((ch ^ 2) <= 0x20) will catch them all (with an unsigned ch, that test is true exactly for the bytes 0x00–0x1F and the double quote 0x22, i.e. every control character including ‘\n’ and ‘\r’, plus the quote). And when handling the special cases you’d use logical operations and conditional moves to avoid branch mispredictions. An advanced version would use SIMD instructions to do all this on N characters at a time.
Christopher Chang says:
I wrote a low-level ReadLineStream interface (https://github.com/chrchang/plink-ng/blob/master/2.0/plink2_decompress.h#L188 ) to take care of this issue, along with transparent zstd/bgzf decompression. It’s not directly usable in other programs, but it shouldn’t take that much work to make it into a normal library.
Using setvbuf(file, 0, _IOFBF, 102400); can do wonders to improve reads on a C FILE stream. Last time I measured (~20 years ago?), the knee of the curve was under 100K.
I simply have found no reason to use the C++ streams, so I cannot comment on any equivalent.
For bare-metal performance, I do read() into a very large buffer, and parse out the lines, as in:
https://github.com/pbannister/wide-finder/blob/master/ZReader.cpp
Recently used a very similar approach to parse raw radar data and dump in text form. Must be very fast to be useful to the customer, as the data can easily be tens of gigabytes per run. Tool is bottlenecked on storage performance, on the target system (eight spinning disks in RAID 0). Not code I can share, in this case.
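To make the earlier setvbuf suggestion concrete, here is a minimal sketch (the 1 MB buffer size is an arbitrary choice; setvbuf has to be called before the first read on the stream):

#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    FILE* f = fopen(argv[1], "r");
    if (!f) return 1;

    // With a null buffer pointer, the library allocates the buffer itself;
    // the stream becomes fully buffered with a 1 MB buffer.
    setvbuf(f, nullptr, _IOFBF, 1 << 20);

    char line[65536];
    unsigned long long count = 0;
    while (fgets(line, sizeof(line), f))
        ++count;
    printf("%llu lines\n", count);
    fclose(f);
}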
Maxim Egorushkin says:
You may like to call std::ios_base::sync_with_stdio(false) before reading, it often speeds things up.
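For example (a minimal sketch; note that sync_with_stdio only affects the standard streams such as std::cin, not file streams):

#include <iostream>
#include <string>

int main() {
    // Decouple the C++ standard streams from C stdio; this can noticeably
    // speed up line-by-line reads from standard input.
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);   // optional: don't flush cout before each read

    std::string line;
    unsigned long long count = 0;
    while (std::getline(std::cin, line))
        ++count;
    std::cout << count << " lines\n";
}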