Daniel Lemire's blog


Which is fastest: read, fread, ifstream or mmap?

34 thoughts on “Which is fastest: read, fread, ifstream or mmap?”

  1. Todd Lipcon says:

    Results on my Core i7 laptop don’t match up:

    fread 52.8416 63.133
    fread w sbuffer 53.9027 64.6808
    fread w lbuffer 55.4619 63.2864
    read2 73.746 49.1903
    mmap 78.9516 84.0752
    Cpp 54.5601 60.8912

    (so mmap actually turns out to be fastest)

    When I add MAP_POPULATE so as to prefault the pages, mmap gets even better:

    fread 49.8951 58.354
    fread w sbuffer 50.2688 60.7751
    fread w lbuffer 52.6344 62.8038
    read2 65.793 48.9292
    mmap 106.522 106.855
    Cpp 47.5949 59.6341

    But your point stands that it’s worth benchmarking these things.

    -Todd
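
    For readers unfamiliar with MAP_POPULATE: it is a Linux-specific mmap flag that asks the kernel to prefault the file’s pages at mapping time instead of on first access. Below is a minimal sketch of that idea (hypothetical code, not Todd’s or the benchmark’s; the file name is a placeholder):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        int fd = open("testfile.bin", O_RDONLY);  // placeholder file name
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);
        // MAP_POPULATE prefaults the pages so the scan below does not
        // pay for page faults on first access (Linux-specific flag).
        void *p = mmap(nullptr, st.st_size, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        const unsigned char *data = static_cast<const unsigned char *>(p);
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++) sum += data[i];  // touch every byte
        printf("%ld\n", sum);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }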

  2. Itman says:

    Todd, did you mean “mmap gets even WORSE”? This is very, very strange, because in all the tests that I have heard about, mmap beats everything (by a wide margin), assuming that you test with files “warmed up” and cached by the OS.

  3. Todd Lipcon says:

    No, mmap gets better – higher numbers are better here, unless I’m misreading the benchmark code.

  4. Pierre Barbier de Reuille says:

    I don’t really get why, but I also have an Intel Core i7, and, like Todd, I find mmap to be the fastest on my machine. Here are the results:

    fread 79.1329 82.2618
    fread w sbuffer 81.6359 82.9706
    fread w lbuffer 78.7614 82.0594
    read2 73.0988 48.8926
    mmap 93.9841 94.7367
    Cpp 86.4751 79.8615

    Now, I ran it on a Linux box running Debian testing. Also, as for Todd, adding MAP_POPULATE makes mmap quite a bit faster:

    fread 85.8116 82.6478
    fread w sbuffer 79.5079 82.5918
    fread w lbuffer 82.6412 79.6012
    read2 70.0466 46.6896
    mmap 110.734 125.265
    Cpp 82.4382 76.3553

    (and as you can see, there’s quite a bit of variation from one run to the next).

    -Pierre

  5. That is correct, the little program reports the speed, so higher numbers are better.

    I don’t get better speed with mmap on any of my machines, but if you read the last paragraph of my blog post, I had expected people to get vastly different results.

    Unfortunately, IO is difficult to benchmark reliably.

    I have changed my program to use MAP_POPULATE. It does improve speed quite a bit, but even so, mmap is slower than fread on my machines.

  6. J. Andrew Rogers says:

    The default policy of mmap() is pretty poor for reading large sequential streams because mmap() has no idea what your access pattern is going to be and conservatively straddles the fence on behavior by default. Defining the behavior and policy with madvise() to something other than default is important if performance matters.

    This is one of those cases where setting madvise() to MADV_SEQUENTIAL|MADV_WILLNEED over the file should make a significant difference. In principle, mmap() with madvise() flags properly set should be as fast as any other mechanism since most other mechanisms are using something like this under the hood.

    I am not sure it is apples-to-apples to modify the default buffering behavior of the fread() case but not altering the default access policy of mmap().
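
    For illustration, here is a minimal sketch of the hints described above (hypothetical code, not the benchmark’s; the file name is a placeholder). Note that madvise advice values are not bit flags, so the two hints are issued as separate calls:

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        int fd = open("testfile.bin", O_RDONLY);  // placeholder file name
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);
        void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, st.st_size, MADV_SEQUENTIAL);  // expect a linear scan, read ahead aggressively
        madvise(p, st.st_size, MADV_WILLNEED);    // start bringing the pages in now
        const unsigned char *data = static_cast<const unsigned char *>(p);
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++) sum += data[i];  // touch every byte
        printf("%ld\n", sum);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }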

  7. @Rogers

    Thanks. Even with these hints, I get that mmap is significantly slower. Here is what I get on my desktop:

    $ ./ioaccess

    fread 130.308 122.366
    fread w sbuffer 119.837 122.812
    fread w lbuffer 125.437 122.767
    read2 104.045 71.4784
    mmap 95.8698 43.1566
    fancy mmap 96.5595 77.5446
    Cpp 118.777 116.532

    where fancy mmap is what I get with madvise.

    Of course, there are variations from run to run, but mmap is never faster in my tests.

    I’m testing on a Linux desktop and a Mac laptop. I vary the GCC compiler version, for fun… but no luck. I always find that memory mapping is slower.

    I should stress that another reason to worry about memory mapping is how quickly it can bring down your program. For production code, hard crashes should be a concern.

  8. Itman says:

    Ohhh, I see. I (as usual) confused milliseconds with millions of integers per second.

    My results actually do match those of Daniel (on Linux/CentOS) and mmap beats everything else, but the difference is small, 10-20% (with and without MAP_POPULATE).

    A more interesting scenario would be to re-use the same file many times and not to re-map the data.

  9. Tom says:

    I’m confused what you are trying to measure:

    1) The speed of shuffling data from the buffer cache (i.e. memory) into the process address space?
    2) Or the speed of reading data from disk with the generic IO scheduler through different interfaces (and therefore presumably with different hints to the kernel regarding expected access patterns)?

    If the latter, did everyone running the benchmark flush their buffer caches, e.g. with [1]?

    [1]
    echo 3 > /proc/sys/vm/drop_caches

  10. Peter De Wachter says:

    You’re testing with a file in /tmp. I suspect there’ll be a big difference between a tmpfs and a disk-based file system.

  11. @Tom

    Because the file is so big, it is very unlikely that it resides in the buffer cache. This being said, the benchmark could be improved; that is why I posted the source code on GitHub.

    @Peter

    You’re testing with a file in /tmp. I suspect there’ll be a big difference between a tmpfs and a disk-based file system

    On my machine, files in /tmp are on disk.

  12. Rasterman says:

    Caches, filesystem, IO scheduler, device readahead settings (/sys/devices/…/readahead_kb), etc. all affect this heavily.

    As for stability, unless you jump outside the mmapped memory bounds, only one thing will crash you. It’s not a SEGV, it’s a SIGBUS. You get this when the memory is validly mapped but can’t be accessed, for example when you get I/O errors on your disk. This can be handled via a SIGBUS signal handler: map in a page of /dev/zero over the page with the problem, set a flag, and check this flag at least once per page read. 🙂 Handle failure appropriately for your situation.
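
    For illustration, here is a minimal sketch of the recovery scheme described above (hypothetical code, not Rasterman’s; assumes Linux). The handler overlays the faulting page with an anonymous zero page, the modern equivalent of mapping /dev/zero, and sets a flag for the reading loop to check. Note that mmap is not formally async-signal-safe, so this is a pragmatic trick rather than a guarantee:

    #include <csignal>
    #include <cstdint>
    #include <sys/mman.h>
    #include <unistd.h>

    static volatile sig_atomic_t g_io_error = 0;
    static long g_pagesize = 4096;  // overwritten from sysconf() below

    static void sigbus_handler(int, siginfo_t *info, void *) {
        // Round the faulting address down to its page boundary.
        uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
        void *page = reinterpret_cast<void *>(addr & ~static_cast<uintptr_t>(g_pagesize - 1));
        // Overlay an anonymous zero-filled page so the interrupted scan can
        // continue, and record that an I/O error happened.
        mmap(page, g_pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        g_io_error = 1;
    }

    void install_sigbus_handler() {
        g_pagesize = sysconf(_SC_PAGESIZE);
        struct sigaction sa = {};
        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, nullptr);
    }
    // The reading loop should then check g_io_error at least once per page
    // and handle the failure appropriately for the situation.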

  13. KWillets says:

    fread 34.9852 37.3454
    fread w sbuffer 33.3594 37.9046
    fread w lbuffer 33.7706 38.4986
    read2 54.0629 27.0563
    mmap 35.7301 50.655
    fancy mmap 36.2806 50.3316
    Cpp 41.2026 39.5535

    I got differing results on CentOS 5.2. I suspect it’s misreporting CPU times, as I would see large variations in CPU-based throughput, but similar wallclock speeds run-to-run.

    There were a few questionable things in the loop, like a vector that isn’t used, so I took it out and saw minimal improvement. Changing to MAP_SHARED had about a 10% positive effect.

    Linus made some comments here: http://lkml.indiana.edu/hypermail/linux/kernel/0004.0/0728.html

    Single map-and-scan is probably the worst scenario :(.

  14. Neoh says:

    I have a laptop with an Intel i7 and GCC 4.7 too, on Fedora Linux:

    fread 92.2916 91.798
    fread w sbuffer 75.9051 75.531
    fread w lbuffer 84.1882 83.7542
    read2 42.3798 42.2044
    mmap 99.2518 67.327
    fancy mmap 90.0927 88.8752
    mmap (shared) 89.4623 88.51
    fancy mmap (shared) 101.197 100.393
    Cpp 95.7135 95.232

  15. vicaya says:

    I/O is highly kernel and device dependent. People should post kernel versions (mmap(2) has different code paths for readahead than read(2)) and disk models (mostly because they affect which device drivers get picked up), besides CPU and compiler.

  16. Cartesius says:

    Ubuntu 12.04, 3.2.0-26-generic, ext4

    fread 91.1748 90.8086
    fread w sbuffer 93.3305 93.0044
    fread w lbuffer 94.3807 94.1626
    read2 55.2302 55.0486
    mmap 109.469 108.818
    fancy mmap 108.408 107.583
    mmap (shared) 109.469 108.861
    fancy mmap (shared) 108.233 107.534
    Cpp 100.909 100.607

  17. Cartesius says:

    Well, I’ve tuned your read(2) implementation a little bit, and here are my final numbers for the above-mentioned architecture. read(2) with a __properly__ sized buffer cannot be slower than fread.

    fread 94.3807 94.0505
    fread w sbuffer 94.3807 94.098
    fread w lbuffer 94.5136 94.3137
    read2 100.607 100.331
    mmap 107.712 107.07
    fancy mmap 106.515 105.687
    mmap (shared) 107.54 107.07
    fancy mmap (shared) 106.685 105.791
    Cpp 95.4547 95.1243

  18. Cartesius says:

    Hi Daniel,
    sorry for the three posts, but after another round of tuning I finally have the expected results: read(2) and mmap(2) provide pretty comparable results for this specific task. It has been said that mmap is much better suited for repeated and random reading.
    Here are my tuned results: 🙂

    fread 93.3305 93.0661
    fread w sbuffer 93.5909 93.226
    fread w lbuffer 94.116 93.8494
    read2 105.18 104.814
    mmap 105.345 104.869
    fancy mmap 104.687 104.045
    mmap (shared) 105.51 104.924
    fancy mmap (shared) 104.687 104.047
    Cpp 100.456 100.151

  19. maxime caron says:

    Cartesius: What kind of tuning did you do? Could you please share the information with us!

  20. Cartesius says:

    Maxime: In testread:
    1. I changed the for loop into a while loop and changed the != sizeof(…) conditions, which I believe are incorrect because the read(2) syscall can also give you a partial result: definitely on a network socket, maybe also on a block device.
    2. I removed the first read(2) in the loop, which slowed down the whole computation.
    3. I read the block size as part of a bigger chunk of data.
    4. All reads are performed with a fixed IO buffer of, say, 64 kB, with no repeated vector.resize calls.

    And as I’ve said, read(2), by definition, cannot be any slower for this kind of scenario (see the sketch below).

    And the results prove it.
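
    For illustration, here is a minimal sketch of such a read(2) loop (hypothetical code, not Cartesius’s actual patch): a fixed 64 kB buffer, a single loop, and explicit handling of partial reads. It treats the file as a flat sequence of 32-bit integers rather than reproducing the benchmark’s exact block format:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <fcntl.h>
    #include <unistd.h>

    uint64_t sum_file(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 0; }
        static char buffer[64 * 1024];  // fixed IO buffer, never resized
        uint64_t sum = 0;
        size_t leftover = 0;            // bytes carried over from a partial read
        for (;;) {
            ssize_t n = read(fd, buffer + leftover, sizeof(buffer) - leftover);
            if (n < 0) { perror("read"); break; }  // real code would retry on EINTR
            if (n == 0) break;                     // end of file
            size_t avail = leftover + static_cast<size_t>(n);
            size_t usable = avail - (avail % sizeof(uint32_t));
            for (size_t i = 0; i < usable; i += sizeof(uint32_t)) {
                uint32_t v;
                memcpy(&v, buffer + i, sizeof(v));
                sum += v;
            }
            leftover = avail - usable;
            if (leftover) memmove(buffer, buffer + usable, leftover);
        }
        close(fd);
        return sum;
    }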

  21. As I’m learning Go at the moment I converted the code here:

    https://gist.github.com/3172562

    Interestingly, it runs at almost identical speed to the “basic sum (C++-like)” version, using gcc 4.6.3 and g++ -funroll-loops -O3 -o cumuls cumuls.cpp.

    Go’s compiler isn’t particularly well optimised at the moment but I thought it did OK here.

  22. Cartesius says:

    Nick:
    Good to know. But I guess this particularly simple scenario isn’t a tough job for the compiler, because I assume that Go uses many well-implemented library functions, e.g. for I/O.

  23. Actually I rather stupidly posted that comment on the wrong blog post so please ignore!

  24. jg says:


    Intel pentium dual-core t2390
    debian 6.0.5
    gcc 4.4.5

    fread 12.7745 12.7489
    fread w sbuffer 12.8726 12.798
    fread w lbuffer 12.7382 12.5046
    read2 10.6465 10.5533
    mmap 15.8191 15.7597
    fancy mmap 15.849 15.7929
    mmap (shared) 15.7931 15.4621
    fancy mmap (shared) 15.6933 15.6158
    Cpp 11.1488 11.0111

    fread 12.8775 12.8624
    fread w sbuffer 12.8676 12.8215
    fread w lbuffer 12.6327 12.1366
    read2 10.5345 10.4342
    mmap 15.8602 15.8455
    fancy mmap 15.8228 15.7794
    mmap (shared) 15.6239 14.9385
    fancy mmap (shared) 15.5227 15.3789
    Cpp 11.1174 10.9043

  25. Jacek says:

    When reading from a file with your own buffering, you’ve got 3 levels of buffering overall:
    1. Your buffer.
    2. The glibc buffer for files.
    3. The kernel buffers managing pages.
    And then you have the mmap function, which lets you read directly from the pages, avoiding the glibc buffering.
    Do you really believe that reading directly from the low-level buffers, avoiding library and system calls plus various checks, and then reading such small amounts of data, is so slow?

    The answer lies in the bad benchmark code, which gives false results. Cartesius has already given some hints about fixing read(). Another hint would be to open all the files first and set their buffering (setvbuf) to avoid having the library allocate space. Another step would be to take the timings only after the files have been opened, and to use bigger buffers. With a small amount of data to read, the obvious winner is mmap, but the situation can change with bigger buffers and sequential reading, which could be an interesting experiment.
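
    For illustration, here is a minimal sketch of the setvbuf hint (hypothetical code; the file name is a placeholder). The user-supplied buffer must be installed right after fopen, before any read:

    #include <cstdio>

    int main() {
        static char iobuf[1 << 16];          // 64 kB buffer supplied by the caller
        FILE *fp = fopen("data.bin", "rb");  // placeholder file name
        if (!fp) { perror("fopen"); return 1; }
        // Install our own fully buffered stream buffer so the library does
        // not allocate one lazily on the first read.
        setvbuf(fp, iobuf, _IOFBF, sizeof(iobuf));
        char chunk[4096];
        size_t total = 0, n;
        while ((n = fread(chunk, 1, sizeof(chunk), fp)) > 0)
            total += n;
        printf("read %zu bytes\n", total);
        fclose(fp);
        return 0;
    }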

  26. @Jacek

    Do you really believe that reading directly from the low-level buffers, avoiding library and system calls plus various checks, and then reading such small amounts of data, is so slow? (…) The answer lies in the bad benchmark code, which gives false results.

    It is one thing to claim that the benchmark is faulty, it is another to propose a better one. The latter action is much more useful.

  27. Gary says:

    Thanks for the code. I ported this to Windows and tested on Windows 8, getting the following on a Lenovo T440 with an i5-4300U CPU, 2.494 GHz, 2 cores, 4 logical processors, and a LITEONIT LCS-256M6S drive:

    fread 12.6044 9.21667
    fread w sbuffer 13.7111 9.64173
    fread w lbuffer 13.9331 11.7859
    read2 28.9731 21.0076
    mmap 30.459 21.5386
    Cpp 13.5171 9.56955

    This is with read2 modified as Cartesius described and reading in 4K blocks. So there is not much benefit of mmap over read2.

    We have a product that is crashing because of memory-mapped IO; it’s running out of memory. If it were a buffered read instead, then virtual memory would allow the product to continue to run without issue. Because it is a rather large product, there are numerous places in the code with pointer arithmetic, making the change away from memory-mapped IO time consuming. It sure would be nice to have a class that allows you to easily turn memory-mapped IO on and off. 🙂

    1. Would you share your Visual Studio port?

  28. arthur says:

    If this is still interesting: some time ago, I was interested in finding out how much shuffling data around between buffers affects the different standard C functions (fgetc, fgets, fread). I found that fread with a bigger buffer (around 16k, 4×BUFSIZ on my machine) gave the best results. I didn’t compare with memory-mapped files since I was just interested in seeing how much overhead there is between the user program and the library buffers. I am not sure how good my benchmark is, I am just an amateur, but the post with code can be seen at: http://www.nextpoint.se/?p=540

  29. Uni says:

    An interesting fact: the Intel compiler can boost the speed of mmap greatly.

    On my i7-3770 machine with Ubuntu 17.10, using gcc 7.2.0, the best result is:
    fread 119.978 119.982
    fread w sbuffer 120.694 120.697
    fread w lbuffer 120.729 120.733
    read2 46.7633 46.7649
    mmap 223.929 223.932
    fancy mmap 223.969 223.972
    mmap (shared) 223.779 223.783
    fancy mmap (shared) 224.043 224.048
    Cpp 145.988 145.99

    Using Intel C++ compiler 18.0.1, the result is:
    fread 88.4261 88.4285
    fread w sbuffer 87.6551 87.6574
    fread w lbuffer 87.7999 87.8023
    read2 45.4239 45.4254
    mmap 790.271 790.232
    fancy mmap 790.274 790.274
    mmap (shared) 790.195 790.197
    fancy mmap (shared) 791.301 791.208
    Cpp 148.785 148.789

    Yes, mmap gained a nearly 4x speedup! I have verified this on two other machines; they also reported at least a 3x speedup.

  30. P Wilson says:

    So I have a lot of experience with mmap, and unfortunately how fast it works is highly dependent on the OS and what bells and whistles you have installed. Years ago, Digital Unix had a well-deserved reputation for having the fastest mmap. It was (and no doubt still is) just blazing fast. In fact, Unix in general tends to have high-quality, fast mmap implementations. When I moved to the Linux world, it was like being kicked by a horse: memory mapping was either poorly implemented or just plain missing! It has slowly improved over the years but still leaves a lot to be desired on many versions of Linux. Memory mapping on Windows has generally been okay, but there again they have been working on it and it has improved over time. I wish I had numbers from a good Digital Unix machine to blow your minds with, but alas I don’t… sorry.

  31. Tarek says:

    Results for Nvidia Jetson AGX Xavier:

    fread 26.994 26.900
    fread w sbuffer 22.935 22.842
    fread w lbuffer 22.966 22.877
    read2 20.435 20.363
    mmap 198.547 196.867
    fancy mmap 195.653 193.653
    mmap (shared) 192.842 191.358
    fancy mmap (shared) 195.653 193.608
    Cpp 27.730 27.624

  32. Konstantin says:
  33. Robert Wishlaw says:

    Fedora 39
    gcc version 13.2.1 20231011 (Red Hat 13.2.1-4)
    Ryzen 9 5950x

    fread 153.758 153.323
    fread w sbuffer 159.076 158.646
    fread w lbuffer 159.097 158.67
    read2 61.7936 61.6302
    mmap 387.872 386.8
    fancy mmap 384.261 383.197
    mmap (shared) 387.046 385.94
    fancy mmap (shared) 383.397 382.336
    Cpp 172.464 171.993

    fread 156.403 155.976
    fread w sbuffer 158.51 158.086
    fread w lbuffer 159.517 159.092
    read2 61.7716 61.6078
    mmap 388.036 386.946
    fancy mmap 384.697 383.622
    mmap (shared) 388.152 387.082
    fancy mmap (shared) 383.987 382.92
    Cpp 172.267 171.796