Results on my Core i7 laptop don’t match up:
fread 52.8416 63.133
fread w sbuffer 53.9027 64.6808
fread w lbuffer 55.4619 63.2864
read2 73.746 49.1903
mmap 78.9516 84.0752
Cpp 54.5601 60.8912
(so mmap actually turns out to be fastest)
When I add MAP_POPULATE so as to prefault the pages, mmap gets even better:
fread 49.8951 58.354
fread w sbuffer 50.2688 60.7751
fread w lbuffer 52.6344 62.8038
read2 65.793 48.9292
mmap 106.522 106.855
Cpp 47.5949 59.6341
But your point stands that it’s worth benchmarking these things.
-Todd
Todd, did you mean “mmap gets even WORSE”? This is very strange, because in all the tests I have heard about, mmap beats everything else by a wide margin, assuming you test with files “warmed up” and cached by the OS.
No, mmap gets better – higher numbers are better here, unless I’m misreading the benchmark code.
I don’t really get why, but I also have an Intel Core i7, and, like Todd, I find mmap to be the fastest on my machine. Here are the results:
fread 79.1329 82.2618
fread w sbuffer 81.6359 82.9706
fread w lbuffer 78.7614 82.0594
read2 73.0988 48.8926
mmap 93.9841 94.7367
Cpp 86.4751 79.8615
I ran it on a Linux box running debian-testing. Also, as with Todd, adding MAP_POPULATE makes mmap quite a bit faster:
fread 85.8116 82.6478
fread w sbuffer 79.5079 82.5918
fread w lbuffer 82.6412 79.6012
read2 70.0466 46.6896
mmap 110.734 125.265
Cpp 82.4382 76.3553
(and as you can see there’s quite a bit of variation from one run to the next).
-Pierre
That is correct, the little program reports the speed, so higher numbers are better.
I don’t get better speed with mmap on any of my machines, but if you read the last paragraph of my blog post, I had expected people to get vastly different results.
Unfortunately, IO is difficult to benchmark reliably.
I have changed my program to use MAP_POPULATE. It does improve speed quite a bit, but even so, mmap is slower than fread on my machines.
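For anyone who wants to try the same change, here is a minimal sketch of what passing MAP_POPULATE looks like (my own illustration, not the benchmark’s exact code; MAP_POPULATE is Linux-specific and asks the kernel to prefault the pages up front):
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <file-of-ints>\n", argv[0]);
    return 1;
  }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
  /* MAP_POPULATE prefaults the pages so the scan does not take page faults. */
  int *data = (int *)mmap(NULL, st.st_size, PROT_READ,
                          MAP_PRIVATE | MAP_POPULATE, fd, 0);
  if (data == MAP_FAILED) { perror("mmap"); return 1; }
  long long sum = 0;
  size_t n = st.st_size / sizeof(int);
  for (size_t i = 0; i < n; i++) sum += data[i];
  printf("sum = %lld\n", sum);
  munmap(data, st.st_size);
  close(fd);
  return 0;
}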
The default policy of mmap() is pretty poor for reading large sequential streams because mmap() has no idea what your access pattern is going to be and conservatively straddles the fence on behavior by default. Defining the behavior and policy with madvise() to something other than default is important if performance matters.
This is one of those cases where setting madvise() to MADV_SEQUENTIAL|MADV_WILLNEED over the file should make a significant difference. In principle, mmap() with madvise() flags properly set should be as fast as any other mechanism since most other mechanisms are using something like this under the hood.
I am not sure it is apples-to-apples to modify the default buffering behavior of the fread() case but not altering the default access policy of mmap().
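To make that concrete, the advice amounts to something like the following sketch, assuming a read-only mapping called data of length bytes. Note that madvise takes one advice value per call, so the two hints are issued separately rather than OR’ed together:
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Tell the kernel we will scan the mapping sequentially and that we want
   the pages read in ahead of time. */
void advise_sequential(void *data, size_t length) {
  if (madvise(data, length, MADV_SEQUENTIAL) != 0) perror("madvise(SEQUENTIAL)");
  if (madvise(data, length, MADV_WILLNEED) != 0) perror("madvise(WILLNEED)");
}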
@Rogers
Thanks. Even with these hints, I get that mmap is significantly slower. Here is what I get on my desktop:
$ ./ioaccess
fread 130.308 122.366
fread w sbuffer 119.837 122.812
fread w lbuffer 125.437 122.767
read2 104.045 71.4784
mmap 95.8698 43.1566
fancy mmap 96.5595 77.5446
Cpp 118.777 116.532
where fancy mmap is what I get with madvise.
Of course, there are variations from run to run, but mmap is never faster in my tests.
I’m testing on a Linux desktop and a Mac laptop. I vary the GCC compiler version, for fun… but no luck. I always find that memory mapping is slower.
I should stress that another reason to worry about memory mapping is how quickly it can bring down your program. For production code, hard crashes should be a concern.
Ohhh, I see. I (as usual) confused milliseconds with millions of integers per second.
My results (on Linux/CentOS) actually do match those of Daniel, and mmap beats everything else, but the difference is small, 10-20% (with and without MAP_POPULATE).
A more interesting scenario would be to re-use the same file many times and not to re-map the data.
I’m confused about what you are trying to measure:
1) The speed of shuffling data from the buffer cache (i.e. memory) into the process’s address space?
2) Or the speed of reading data from disk with the generic I/O scheduler through different interfaces (and therefore presumably different hints to the kernel regarding expected access patterns)?
If the latter, did everyone running the benchmark flush their buffer cache first, e.g. with [1]?
[1]
echo 3 > /proc/sys/vm/drop_caches
You’re testing with a file in /tmp. I suspect there’ll be a big difference between a tmpfs and a disk-based file system.
@Tom
Because the file is so big, it is very unlikely that it resides in the buffer cache. That being said, the benchmark could be improved, which is why I posted the source code on GitHub.
@Peter
You’re testing with a file in /tmp. I suspect there’ll be a big difference between a tmpfs and a disk-based file system
On my machine, files in /tmp are on disk.
Caches, filesystem, I/O scheduler, device readahead settings (/sys/devices/…/readahead_kb), etc. all affect the results heavily.
As for stability, unless you jump outside the mmapped memory bounds, only one thing will crash you. It’s not a SIGSEGV, it’s a SIGBUS. You get this when the memory is validly mapped but can’t be accessed, for example when you get I/O errors on your disk. This can be handled via a SIGBUS signal handler: map in a page of /dev/zero over the page with the problem, set a flag, and check this flag at least once per page read. 🙂 Handle failure appropriately for your situation.
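A rough sketch of that scheme (my own illustration with hypothetical names, not Rasterman’s code; it uses MAP_ANONYMOUS in place of opening /dev/zero, which is equivalent on Linux):
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t io_error_seen = 0;  /* checked at least once per page read */
static long page_size;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)ctx;
  /* Round the faulting address down to its page boundary. */
  void *page = (void *)((uintptr_t)info->si_addr & ~((uintptr_t)page_size - 1));
  /* Plug the hole with an anonymous zero page so the scan can continue.
     (Strictly, mmap is not async-signal-safe, but this is the scheme described.) */
  if (mmap(page, page_size, PROT_READ,
           MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
    _exit(1);
  }
  io_error_seen = 1;
}

void install_sigbus_handler(void) {
  page_size = sysconf(_SC_PAGESIZE);
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = sigbus_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGBUS, &sa, NULL);
}
The handler would be installed before scanning the mapping, and the reading loop would test io_error_seen at least once per page and bail out or retry as appropriate.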
fread 34.9852 37.3454
fread w sbuffer 33.3594 37.9046
fread w lbuffer 33.7706 38.4986
read2 54.0629 27.0563
mmap 35.7301 50.655
fancy mmap 36.2806 50.3316
Cpp 41.2026 39.5535
I got differing results on CentOS 5.2. I suspect it’s misreporting CPU times, as I would see large variations in CPU-based throughput but similar wall-clock speeds from run to run.
There were a few questionable things in the loop, like a vector that isn’t used, so I took it out and saw minimal improvement. Changing to MAP_SHARED had about a 10% positive effect.
Linus made some comments here: http://lkml.indiana.edu/hypermail/linux/kernel/0004.0/0728.html
Single map-and-scan is probably the worst scenario :(.
I have a laptop with an Intel i7 and GCC 4.7 too, on Fedora Linux:
fread 92.2916 91.798
fread w sbuffer 75.9051 75.531
fread w lbuffer 84.1882 83.7542
read2 42.3798 42.2044
mmap 99.2518 67.327
fancy mmap 90.0927 88.8752
mmap (shared) 89.4623 88.51
fancy mmap (shared) 101.197 100.393
Cpp 95.7135 95.232
I/O is highly kernel and device dependent. People should post kernel versions (mmap(2) has different code paths for readahead than read(2)) and disk models (mostly because they affect the device drivers being picked up), besides CPU and compiler.
Ubuntu 12.04, 3.2.0-26-generic, ext4
fread 91.1748 90.8086
fread w sbuffer 93.3305 93.0044
fread w lbuffer 94.3807 94.1626
read2 55.2302 55.0486
mmap 109.469 108.818
fancy mmap 108.408 107.583
mmap (shared) 109.469 108.861
fancy mmap (shared) 108.233 107.534
Cpp 100.909 100.607
Well, I’ve tuned your read(2) implementation a little bit, and here are my final numbers for the above-mentioned architecture. read(2) with a properly sized buffer cannot be slower than fread.
fread 94.3807 94.0505
fread w sbuffer 94.3807 94.098
fread w lbuffer 94.5136 94.3137
read2 100.607 100.331
mmap 107.712 107.07
fancy mmap 106.515 105.687
mmap (shared) 107.54 107.07
fancy mmap (shared) 106.685 105.791
Cpp 95.4547 95.1243
Hi Daniel,
sorry for three posts, but after another round of tuning I finally have the expected results: read(2) and mmap(2) provide pretty comparable results for this specific task. It has been said that mmap is much better suited for repeated and random reading.
Here are my tuned results: 🙂
fread 93.3305 93.0661
fread w sbuffer 93.5909 93.226
fread w lbuffer 94.116 93.8494
read2 105.18 104.814
mmap 105.345 104.869
fancy mmap 104.687 104.045
mmap (shared) 105.51 104.924
fancy mmap (shared) 104.687 104.047
Cpp 100.456 100.151
Cartesius: What kind of tuning did you do? Could you please share the details with us?
Maxime: In testread:
1. I changed the for-loop into a while-loop and fixed the != sizeof(…) conditions, which I believe are not correct because the read(2) syscall can also return a partial result: definitely on a network socket, and maybe also on a block device.
2. I removed the first read(2) in the loop, which was slowing down the whole computation.
3. I read bigger chunks of data at a time (a larger block size).
4. All reads are performed into a fixed I/O buffer of, say, 64 kB, with no repeated vector.resize calls.
And as I’ve said, read(2), by definition, cannot be any slower for this kind of scenario (a sketch of such a loop follows below).
And the results prove it.
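For reference, a read(2) loop along those lines might look like this sketch. It is my own illustration rather than Cartesius’ actual code, with one fixed 64 kB buffer and explicit handling of short reads:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

long long sum_ints_with_read(const char *path) {
  enum { BUFBYTES = 64 * 1024 };
  static int buf[BUFBYTES / sizeof(int)];  /* fixed, int-aligned I/O buffer */
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return -1; }
  long long sum = 0;
  size_t have = 0;  /* bytes in buf that have not been summed yet */
  for (;;) {
    ssize_t got = read(fd, (char *)buf + have, BUFBYTES - have);
    if (got < 0) {
      if (errno == EINTR) continue;  /* retry interrupted reads */
      perror("read");
      break;
    }
    if (got == 0) break;  /* end of file */
    have += (size_t)got;
    size_t nints = have / sizeof(int);  /* whole ints available so far */
    for (size_t i = 0; i < nints; i++) sum += buf[i];
    /* Keep any trailing partial int for the next iteration. */
    size_t rest = have - nints * sizeof(int);
    memmove(buf, (char *)buf + nints * sizeof(int), rest);
    have = rest;
  }
  close(fd);
  return sum;
}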
As I’m learning Go at the moment I converted the code here:
https://gist.github.com/3172562
Interestingly it runs at almost identical speed to the “basic sum (C++-like)” using gcc 4.6.3 and g++ -funroll-loops -O3 -o cumuls cumuls.cpp
Go’s compiler isn’t particularly well optimised at the moment but I thought it did OK here.
Nick:
Good to know. But I guess this particularly simple scenario isn’t a tough job for the compiler, because I assume Go uses many well-implemented library functions, e.g. for I/O.
Actually I rather stupidly posted that comment on the wrong blog post so please ignore!
nmap
Intel Pentium Dual-Core T2390
Debian 6.0.5
gcc 4.4.5
fread 12.7745 12.7489
fread w sbuffer 12.8726 12.798
fread w lbuffer 12.7382 12.5046
read2 10.6465 10.5533
mmap 15.8191 15.7597
fancy mmap 15.849 15.7929
mmap (shared) 15.7931 15.4621
fancy mmap (shared) 15.6933 15.6158
Cpp 11.1488 11.0111
fread 12.8775 12.8624
fread w sbuffer 12.8676 12.8215
fread w lbuffer 12.6327 12.1366
read2 10.5345 10.4342
mmap 15.8602 15.8455
fancy mmap 15.8228 15.7794
mmap (shared) 15.6239 14.9385
fancy mmap (shared) 15.5227 15.3789
Cpp 11.1174 10.9043
When reading from a file with your own buffering, you’ve got three levels of buffering overall:
1. Your own buffer.
2. The glibc stdio buffer for the file.
3. The kernel buffers managing pages (the page cache).
And then you have the mmap function, which lets you read directly from those pages, avoiding the glibc buffering.
Do you really believe that reading directly from the low-level buffers, avoiding library and system calls plus various checks, and then reading such small amounts of data, is so slow?
The answer lies in the bad benchmark code, which gives false results. Cartesius has already given some hints about fixing read(). Another hint would be to open all the files first and set their buffering (setvbuf) to avoid the library allocating space. Another step would be to take timings only after the files were opened, plus using bigger buffers. With a small amount of data to be read the obvious winner is mmap, but the situation can change with bigger buffers and sequential reading, which could be an interesting experiment.
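By way of illustration, the setvbuf suggestion boils down to handing stdio a big buffer of your own right after fopen and before the first read. A sketch with an arbitrary 1 MB size, not the benchmark’s code:
#include <stdio.h>
#include <stdlib.h>

#define BIG_BUF_SIZE (1 << 20)  /* 1 MB; an arbitrary choice for illustration */

/* Open a file and give its stream a large, caller-owned stdio buffer.
   The caller frees *buf_out after fclose(). */
FILE *open_with_big_buffer(const char *path, char **buf_out) {
  FILE *fp = fopen(path, "rb");
  if (fp == NULL) return NULL;
  char *buf = (char *)malloc(BIG_BUF_SIZE);
  if (buf == NULL || setvbuf(fp, buf, _IOFBF, BIG_BUF_SIZE) != 0) {
    free(buf);
    fclose(fp);
    return NULL;
  }
  *buf_out = buf;
  return fp;
}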
@Jacek
Do you really believe that reading directly from the low-level buffers, avoiding library and system calls plus various checks, and then reading such small amounts of data, is so slow? (…) The answer lies in the bad benchmark code, which gives false results.
It is one thing to claim that the benchmark is faulty; it is another to propose a better one. The latter is much more useful.
Thanks for the code. I ported this to Windows and tested on Windows 8, and got the following on a Lenovo T440 with an i5-4300U CPU, 2.494 GHz, 2 cores, 4 logical processors, and a LITEONIT LCS-256M6S drive:
fread 12.6044 9.21667
fread w sbuffer 13.7111 9.64173
fread w lbuffer 13.9331 11.7859
read2 28.9731 21.0076
mmap 30.459 21.5386
Cpp 13.5171 9.56955
This is with read2 modified as Cartesius described and reading in 4 KB blocks. So there is not much benefit of mmap over read2.
We have a product that is crashing because of memory-mapped IO; it’s running out of memory. If it were a buffered read instead, then virtual memory would allow the product to continue to run without issue. Because it is a rather large product, there are numerous places in the code with pointer arithmetic, making a change away from memory-mapped IO time consuming. It sure would be nice to have a class that lets you easily turn memory-mapped IO on and off. 🙂
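Something along these lines might do: a rough C++ sketch of such a class (hypothetical, POSIX-only, with the heap-backed mode reading the whole file via fread), exposing the same pointer-and-size view either way:
#include <cstdio>
#include <fcntl.h>
#include <stdexcept>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

class FileBytes {
public:
  FileBytes(const char *path, bool use_mmap) : use_mmap_(use_mmap) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); throw std::runtime_error("fstat failed"); }
    size_ = static_cast<size_t>(st.st_size);
    if (use_mmap_) {
      map_ = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
      close(fd);
      if (map_ == MAP_FAILED) throw std::runtime_error("mmap failed");
      data_ = static_cast<const char *>(map_);
    } else {
      close(fd);
      buffer_.resize(size_);
      FILE *fp = std::fopen(path, "rb");
      if (fp == nullptr || std::fread(buffer_.data(), 1, size_, fp) != size_) {
        if (fp) std::fclose(fp);
        throw std::runtime_error("fread failed");
      }
      std::fclose(fp);
      data_ = buffer_.data();
    }
  }
  ~FileBytes() {
    if (use_mmap_ && map_ != MAP_FAILED) munmap(map_, size_);
  }
  FileBytes(const FileBytes &) = delete;
  FileBytes &operator=(const FileBytes &) = delete;
  const char *data() const { return data_; }
  size_t size() const { return size_; }
private:
  bool use_mmap_;
  size_t size_ = 0;
  void *map_ = MAP_FAILED;
  std::vector<char> buffer_;
  const char *data_ = nullptr;
};
Existing pointer arithmetic keeps working either way because data() returns a plain pointer; only the flag passed at construction changes.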
Would you share your Visual Studio port?
If this is still interesting: some time ago, I was interested in finding out how much shuffling data around between buffers affects the different standard C functions (fgetc, fgets, fread). I found that fread with a bigger buffer (around 16 kB, 4x BUFSIZ on my machine) gave the best results. I didn’t compare with memory-mapped files since I was only interested in the overhead between the user program and the library buffers. I am not sure how good my benchmark is, as I am just an amateur, but the post with code can be seen at: http://www.nextpoint.se/?p=540 .
An interesting fact: the Intel compiler can boost the speed of mmap greatly.
On my i7-3770 machine with Ubuntu 17.10, using gcc 7.2.0, the best result is:
fread 119.978 119.982
fread w sbuffer 120.694 120.697
fread w lbuffer 120.729 120.733
read2 46.7633 46.7649
mmap 223.929 223.932
fancy mmap 223.969 223.972
mmap (shared) 223.779 223.783
fancy mmap (shared) 224.043 224.048
Cpp 145.988 145.99
Using Intel C++ compiler 18.0.1, the result is:
fread 88.4261 88.4285
fread w sbuffer 87.6551 87.6574
fread w lbuffer 87.7999 87.8023
read2 45.4239 45.4254
mmap 790.271 790.232
fancy mmap 790.274 790.274
mmap (shared) 790.195 790.197
fancy mmap (shared) 791.301 791.208
Cpp 148.785 148.789
Yes, mmap gained a nearly 4x speedup! I have verified this on two other machines; they also reported at least a 3x speedup.
So I have a lot of experience with mmap, and unfortunately how fast it works is highly dependent on the OS and what bells and whistles you have installed. Years ago Digital Unix had a well-deserved reputation for having the fastest mmap. It was (and no doubt still is) just blazing fast. In fact, Unix in general tends to have high-quality, fast mmap implementations. When I moved to the Linux world, it was like being kicked by a horse: memory mapping was either poorly implemented or just plain missing! It has slowly improved over the years, but still leaves a lot to be desired on many versions of Linux. Memory mapping on Windows has generally been okay, but there again they have been working on it and it has improved over time. I wish I had numbers from a good Digital Unix machine to blow your minds with, but alas I don’t… sorry.
Results for Nvidia Jetson AGX Xavier:
fread 26.994 26.900
fread w sbuffer 22.935 22.842
fread w lbuffer 22.966 22.877
read2 20.435 20.363
mmap 198.547 196.867
fancy mmap 195.653 193.653
mmap (shared) 192.842 191.358
fancy mmap (shared) 195.653 193.608
Cpp 27.730 27.624
It looks like under the hood glibc can switch you to mmap.
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofopen.c;h=965d21cd978f3acb25ca23152993d9cac9f120e3;hb=HEAD#l36
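That path is triggered by glibc’s non-standard “m” flag to fopen: opening a read-only stream with “rm” asks stdio to try mmap instead of read(2) behind the scenes, and glibc may silently fall back. A minimal illustration (the filename is made up):
#include <stdio.h>

int main(void) {
  /* "r" plus glibc's "m" hint: try to back the stream with mmap. */
  FILE *fp = fopen("data.bin", "rm");
  if (fp == NULL) { perror("fopen"); return 1; }
  long long sum = 0;
  int value;
  while (fread(&value, sizeof(value), 1, fp) == 1) sum += value;
  printf("sum = %lld\n", sum);
  fclose(fp);
  return 0;
}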
Fedora 39
gcc version 13.2.1 20231011 (Red Hat 13.2.1-4)
Ryzen 9 5950x
fread 153.758 153.323
fread w sbuffer 159.076 158.646
fread w lbuffer 159.097 158.67
read2 61.7936 61.6302
mmap 387.872 386.8
fancy mmap 384.261 383.197
mmap (shared) 387.046 385.94
fancy mmap (shared) 383.397 382.336
Cpp 172.464 171.993
fread 156.403 155.976
fread w sbuffer 158.51 158.086
fread w lbuffer 159.517 159.092
read2 61.7716 61.6078
mmap 388.036 386.946
fancy mmap 384.697 383.622
mmap (shared) 388.152 387.082
fancy mmap (shared) 383.987 382.92
Cpp 172.267 171.796