My best interpretation of the results is that they are dominated by the Linux kernel’s mmap_sem. That lock protects all address space operations (mmap, munmap, mprotect, etc.) plus a few seemingly unrelated things, and importantly it is a reader-writer lock. mmap is treated as a writer, so it is single-threaded; a page fault is treated as a reader, so pages can be faulted in on all CPUs in parallel.
In the previous benchmark you presumably ended up calling mmap with MAP_POPULATE, so the entire memory allocation was done in one large critical section. This benchmark presumably does not, so the kernel only gives you virtual memory without any physical memory backing it; your initialization loop then faults the pages in one by one.
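For concreteness, a minimal sketch of the two situations (not the benchmark’s actual code; the 256 MB size and the 4096-byte stride are just assumptions for illustration) — with MAP_POPULATE the whole range is backed inside the mmap call, without it every first write traps into the page fault handler:

#include <sys/mman.h>
#include <cstddef>

int main() {
    const std::size_t size = 256u * 1024 * 1024;   // 256 MB as an example

    // Eager: MAP_POPULATE backs the whole range inside the mmap call,
    // i.e. one long critical section under mmap_sem.
    void* eager = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    // Lazy: mmap only hands out virtual address space; each first write
    // below traps into the page fault handler, one 4 KiB page at a time.
    char* lazy = static_cast<char*>(mmap(nullptr, size, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (eager == MAP_FAILED || lazy == MAP_FAILED) return 1;

    for (std::size_t i = 0; i < size; i += 4096)
        lazy[i] = 1;

    munmap(eager, size);
    munmap(lazy, size);
}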
Transparent huge pages should be faster once each page fault handles 2 MiB at a time instead of 4 KiB. It might take a while until this kicks in, and you may get different results if the system has been running long enough for memory to fragment: if the kernel cannot find 512 contiguous small pages, it cannot create a huge page.
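If you want to nudge the kernel rather than wait for it, something along these lines should work (a sketch, assuming a Linux kernel with transparent huge pages enabled and MADV_HUGEPAGE available; the sizes are only examples):

#include <sys/mman.h>
#include <cstddef>

int main() {
    const std::size_t size = 256u * 1024 * 1024;
    const std::size_t huge = 2u * 1024 * 1024;   // 2 MiB = 512 contiguous 4 KiB pages

    char* p = static_cast<char*>(mmap(nullptr, size, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (p == MAP_FAILED) return 1;

    // Hint that this range should use transparent huge pages. The kernel can
    // still fall back to 4 KiB pages, e.g. when physical memory is too
    // fragmented to find a free 2 MiB block, or the range is not 2 MiB aligned.
    madvise(p, size, MADV_HUGEPAGE);

    for (std::size_t i = 0; i < size; i += huge)
        p[i] = 1;   // ideally one fault per 2 MiB rather than one per 4 KiB

    munmap(p, size);
}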
Back to the mmap_sem: your userspace code still looks single-threaded, but the kernel page fault handler can grab a free page and return, while another kernel thread on another CPU notices that the free page pool is running low, walks the LRU lists and creates new free pages. With careful profiling you might detect that kernel background work.
And if my theory is correct, you should get noticeably worse performance if you reboot with a single CPU and rerun the test.
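One cheap way to check the one-fault-per-page part of this theory is to count minor faults around the touch loop (a sketch using getrusage; the minor_faults helper and the 256 MB size are just for illustration, and perf stat -e minor-faults would show the same thing):

#include <sys/resource.h>
#include <cstddef>
#include <cstdio>

static long minor_faults() {
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main() {
    const std::size_t size = 256u * 1024 * 1024;
    char* p = new char[size];          // no MAP_POPULATE behind this

    long before = minor_faults();
    for (std::size_t i = 0; i < size; i += 4096)
        p[i] = 1;                      // first touch of every page
    long after = minor_faults();

    // ~65536 faults would confirm one trap per 4 KiB page; a much smaller
    // number would mean huge pages (or already-faulted memory) are in play.
    std::printf("minor faults during touch loop: %ld\n", after - before);
    delete[] p;
}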
Wilco says:
The difference is simply because of the huge overhead of trapping into the kernel to allocate and zero a page (and that overhead has increased due to all the security issues).
With huge pages the number of page faults drops dramatically, so we’re limited by the speed of memset. To improve performance, various systems use a larger default page size: e.g. iOS uses 16 KB, and some AArch64 servers use 64 KB.
Note that pages are always cleared on a page fault in Linux (even huge pages); unlike Windows, there is no background process that clears pages.
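To put rough numbers on that, a trivial sketch that queries the configured base page size and does the fault-count arithmetic for a 256 MB buffer (the 256 MB figure is just an example):

#include <unistd.h>
#include <cstdio>

int main() {
    long page = sysconf(_SC_PAGESIZE);         // 4096 on typical x86-64 Linux
    long long bytes = 256LL * 1024 * 1024;     // 256 MB
    std::printf("base page size: %ld bytes\n", page);
    std::printf("faults for 256 MB at base page size: %lld\n", bytes / page);
    std::printf("faults for 256 MB with 2 MiB huge pages: %lld\n",
                bytes / (2LL * 1024 * 1024));
    std::printf("faults for 256 MB with a 64 KB base page: %lld\n",
                bytes / (64LL * 1024));
    return 0;
}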
Thank you for your nice article.
I performed some further experiments by tuning malloc parameters: https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html
My baseline:
65536 pages 256 MB calloc 92.287 ms 2.71 GB/s
65536 pages 256 MB new_and_touch 87.187 ms 2.87 GB/s
65536 pages 256 MB new_and_memset 106.538 ms 2.35 GB/s
65536 pages 256 MB new_and_value_init 188.074 ms 1.33 GB/s
65536 pages 256 MB new_and_value_init_nothrow 191.691 ms 1.30 GB/s
65536 pages 256 MB memset_existing_allocation 22.110 ms 11.31 GB/s
65536 pages 256 MB mempy_into_existing_allocation 32.831 ms 7.61 GB/s
65536 pages 256 MB mmap_populate 52.721 ms 4.74 GB/s
With mallopt(M_MMAP_MAX, 0), the results for 256 MB changed:
65536 pages 256 MB calloc 92.635 ms 2.70 GB/s
65536 pages 256 MB new_and_touch 0.876 ms 285.44 GB/s
65536 pages 256 MB new_and_memset 22.872 ms 10.93 GB/s
65536 pages 256 MB new_and_value_init 186.411 ms 1.34 GB/s
65536 pages 256 MB new_and_value_init_nothrow 189.933 ms 1.32 GB/s
65536 pages 256 MB memset_existing_allocation 23.459 ms 10.66 GB/s
65536 pages 256 MB mempy_into_existing_allocation 33.645 ms 7.43 GB/s
65536 pages 256 MB mmap_populate 56.151 ms 4.45 GB/s
Measurement values for larger allocations don’t show this difference.
I wonder if there is some temporal influence between the different allocation methods.
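For reference, the tunable would have been applied roughly like this (a sketch, not the benchmark’s actual code). M_MMAP_MAX=0 tells glibc never to satisfy large allocations with mmap, so a freed 256 MB block can stay on the brk heap and a later new may hand back pages that are already faulted in, which would fit both the suspiciously fast new_and_touch number and the temporal influence you suspect; whether the block really stays resident also depends on M_TRIM_THRESHOLD and the heap layout.

#include <malloc.h>    // glibc-specific: mallopt, M_MMAP_MAX
#include <cstring>
#include <cstddef>

int main() {
    // 0 means: never satisfy allocations via mmap; everything comes from the
    // brk-managed heap instead of a fresh anonymous mapping per allocation.
    mallopt(M_MMAP_MAX, 0);

    const std::size_t size = 256u * 1024 * 1024;

    char* a = new char[size];
    std::memset(a, 1, size);     // pages are faulted in here, once
    delete[] a;                  // the heap memory may stay resident in the process

    char* b = new char[size];    // plausibly the same heap region, already backed
    b[0] = 1;                    // touching it would then cause no page faults
    delete[] b;
    return 0;
}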
F R says:
If you compare the code generated by the compiler at -O2 vs -O3 (https://godbolt.org/z/t2ifR7) you can see why -O2 lags a bit in the speed department:
-O2: Writes one byte element at a time
-O3: Writes a qword worth of elements at a time (similar to what an optimized memset() would do)
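Presumably the loop in question is a plain byte fill along these lines (a sketch, not the article’s exact code; zero_fill is just an illustrative name), and the difference comes from whether the compiler recognizes it as a memset idiom, presumably via -ftree-loop-distribute-patterns, which older GCC only enables at -O3:

#include <cstddef>

// At -O2, GCC 8/9 compile this literally: one byte store per iteration.
// At -O3 (and reportedly at -O2 on GCC 10 trunk) the loop is turned into
// the equivalent of memset, writing many bytes per iteration.
void zero_fill(unsigned char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        p[i] = 0;
}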
Wilco says:
Yes, that’s a GCC bug: GCC 9.2 still gets it wrong, but it works as expected on GCC 10 trunk. It’s never a good idea to emit a simple byte loop.
Theophilos Pisokas says:
I have trouble reproducing the results. I am using Ubuntu 18.04, Piledriver processor, GNU gcc 8.3.1.
$ ./alloc
131072 pages 512 MB new_and_touch 216.895 ms 2.31 GB/s
131072 pages 512 MB memset_existing_allocation 91.879 ms 5.44 GB/s
$ sudo ./run_with_huge_pages.sh ./alloc
131072 pages 512 MB new_and_touch 107.466 ms 4.65 GB/s
131072 pages 512 MB memset_existing_allocation 94.135 ms 5.31 GB/s
I got similar results (i.e. no speed increase) with gcc 7.4.0, and on a Kaby Lake with Fedora 29 and gcc 8.3.1. Could you please help me out on this?
I don’t think it is possible for memset to be that slow. Something is off in your results.