What about an actual huge page allocated with mmap instead of operator new? And what’s the allocator under test in your table?
And what’s the allocator under test in your table?
I am using GNU GCC 8.3 under Ubuntu 16.04.6.
In theory, a malloc library could cheat for calloc() and allocate the same read-only zero 4k page for every page in the region. And then still defer allocation for the first real write.
Not a theory. OS X has deferred zeroing of each page for as long as I can remember: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MemoryAlloc.html
I have a Mac, and if it did what I think Wayne means, it could achieve seemingly impossible speeds on my benchmark… yet it is no faster than my Linux box.
That is why I said calloc() instead of a C++ constructor for a char. I am not surprised that libstdc++ doesn’t specialize initialization to call calloc() in this case.
Yes, it could. My benchmark should be viewed as a lower bound. It is yet possible that the system is cheating in all sorts of fun ways.
Possibly it can be faster if allocation is done the normal way, without initialisation, and later you just “touch” each page (4K or 2M). Touching can also be parallelised, which improves performance (I tested that some time ago).
You are correct, allocating and then touching is faster in my tests.
It is much simpler to allocate memory on the stack instead of on the heap. I would expect it to be much faster, since you are just bumping a stack pointer. You will have to increase the maximum stack size, though.
Good luck allocating 512 MB on the stack (I doubt it would be allowed out of the box on Linux, but I’d love to be proven wrong…)!
These are two other possibilities that you could have tested (I was expecting them, in fact), and that come closest to each other:
// malloc + memset (7.7 GB/s)
char *buf1 = (char*)malloc(s);
memset(buf1, 0, s);
and:
// new char[s] + memset (9.4 GB/s)
char *buf1 = new char[s];
memset(buf1, 0, s);
(It is difficult to outperform calloc because the zeroing will be done by the kernel, I guess.)
Is this a reason why Redis recommends turning off THP?
Cleaned-up code via a GitHub pull request, with comments here:
https://www.reddit.com/r/cpp/comments/eoq6ly/how_fast_can_you_allocate_a_large_block_of_memory/fegce62/
Why not interleave allocation and initialization?
E.g., you request a large block, but the allocator call (you’d need to write a custom allocator) would block only on the first page, not on the entire request.
E.g., in the background you would rely on the overcommit behaviour of the default allocator, then start asynchronous initialisation of the pages and grant access to each as it completes.
Alternatively if you know your memory usage pattern an arena allocator could work faster here for you (think zero on delete).
@Ben
The pages have to come from the operating system. If you are getting pages at 3 GB/s and your program needs 30 GB of memory, it is going to take 10 seconds. Writing a custom allocator is not going to solve this problem.
I understand; I think I misunderstood the way the OS releases the pages.
Linux supports page allocation without erasing previous contents as a kernel config somewhere. It would be interesting to compare allocation speed with and without it.
Oh, it’s only for no-MMU systems:
https://cateee.net/lkddb/web-lkddb/MMAP_ALLOW_UNINITIALIZED.html
You are probably not getting huge pages, or only a few of them.
To get huge pages you have to jump through more hoops, e.g., allocate on a 2 MiB boundary and do madvise on the memory before touching it. You can use mmap directly or one of the aligned allocators.
Better, you can check whether you got huge pages: I wrote page-info to do that, and integrating it is fairly easy (you can find integrations in some of our shared projects, including how to jump through the aforementioned hoops). Note that it only works on Linux.
For this and similar constructions, like std::string* p = new std::string[10];, the compiler just generates code to call memset on the whole area before the actual construction (ctor call) of each object.
Maybe you refer to this C++ construction (link to the source code accompanying the post):
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/9f197e799e3481b523f1c13943d54c2bdccb1881/2020/01/14/alloc.cpp#L124