What about an actual huge page allocated with mmap instead of operator new? And what’s the allocator under test in your table?
And what’s the allocator under test in your table?
I am using GNU GCC 8.3 under Ubuntu 16.04.6.
In theory, a malloc library could cheat for calloc() and allocate the same read-only zero 4k page for every page in the region. And then still defer allocation for the first real write.
Not a theory. OS X has deferred zeroing of each page for as long as I can remember: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MemoryAlloc.html
I have a Mac, and if it did what I think Wayne means, it could achieve seemingly impossible speeds on my benchmark… yet it is no faster than my Linux box.
That is why I said calloc() instead of a C++ constructor for a char. I am not surprised that libstdc++ doesn’t specialize initialization to call calloc() in this case.
Yes, it could. My benchmark should be viewed as a lower bound. It is yet possible that the system is cheating in all sorts of fun ways.
Possibly it can be faster if allocation is done the normal way, without initialisation, and later you just “touch” each page (4K or 2M). Touching can also be parallelised, which improves performance (I tested that some time ago).
You are correct, allocating and then touching is faster in my tests.
It is much simpler to allocate memory on the stack instead of on the heap. I would expect it to be much faster, since you are just bumping a stack pointer. You will have to increase the maximum stack size, though.
Good luck allocating 512 MB on the stack (I doubt it would be allowed out of the box on Linux, but I’d love to be proven wrong…)!
These are two other possibilities that you could have tested (I was expecting them, in fact), and that come closest to each other:
// malloc + memset (7.7 GB/s)
char *buf1 = (char*)malloc(s);
memset(buf1, 0, s);
and:
// new char[s] + memset (9.4 GB/s)
char *buf1 = new char[s];
memset(buf1, 0, s);
(It is difficult to outperform calloc because the zeroing will be done by the kernel, I guess.)
Is this a reason why Redis recommends turning off THP?
Cleaned-up code via a GitHub pull request, with comments here:
https://www.reddit.com/r/cpp/comments/eoq6ly/how_fast_can_you_allocate_a_large_block_of_memory/fegce62/
Why not interleave allocation and initialization?
E.g., you request a large block, but the allocator call (you’d need to write a custom allocator) would block only on the first page, not on the entire request.
E.g., in the background you would rely on the overcommit behaviour of the default allocator, then start asynchronous initialisation of the pages and grant access to each as it completes.
Alternatively if you know your memory usage pattern an arena allocator could work faster here for you (think zero on delete).
@Ben
The pages have to come from the operating system. If you are getting pages at 3 GB/s and your program needs 30 GB of memory, it is going to take 10 seconds. Writing a custom allocator is not going to solve this problem.
I understand; I think I misunderstood the way the OS releases the pages.
Linux supports page allocation without erasing previous contents as a kernel config somewhere. It would be interesting to compare allocation speed with and without it.
Oh, it’s only for no-MMU systems:
https://cateee.net/lkddb/web-lkddb/MMAP_ALLOW_UNINITIALIZED.html
You are probably not getting huge pages, or only a few of them.
To get huge pages you have to jump through more hoops, e.g., allocate on a 2 MiB boundary and do madvise on the memory before touching it. You can use mmap directly or one of the aligned allocators.
Better, you can check whether you got huge pages: I wrote page-info to do that, and integrating it is fairly easy (you can find integrations in some of our shared projects, including how to jump through the aforementioned hoops). Note that it only works on Linux.
For this and similar constructions, like std::string* p = new std::string[10];, the compiler just generates code to call memset on the whole area before the actual construction (ctor call) of each object.
Maybe you refer to this C++ construction (link to the source code accompanying the post):
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/9f197e799e3481b523f1c13943d54c2bdccb1881/2020/01/14/alloc.cpp#L124