Daniel Lemire's blog


How fast can you allocate a large block of memory in C++?

22 thoughts on “How fast can you allocate a large block of memory in C++?”

  1. Jeffrey W. Baker says:

    What about an actual huge page allocated with mmap instead of operator new? And what’s the allocator under test in your table?

    1. And what’s the allocator under test in your table?

      I am using GNU GCC 8.3 under Ubuntu 16.04.6.

  2. Wayne Scott says:

    In theory, a malloc library could cheat for calloc() and allocate the same read-only zero 4k page for every page in the region. And then still defer allocation for the first real write.

    1. Gil says:
      1. I have a Mac, and if it did what I think Wayne means, it could achieve seemingly impossible speeds on my benchmark… yet it is no faster than my Linux box.

        1. Wayne Scott says:

          That is why I said calloc() instead of a C++ constructor for a char. I am not surprised that libstdc++ doesn’t specialize initialization to call calloc() in this case.

    2. Yes, it could. My benchmark should be viewed as a lower bound. It is still possible that the system is cheating in all sorts of fun ways.
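To make the deferred cost discussed in this thread visible, here is a minimal sketch that times the calloc() call separately from the first write to each page. The names `Timing` and `time_calloc_then_touch` are made up for illustration, and the 4096-byte page size is an assumption. Whether calloc() itself returns quickly depends on the allocator and the operating system, so the numbers are only indicative.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Hypothetical result type: seconds spent in calloc() and in the first write pass.
struct Timing {
  double alloc_seconds;
  double touch_seconds;
  bool all_zero; // true if every sampled byte read back as zero before writing
};

// Times calloc(size, 1) and then a pass writing one byte per assumed 4 KiB page.
Timing time_calloc_then_touch(std::size_t size) {
  using clock = std::chrono::steady_clock;
  const std::size_t page = 4096; // assumption: 4 KiB pages

  auto t0 = clock::now();
  char *buf = static_cast<char *>(std::calloc(size, 1));
  auto t1 = clock::now();

  Timing result{std::chrono::duration<double>(t1 - t0).count(), 0.0, false};
  if (buf == nullptr) return result;

  // volatile keeps the compiler from removing the reads and writes.
  volatile char *v = buf;
  bool zero = true;
  for (std::size_t i = 0; i < size; i += page) {
    zero = zero && (v[i] == 0); // calloc guarantees zeroed memory
    v[i] = 1;                   // first write: the page must really exist now
  }
  auto t2 = clock::now();

  result.touch_seconds = std::chrono::duration<double>(t2 - t1).count();
  result.all_zero = zero;
  std::free(buf);
  return result;
}
```

If the system defers the work, `touch_seconds` should dominate `alloc_seconds` for large sizes; if the two are similar, the zeroing was probably done up front.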

  3. Franek Korta says:

    Possibly it can be faster if the allocation is done the normal way, without initialisation, and you later just “touch” each page (4K or 2M). Touching can also be parallelised, which improves performance (I tested that some time ago).

    1. You are correct, allocating and then touching is faster in my tests.
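The allocate-then-touch idea can be sketched as follows. This is not the benchmark code from the post: `touch_pages` and `allocate_and_touch` are hypothetical helpers, the page size is assumed to be 4096 bytes, and the thread count is left to the caller. Older toolchains may need `-pthread` when linking.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Assumed page size; a real program could query sysconf(_SC_PAGESIZE) instead.
constexpr std::size_t kPageSize = 4096;

// Writes one byte to every page of [buf, buf + size), splitting the pages
// into contiguous ranges, one range per worker thread.
void touch_pages(char *buf, std::size_t size, unsigned num_threads) {
  if (num_threads == 0) num_threads = 1;
  const std::size_t pages = (size + kPageSize - 1) / kPageSize;
  const std::size_t per_thread = (pages + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    const std::size_t first = std::min(pages, t * per_thread);
    const std::size_t last = std::min(pages, first + per_thread);
    if (first == last) continue; // nothing left for this worker
    workers.emplace_back([=] {
      for (std::size_t p = first; p < last; ++p) buf[p * kPageSize] = 0;
    });
  }
  for (auto &w : workers) w.join();
}

// Allocates without initialization, then touches every page.
std::unique_ptr<char[]> allocate_and_touch(std::size_t size, unsigned num_threads) {
  std::unique_ptr<char[]> buf(new char[size]); // no "()", so no zeroing here
  touch_pages(buf.get(), size, num_threads);
  return buf;
}
```

Only the first byte of each page is written, so the rest of the buffer keeps whatever the allocator handed back; the point is only to force the pages to be mapped.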

  4. Anon says:

    It is much more basic to allocate memory on the stack instead of the heap. I would expect it to be much faster to allocate since you are just bumping a stack pointer. You will have to increase the maximum stack size though.

    1. jbn says:

      Good luck allocating 512 MB on the stack (I doubt it would be allowed out of the box on Linux, but I’d love to be proven wrong…)!
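One way to check this on a POSIX system is to ask for the current stack limit before attempting a large stack allocation. The sketch below uses getrlimit() with RLIMIT_STACK; the helper name `stack_limit_allows` is made up for illustration, and the check is optimistic because the stack also has to hold every other frame.

```cpp
#include <sys/resource.h> // getrlimit, RLIMIT_STACK (POSIX)
#include <cstddef>

// Returns true if the current soft stack limit is at least `bytes`.
// This is a necessary condition only: part of the stack is already in use.
bool stack_limit_allows(std::size_t bytes) {
  rlimit lim{};
  if (getrlimit(RLIMIT_STACK, &lim) != 0) return false; // could not query
  if (lim.rlim_cur == RLIM_INFINITY) return true;       // no soft limit set
  return lim.rlim_cur >= bytes;
}
```

Calling `stack_limit_allows(512u << 20)` tells you whether a 512 MB stack buffer is even plausible; if it is not, the limit would have to be raised first, for example with `ulimit -s` in the launching shell.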

  5. Dato says:

    These are two other possibilities that you could have tested (I was expecting them, in fact), and that come closest to each other:

    // malloc + memset (7.7 GB/s)
    char *buf1 = (char*)malloc(s);
    memset(buf1, 0, s);

    and:

    // new char[s] + memset (9.4 GB/s)
    char *buf1 = new char[s];
    memset(buf1, 0, s);

    (It is difficult to outperform calloc because the zeroing will be done by the kernel, I guess.)

  6. A Panicek says:

    Is this a reason why Redis recommends turning off THP (transparent huge pages)?

  7. Oliver Schönrock says:
  8. Ben says:

    Why not interleave allocation and initialization?
    E.g. you request a large block, but the allocator call (you’d need to write a custom allocator) would block on the first page, not the entire request.
    E.g. in the background you would use the overcommit usage of the default allocator, then start async init of the pages and return access as each is completed.

    Alternatively if you know your memory usage pattern an arena allocator could work faster here for you (think zero on delete).

    1. @Ben

      The pages have to come from the operating system. If you are getting pages at 3 GB/s and your program needs 30 GB of memory, it is going to take 10 seconds. Writing a custom allocator is not going to solve this problem.

      1. Ben says:

        I understand, I think I misunderstood the way the OS releases the pages.

  9. rsaxvc says:

    You might wonder why the memory needs to be initialized…

    Linux supports page allocation without erasing previous contents as a kernel config somewhere. It would be interesting to compare allocation speed with and without it.

    1. rsaxvc says:
  10. Travis Downs says:

    You are probably not getting huge pages, or few of them.

    To get huge pages you have to jump through more hoops, e.g., allocate to a 2 MiB boundary, and do madvise on the memory before touching it. You can use mmap directly or one of the aligned allocators.

    Better, you can check to see if you got huge pages: I wrote page-info to do that, integrating it is fairly easy (you can find integrations in some of our shared projects, including how to jump through the aforementioned hoops). Note that it only works on Linux.
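The hoops described above can be sketched roughly like this on Linux. This is not the page-info integration: `Region`, `map_huge_candidate`, and `unmap` are hypothetical names, the 2 MiB huge page size is an assumption, and a successful madvise() is only a request, so the kernel may still back the region with regular pages.

```cpp
#include <sys/mman.h> // mmap, munmap, madvise
#include <cstddef>
#include <cstdint>

constexpr std::size_t kHugePage = std::size_t(2) << 20; // assumed 2 MiB huge pages

// Hypothetical bookkeeping for one mapping.
struct Region {
  void *base;       // start of the raw mapping, needed for munmap
  std::size_t len;  // length of the raw mapping
  char *aligned;    // first 2 MiB aligned address inside the mapping
  std::size_t size; // usable bytes from `aligned`, a multiple of 2 MiB
  bool advised;     // true if madvise(MADV_HUGEPAGE) reported success
};

// Maps enough anonymous memory to carve out a 2 MiB aligned area of at least
// `bytes` bytes, and asks for huge pages before any page is touched.
Region map_huge_candidate(std::size_t bytes) {
  Region r{nullptr, 0, nullptr, 0, false};
  const std::size_t size = (bytes + kHugePage - 1) / kHugePage * kHugePage;
  const std::size_t len = size + kHugePage; // slack so we can align upward
  void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return r;
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
  const std::uintptr_t mask = static_cast<std::uintptr_t>(kHugePage) - 1;
  r.base = p;
  r.len = len;
  r.aligned = reinterpret_cast<char *>((addr + mask) & ~mask);
  r.size = size;
#ifdef MADV_HUGEPAGE
  // May fail (for example when transparent huge pages are disabled).
  r.advised = madvise(r.aligned, r.size, MADV_HUGEPAGE) == 0;
#endif
  return r;
}

// Releases the whole raw mapping.
void unmap(const Region &r) {
  if (r.base != nullptr) munmap(r.base, r.len);
}
```

The sketch does not verify that huge pages were actually used; that still requires inspecting the mapping after it has been touched, which is what a tool like page-info is for.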

  11. Yurii Hordiienko says:

    char *buf = new char[s]();

    For this and similar expressions, like std::string* p = new std::string[10](), the compiler just generates code that calls memset on the whole area before actually constructing (calling the ctor of) each object.

    1. Maybe you refer to this C++ construction (link to the source code accompanying the post):

      https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/9f197e799e3481b523f1c13943d54c2bdccb1881/2020/01/14/alloc.cpp#L124
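For readers unfamiliar with the construction, the trailing `()` in `new char[s]()` requests value-initialization, so every byte starts out as zero. The sketch below only checks that guarantee; the function name is made up for illustration, and it says nothing about how a given compiler implements the zeroing.

```cpp
#include <cstddef>
#include <memory>

// Allocates s bytes with value-initialization and counts the nonzero bytes.
// The language guarantees the count is 0.
std::size_t nonzero_after_value_init(std::size_t s) {
  std::unique_ptr<char[]> buf(new char[s]()); // "()" means zero-filled
  std::size_t count = 0;
  for (std::size_t i = 0; i < s; ++i) {
    if (buf[i] != 0) ++count;
  }
  return count;
}
```

Without the parentheses, `new char[s]` leaves the bytes indeterminate, and reading them before writing is not something a program can rely on, which is why that variant is not checked here.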