Daniel Lemire's blog


Allocating large blocks of memory: bare-metal C++ speeds

In a previous post, I benchmarked the allocation of large blocks of memory using idiomatic C++. I got a depressing result: the speed could be lower than 2 GB/s. For comparison, the disk in my laptop has greater bandwidth.

Methodologically, I benchmarked the “new” operator in C++ with initialization, using the GNU GCC compiler with the -O2 optimization flag (see note 1).

char *buf = new char[s]();

It turns out that you can do better while sticking with C++. We cannot simply call the new operator without initialization because, in general, it does not result in the memory being actually allocated: the system typically hands out physical memory only when a page is first accessed. However, we can allocate the memory and then make sure that we touch every “page” of memory. On modern Intel systems, pages are effectively always at least as large as 4kB, so it suffices to write one byte every 4kB. We might call this approach “new and touch”.

char * buf = new char[size];
for (size_t i = 0; i < size; i += 4096) buf[i] = 0;
buf[size - 1] = 0;
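To make the measurement concrete, here is a minimal sketch of how one might time the new-and-touch approach. It is not the benchmark code from the post: the function names new_and_touch and measure_gbps are hypothetical, the 4096-byte page size is an assumption, and the volatile pointer is my own addition to discourage the compiler from dropping the stores.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>

// Hypothetical helper: allocates `size` bytes without initialization, then
// writes one byte per assumed 4 kB page so that every page gets touched.
std::unique_ptr<char[]> new_and_touch(std::size_t size,
                                      std::size_t page_size = 4096) {
  std::unique_ptr<char[]> buf(new char[size]);
  if (size == 0) return buf;  // nothing to touch
  // Writing through a volatile pointer discourages the compiler from
  // eliding the stores.
  volatile char *p = buf.get();
  for (std::size_t i = 0; i < size; i += page_size) p[i] = 0;
  p[size - 1] = 0;  // make sure the last page is touched as well
  return buf;
}

// Hypothetical helper: returns the observed speed in GB/s
// (10^9 bytes per second) for one allocation of `size` bytes.
double measure_gbps(std::size_t size) {
  auto start = std::chrono::steady_clock::now();
  auto buf = new_and_touch(size);
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed = stop - start;
  return (static_cast<double>(size) / 1e9) / elapsed.count();
}
```

Calling measure_gbps(512 * 1024 * 1024) a few times and keeping the best value would give a rough figure; a single run is noisy, and the result depends on the allocator and the page configuration.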

Such a new-and-touch strategy should be close to “bare-metal” memory allocation speeds. So how fast is it? It depends on the page size. By default, most systems rely on small (4kB) pages. Allocating many small pages is expensive. Thankfully, Linux can be configured to use huge pages, transparently, via a feature called “transparent huge pages”. And it makes a huge difference!
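On Linux, the current mode can be read from /sys/kernel/mm/transparent_hugepage/enabled, which reports one of always, madvise or never. When the mode is always, the plain new-and-touch code benefits without any change. When it is madvise, the program has to ask for huge pages explicitly. The following is a rough, Linux-only sketch of that second case, not the code used for the numbers in this post: the helper name allocate_with_huge_page_hint is hypothetical, the 2 MB huge-page size is an assumption that holds on typical x86-64 systems, and it uses posix_memalign rather than the new operator to get suitably aligned memory.

```cpp
#include <cstddef>
#include <cstdlib>
#include <sys/mman.h>

// Hypothetical helper (Linux only): allocates at least `size` bytes aligned
// to an assumed 2 MB huge-page size, hints that the kernel should back the
// range with huge pages, then touches the memory.
// Returns nullptr on failure; release the result with std::free.
char *allocate_with_huge_page_hint(std::size_t size) {
  constexpr std::size_t huge_page = std::size_t(2) << 20;  // assumption: 2 MB
  // Round the length up to a multiple of the huge-page size.
  std::size_t rounded = (size + huge_page - 1) / huge_page * huge_page;
  void *p = nullptr;
  if (posix_memalign(&p, huge_page, rounded) != 0) return nullptr;
  // Advisory only: if the call fails (for example when transparent huge
  // pages are disabled), the memory remains usable with regular pages.
  madvise(p, rounded, MADV_HUGEPAGE);
  char *buf = static_cast<char *>(p);
  // Touching every 4 kB works whatever page size the kernel ends up using.
  for (std::size_t i = 0; i < rounded; i += 4096) buf[i] = 0;
  return buf;
}
```

Whether the kernel actually provides huge pages depends on the system configuration and on memory fragmentation, so the hint may have no visible effect.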

                          Allocating 512MB    Setting 512MB to zero
regular pages (4kB)       3.9 GB/s            30 GB/s
transparent huge pages    20 GB/s             30 GB/s

I am using a recent Linux system (Ubuntu 16.04), a Skylake processor and GNU GCC 8.3 with the -O2 optimization flag. My source code is available.

It is still the case that allocating memory on most systems is a non-trivial cost since they rely on small 4kB pages. There are fast disks available on the market that have more than 4 GB/s of bandwidth.

Credit: Thanks to Travis Downs and others for their insights and comments.

 

1 Downs found that we get far better performance out of the new operator with initialization under GNU GCC with the more aggressive -O3 optimization flag. Performance-wise, it should be close to the “new and touch” approach that I am describing.