Daniel Lemire's blog


On feeding your CPU with data

Can you guess the speed difference between these two lines of code? The first line of code does N additions:

for (int i = 0; i < N; i++) sum += arr[i];

The second line of code does N/16 additions:

for (int i = 0; i < N; i += 16) sum += arr[i];

A naive programmer might expect the second option to be 16 times faster. The actual answer is much more complicated and worth a closer look. I implemented a simple test, compiled with GNU GCC 4.5 and the -O2 optimization flag. We can compute the complete sum faster still with different compiler flags (-O3 -funroll-loops), and I present these results in a separate column (sum-unroll). In this last version, the compiler makes aggressive use of SSE instructions to vectorize the computation.
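For concreteness, here is a minimal, self-contained sketch of this kind of benchmark. It is not the actual code I used; the array size, the initialization, and the clock()-based timing are merely illustrative. Note that both speeds are reported over all N array positions, skipped or not, to match the table below.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum every element of the array. */
long long sum_all(const int *arr, size_t N) {
    long long sum = 0;
    for (size_t i = 0; i < N; i++) sum += arr[i];
    return sum;
}

/* Sum only one element out of every 16. */
long long sum_one_in_16(const int *arr, size_t N) {
    long long sum = 0;
    for (size_t i = 0; i < N; i += 16) sum += arr[i];
    return sum;
}

int main(void) {
    size_t N = 500 * 1000 * 1000; /* illustrative size; vary it to move in and out of cache */
    int *arr = malloc(N * sizeof(int));
    if (arr == NULL) return 1;
    for (size_t i = 0; i < N; i++) arr[i] = (int) i;

    clock_t start = clock();
    long long s1 = sum_all(arr, N);
    double t1 = (double) (clock() - start) / CLOCKS_PER_SEC;

    start = clock();
    long long s2 = sum_one_in_16(arr, N);
    double t2 = (double) (clock() - start) / CLOCKS_PER_SEC;

    /* Report speeds in millions of array positions scanned per second. */
    printf("full sum: %lld (%.0f million ints/s)\n", s1, N / t1 / 1e6);
    printf("1/16 sum: %lld (%.0f million ints/s)\n", s2, N / t2 / 1e6);
    free(arr);
    return 0;
}

Here are the results I measured: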

     N     sum  sum-unroll     1/16
   20K    1100        6700   20,000
  400K    1000        3700     5100
  500M    2100        3900     4200

The speeds are expressed in millions of integers per second. On tiny arrays, most of the data resides close to the CPU. Hence, computations are essentially CPU bound: doing fewer computations means more speed. The story with large arrays is different. There, skipping almost all of the data (15 integers out of 16) only buys you twice the speed! Moreover, once we take into account the vectorized version that the compiler produced (sum-unroll), the difference becomes almost insignificant! My source code is available as usual.
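To give a sense of what the compiler does in the sum-unroll version, here is a hand-written SSE2 sketch of the full sum, adding four 32-bit integers per instruction. It is only an illustration of the technique, not the code GCC actually generates, and for brevity it assumes N is a multiple of 4.

#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Illustrative vectorized sum: four 32-bit additions per instruction.
   Assumes N is a multiple of 4 and ignores overflow, for brevity. */
int sum_sse2(const int *arr, size_t N) {
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < N; i += 4) {
        __m128i chunk = _mm_loadu_si128((const __m128i *) (arr + i));
        acc = _mm_add_epi32(acc, chunk);
    }
    /* Reduce the four lanes of the accumulator to a single integer. */
    int lanes[4];
    _mm_storeu_si128((__m128i *) lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

On tiny, cache-resident arrays this kind of vectorization pays off handsomely; on large arrays, as the table shows, memory bandwidth becomes the limiting factor no matter how the additions are performed.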

Conclusion: We are in the big data era. Perhaps ironically, I sometimes get the impression that our processors are drinking data through a straw. Whereas a speed of 20,000 million integers per second is possible when the data is cached, I barely surpass 4,000 million integers per second when reading from RAM (roughly 16 GB/s, assuming 4-byte integers).

Source: I stole the problem from Elazar Leibovich, who posted it privately on Google+.