On feeding your CPU with data
Can you guess the speed difference between these two lines of code? The first line of code does N additions:
for (int i=0; i<N;i++) sum+=arr[i];
The second line of code does N/16 additions:
for (int i=0; i<N;i+=16) sum+=arr[i];
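For context, here is one plausible way the two loops could appear as complete functions; the array type, the 64-bit accumulator, and the function names are my assumptions, not details taken from the post:

```c
// One plausible context for the two loops (types and names are assumed,
// not taken from the original post).
long long sum_all(const int *arr, int N) {
    long long sum = 0;                              // 64-bit accumulator to avoid overflow on large arrays
    for (int i = 0; i < N; i++) sum += arr[i];      // N additions
    return sum;
}

long long sum_one_in_16(const int *arr, int N) {
    long long sum = 0;
    for (int i = 0; i < N; i += 16) sum += arr[i];  // touches only every 16th integer: about N/16 additions
    return sum;
}
```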
A naive programmer might expect the second option to be 16 times faster. The actual answer is more complicated and worth a closer look. I implemented a simple test, compiled with the GNU GCC 4.5 compiler and the -O2 optimization flag. The complete sum can be computed faster with different compiler flags (-O3 -funroll-loops), and I present these results in a separate column (sum-unroll). In this last version, the compiler makes aggressive use of SSE instructions to vectorize the problem.
N | sum | sum-unroll | 1/16 |
---|---|---|---|
20K | 1,100 | 6,700 | 20,000 |
400K | 1,000 | 3,700 | 5,100 |
500M | 2,100 | 3,900 | 4,200 |
The speeds are expressed in millions of integers per second. On tiny arrays, most of the data resides in cache, close to the CPU. Hence, the computation is essentially CPU bound: doing fewer additions means more speed. The story with large arrays is different. There, skipping almost all of the data (15 integers out of 16) only buys you about twice the speed! Moreover, once we take into account the vectorized version produced by the compiler (sum-unroll), the difference becomes almost insignificant! My source code is available as usual.
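For readers who would rather not track down that code, here is a minimal, self-contained sketch of such a benchmark; the array size, the initialization, the timing method, and the reporting format are my own choices, not the original harness:

```c
// Minimal benchmark sketch (not the original harness): times both loops over a
// heap-allocated array and reports millions of integers per second.
// Compile, for example, with: gcc -O2 bench.c -o bench   (or -O3 -funroll-loops)
// The 500M case allocates roughly 2 GB of RAM.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const int N = 500 * 1000 * 1000;                    // 500M integers, as in the last table row
    int *arr = malloc((size_t)N * sizeof(int));
    if (arr == NULL) return 1;
    for (int i = 0; i < N; i++) arr[i] = i;             // fill with something non-trivial

    long long sum = 0;
    double start = seconds_now();
    for (int i = 0; i < N; i++) sum += arr[i];          // full sum: N additions
    double full_time = seconds_now() - start;

    long long partial = 0;
    start = seconds_now();
    for (int i = 0; i < N; i += 16) partial += arr[i];  // skips 15 out of every 16 integers
    double strided_time = seconds_now() - start;

    // Print the checksums so the compiler cannot discard the loops as dead code.
    printf("sum  : %.0f million ints/s (checksum %lld)\n", N / full_time / 1e6, sum);
    printf("1/16 : %.0f million ints/s (checksum %lld)\n", N / strided_time / 1e6, partial);
    free(arr);
    return 0;
}
```

In this sketch both loops are credited with N integers per second, since the array is traversed end to end in both cases; that appears to match how the numbers in the table above are reported.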
Conclusion: We are in the big data era. Perhaps ironically, I sometimes get the impression that our processors are drinking data through a straw. Whereas a speed of 20,000 million integers per second is possible when the data is cached, I barely surpass 4,000 million integers per second when reading from RAM.
Source: I stole the problem from Elazar Leibovich who posted it privately on Google+.