Daniel Lemire's blog

, 5 min read

Is software prefetching (__builtin_prefetch) useful for performance?

Many software performance problems have to do with data access. You could have the most powerful processor in the world, if the data is not available at the right time, the computation will be delayed.

It is intuitive. We used to locate libraries close to universities. In fact, universities were built around libraries. We also aggregate smart and knowledgeable people together.

Sadly, our computers are organized with processors on one side and memory on the other. To make matters worse, memory chips are simply unable to keep up with the processing power of processors. Thus we introduce caches. These are fast memory stores that are stuck to the processor. We have a whole hierarchy of CPU caches.

Happily, processor vendors like Intel have invested a lot of effort in making sure that the right data is made available at the right time. Thus if you access data according to a predictable pattern the processor will do a good job. Also, the processor can reorder its instructions and execute a load instruction earlier if it is useful. Simply put, a lot of engineering went to make sure that data-access problems are minimized. You might think that, as a programmer, you can do better… but truth is, most times you couldn’t. At least, most of us, most of the time, couldn’t.

You can still “blind” your processor. Some expensive instructions like the division will apparently prevent processors from fetching the data in time. At the very least, they can prevent some instruction reordering. Intuitively, the processor is so busy processing a monstrous instruction that it has no time to see what is coming next.

You can also thoroughly confuse the processor by doing many different things in an interleaved manner.

In a comment to one of my earlier blog post, someone suggested that I could improve the performance of my code by using software prefetches. That is, processor vendors like Intel provide us with instructions that can serve as hints to the processor… effectively telling the processor ahead of time that some data will be needed. Using the GCC and Clang compiler, you can invoke the __builtin_prefetch intrinsic function to get this effect. I do not know if other compilers, like that of Visual Studio, have this function.

The instructions generated by this function have a cost of their own, so they can definitively make your software slower. Still, when inserted in just the right manner in the right code, they can improve the performance.

Should you be using __builtin_prefetch? My experience says that you probably shouldn’t. I don’t like hard rules, but I have not yet seen a scenario where they are obviously a good idea. That is, I claim that if software prefetching is useful, that’s probably because your code could use some work.

I asked the commenter to provide an example where software prefetching was a good idea. He offered a scenario where we sum the values of bytes within an array…

for (int j = 0; j < ...; j++) {
    counter += bytes[p];
    p = (p + 513) & (NUMBYTES - 1);
}

This code, by itself, will not benefit from software prefetching. However, the commenter interleaved the sums…

for (int j = 0; j < ...; j++) {
  for(int i = 0; i < batchsize; i++) {
    counter[i] += bytes[p[i]];
    p[i] = (p[i] + 513) & (NUMBYTES - 1);
  }
}

If you pick batchsize large enough, you can confuse the processor and get some benefits from using the __builtin_prefetch function. I suppose that it is because the access pattern looks too unpredictable to the processor.

Let us look at the numbers I get…

sequential sums 3 s
interleaved sums 4.4 s
interleaved sums with __builtin_prefetch 4.0 s

The prefetching improves the performance of the interleaved sums by 10%, but you can get much better performance simply by doing the sums one by one. The resulting code will be simpler, easier to debug, more modular and faster.

To provide evidence that __builtin_prefetch is useful, you need to take code, optimize it as much as you can… and then, only then, show that adding __builtin_prefetch improves the performance.

Software prefetching might help when the hardware prefetcher fails you. The hardware prefetecher will definitively fail you when it comes to “prefetching memory pages” but this requires a single prefetch per page and pages are never smaller than 4kB on modern hardware. You would recognize this scenario because prefetch instructions would only start helping when your data fits in many memory page, thus stressing the processor. If software prefetching helps in this case, it will mostly help when you are prefetching far ahead and in such cases, you should use very few prefetch instructions since a single software prefetch is all that is needed to load a page and you might want to avoid prefetching outside of your allocated memory region as it will waste ressources (a prefetch is not free). Furthermore, if page loading is your problem, then you should probably switch to large or huge pages as it will the whole problem go away entirely: linux has transparent huge pages which can be used for this purpose. If you think that it is your scenario, try your application with transparent huge pages under Linux… it should at least help you diagnose the problem. The hardware prefetcher might also fail you if your access pattern is unrecognized. Otherwise, the hardware prefetcher should, generally, give you near optimal performance. It is generally a well optimized component of modern processors. What if the hardware prefetcher cannot do its job, maybe because you have an unpredictable access pattern? The processor can issue several memory loads at once, but this number can be limited. On some Intel hardware, it might only be 10 memory loads in flight at the same time. This is generally a limited ressource, so, depending on the processor, you might want to organize your code so that you can always have 10 (or whatever the number might be) memory loads in flight at all time. Note that software prefetchers have no additional ressources so they won’t boost the number of requests in flight if you are already at the maximum. Sometimes, it can be difficult to reorganize the code to keep a high number of instructions in flight, so software prefetches can help, but one should first determine that it is not possible to rewrite the code to take into account memory-level parallelism.

My source code is available.

Credit: I’d like to thank Simon Hardy-Francis for his code contribution.