, 41 min read
Is software prefetching (__builtin_prefetch) useful for performance?
43 thoughts on “Is software prefetching (__builtin_prefetch) useful for performance?”
, 41 min read
43 thoughts on “Is software prefetching (__builtin_prefetch) useful for performance?”
There are lots of reasons software prefetch has become far less useful than in the past, to the point of being practically unnecessary. Some of the key ones are:
Cache hierarchy has improved dramatically in the last decade. We now have 2x64KB L1 caches, 512KB L2 caches and 4MB L3 caches with very low latencies in mobile phones! So there is a lot more data in caches and we can access it very fast.
Deep out-of-order execution makes a lot of prefetching redundant since it is already fetching data for the next few iterations of a loop before even finishing the current one. Modern CPUs support a huge number of outstanding cachemisses, around 20-50.
Modern CPU prefetchers are good. Really good. [1] Like branch predictors they can recognize and store complex patterns and adapt dynamically which is impossible for humans to match (much in the same way out-of-order execution beats manual scheduling without contest).
[1] https://www.anandtech.com/show/14072/the-samsung-galaxy-s10plus-review/5
I love your smart comment, thanks!
Modern CPUs support a huge number of outstanding cachemisses, around 20-50.
Not on a per-core basis in all commodity processors though certainly true for whole CPUs (all cores).
Page 6 of that AnandTech article shows how modern Arm processors have a measurable MLP of 20 per core (note you get a mention!) That’s not accounting for prefetching or adding L2, L3 and DRAM MLP.
Total per-core capacity for Cortex-A76 is 46 outstanding misses in L2, plus 94 in L3: https://www.anandtech.com/show/12785/arm-cortex-a76-cpu-unveiled-7nm-powerhouse/3
Yes and 20 outstanding requests is pretty damn cool.
I know I’m (more than) a little late here, but I just spent some time with this while thinking about a portable implementation of _mm_prefetch for SIMDe, so…
I do not know if other compilers, like that of Visual Studio, have this function.
ICC supports __builtin_prefetch. Many other compilers don’t, but do have similar functionality.
VS doesn’t make it easy by having one function, but on x86 there is _mm_prefetch from SSE. On ARM VS has a __prefetch(const void*), and on ARM64 it also has a __prefetch2(const void *, uint8_t prfop).
ARM C Language Extensions (ACLE) has __pld and, in 1.1+, __pldx. VS is the only ARM compiler I’m aware of targeting ARM which doesn’t support ACLE.
Oracle Developer Studio has sun_prefetch_read_once/sun_prefetch_read_many/sun_prefetch_write_once/sun_prefetch_read_many, which work on x86/x86_64 and SPARC.
IBM XL C/C++ has __prefetch_by_load and __prefetch_by_stream. Newer clang-based versions also probably support __builtin_prefetch, but I haven’t verified that.
Cray has a #pragma _CRI prefetch.
PGI has a #pragma mem prefetch.
If anyone wants to play with them I put together a macro for Hedley which should be useful.
Thank you for the outstanding comment.
Hi Daniel,
I know this post is quite old, but I had a question I haven’t seen many answers for online: we always talk about prefetching data, but what about prefetching instructions?
Aarch64 has PRFM PLI* which allows preloading of an address that is expected to be an instruction (or some number of them). The address being jumped to may not be predictable by the CPU, say when using a jump table, or for functional languages like Haskell where all (non-FFI) function calls are just jumps. This code often looks like:
calculate the address
load instructions into ABI defined registers
branch to the address
handle the returned value(s)
since in many cases the first step involves putting the address into a register, the instruction decoder doesn’t have the information to see the target of a branch, it can’t predict where the branch will go.
If however after the first step we issue a prefetch of the calculated address, then performed the loading of arguments into argument registers, we’d get a head start on making sure the code we’re about to jump to is sitting near the CPU essentially for free.
Have you looked into anything like this, or know of any resources on the topic? I was surprised to find so little online discussing it (even people saying its a bad idea). I guess I haven’t found out where the optimisation obsessed hang out on the internet yet.
Love your work, please keep it up!
Also, check these.
https://lwn.net/Articles/444336/ (The problem with prefetch)
https://lwn.net/Articles/444344/ (Software prefetching considered harmful)
There are some papers that claim prefetch improves mark-sweep GC pause times.
The OCaml and GHC Haskell compilers both use a small queue (4-8 addresses IIRC) for the next few objects they know they are going to traverse, and prefetch each object when placing the address in the queue. I believe this showed quite large performance improvements, as the memory access pattern isn’t very predictable, but it is known that each address will need to be fetched.
IMO software prefetch made sense back when compilers were expected to make up for cpu deficiencies, but cpus are now such “brainiacs” that successful software prefetch is a way to spot cpu performance bugs.
But: Stockfish (one of the best chess programs) uses prefetch for one of its hash table lookups: https://github.com/official-stockfish/Stockfish/search?utf8=%E2%9C%93&q=prefetch&type=
Stockfish is a good example. It even uses some extra code to pre-generate the hashkey of the next move (before it is made) to calculate the prefetch-address. After the prefetch, it does some more work on the current node, then makes the next move on it’s internal board (updating/calculating the hashkey again) and only then accesses the hashtable with a low hash-miss-rate.
The prefetch can increase the overall speed of a chessprogram by 2-5% (and maybe only 1-2% without pre-calculated hashkey).
This is just the “sufficiently smart compiler” argument, but applied to hardware instead of compilers…
Not being an all-seeing oracle isn’t a “bug”. Despite lazy programmers’ desire to “leave it to the hardware”, manual optimization will always be a thing.
Interesting topic. prefetch is also available on MSVC as _mm_pause.
Anecdotally, I am often disappointed by the results, and it’s worrisome that the required lookahead distance is so system-specific (perhaps can be auto-tuned).
That said, we’ve seen a few speedups, e.g. 1.1x in a memory-bound checksum. It may also help to use regular loads (ensuring compiler doesn’t optimize them out) instead of prefetches.
Generally agree with your recommendation to just optimize for the hardware prefetcher.
This is a topic that I would like to better understand, as this can influence how we try to optimise cache usage, as cache appears to be important for effective use of vector instructions.
Does L1 cache store data or memory pages (ie block of memory, say 64kb) ? A similar question for L2 and L3 cache. If cache stores active memory pages, then I think the discussion needs to address this, as a “fetch” would be for a memory page, while your loop example considers variables.
The other unknown is for multi-thread coding, how is data shared between processor caches, especially if this data is changing.
If cache stores “blocks” of memory, I think this should be more significant for pre-fetch than a counter loop.
I would very much appreciate if you can explain this relationship.
The cache works at the cache line granularity (64 bytes). Paging is a distinct issue.
My apologies, as my previous post should have been clearer when referring to “processor”, “core” and “thread” and how L1, L2 & L3 cache relate to these … A supplementary question !
On current Intel processors, I think L3 is always shared between the cores. L1 and L2 are on a per core basis. This will definitively be different in other types of processors.
A core can run 1 or more threads.
If you have many cores doing prefetches, they’ll compete for cache space and for memory requests.
I find this a very interesting topic.
My apologies if I am redirecting this post idea, but I don’t think a prefetch instruction sufficiently explains the performance and especially, how to influence improvement.
Processor data storage is described as a hierarchy of data storage types:
• Memory
• L3 cache
• L2 cache
• L1 cache
• 32-bit registers
• 64-bit registers
• Vector registers, which can be 128-bit, 256-bit or 512-bit
All these have different group sizes.
Where does a prefetch instruction fit into this mix?
When optimising data availability to vector registers, I expect this must depend on how data is moved through memory and the 3 levels of cache, in both directions.
As I said initially, I don’t understand how this works, but a better understanding might help when devising coding strategies to achieve improvement. I have achieved huge variations in performance of vector instructions, often without a clear understanding as to why.
I do think it is fortunate that we can’t interfere directly in how this is done, as I’m sure we would do a worse job.
Hello, Daniel. I disagree with the points in this article so I made an example with regular loop that becomes faster with a prefetch. The answer is long so I made a post in my blog.
Quoting from your article…
I am not sure we disagree in the end?
I made an example where we can speedup summing one by one with prefetch. My first point is that prefetch should be a compilers’s optimization, not by hand. And the other point that regular memory access is not good a example for prefetch. Later I will write about cases where it is really good and can not be improved by code refactoring.
But how would the compiler know (a) that the access pattern causes the auto prefech to be less effective, and (b) that the memory accessed is large enough to warrant the use of the prefech instruction? Not to mention (c) that batch access with prefetch is more effective… would you imagine the compiler reorganizing non-batched code into batched code?
As the person who came up with this simple example — and donated part of his Saturday morning — at the request of Daniel Lemire… I’m surprised I didn’t even receive a ‘thanks’.
Instead the ‘algorithm’ is criticized “prefetching improves the performance of the interleaved sums by 10%, but you can get much better performance simply by doing the sums one by one.” But I don’t think it’s possible to get a good example with good algorithm in only a few dozen lines of code. What was requested: “a short C program (say 30 lines of code) where __builtin_prefetch() helps”. So this criticism seems somewhat unfair.
Also, in my tests [1] I noticed a performance gain of up to 40% with a batch size of 64… which is way bigger than the 10% improvement reported in the blog post.
The test code is designed to show the effects of randomly accessing non-cached RAM using different batch sizes. It maybe that accessing RAM with a smaller batch size is generally faster — as Daniel Lemire points out — however, IMO that’s missing the point. The point is that although the test code shows a possible speed improvement between 10% and 40%, using prefetch allows the developer to execute other code while the prefetches are happening. And this can be A LOT of other code if it works on RAM which is already cached. Consider the fact that the same test code works nearly 14 times faster if operating on already cached RAM. That’s a huge amount of instructions that could be operating while waiting for the prefetch instructions which operate asynchronously to other instructions. So the pattern is (a) prefetch in a batch, (b) do some task on purely cached RAM, and (c) finally use the prefetched RAM.
Here’s the problem: You obviously will never run into an ‘abc’ situation by optimizing as Daniel Lemire suggests: “optimize it as much as you can… and then, only then, show that adding __builtin_prefetch improves the performance”. Instead you need to refactor the algorithm around prefetching to truly take advantage 🙂
[1] https://gist.github.com/simonhf/caaa33ccb87c0bf0775a863c0d6843c2
As the person who came up with this simple example — and donated part of his Saturday morning — at the request of Daniel Lemire… I’m surprised I didn’t even receive a ‘thanks’.
I sent you an email privately just now. I should have emailed you privately to discuss my upcoming blog post, I did not. For that, I apologize. The source code I link to refers to your original gist, but my code is substantially different: I kept the main idea, but I simplified the code for the purpose of producing a simple example. I appreciate your work. I kept your name out of the blog post because I did not want to personify the argument and because I am doing a different benchmark. I should have asked you whether you wanted me to use your name.
I have appended a credit section to my blog post where I thank you for the code contribution.
I’m sorry if I offended you. That was not my intention.
But I don’t think it’s possible to get a good example with good algorithm in only a few dozen lines of code.
It could be that software prefetching becomes very useful in complex code, even if it is not useful in simpler code.
Also, in my tests I noticed a performance gain of up to 40% with a batch size of 64… which is way bigger than the 10% improvement reported in the blog post.
My numbers are not very sensitive to the batch size beyond a certain point… 32, 64… it is all the same. I tried the pick the parameters so that the benefits of the prefetching are best. If you find better parameters, I’ll be glad to update my blog post.
My code is different from your code, which is partly why I did not name you in the post. I am using a simplified version of your code.
Consider the fact that the same test code works nearly 14 times faster if operating on already cached RAM. That’s a huge amount of instructions that could be operating while waiting for the prefetch instructions which operate asynchronously to other instructions.
It shows that we can be “memory bound”. However, my argument is not whether we should prefetch the data or not… you should obviously prefetch it… my argument has to do with whether you need software prefetching. I think you do not. My claim is that you can almost always rewrite your code without software prefetches in such a way that it will be at least as fast.
I’m willing to admit that there might be cases where software prefetching is useful, but I think that they are uncommon.
So the pattern is (a) prefetch in a batch, (b) do some task on purely cached RAM, and (c) finally use the prefetched RAM. Here’s the problem: You obviously will never run into an ‘abc’ situation by optimizing as Daniel Lemire suggests: “optimize it as much as you can… and then, only then, show that adding __builtin_prefetch improves the performanceâ€. Instead you need to refactor the algorithm around prefetching to truly take advantage 🙂
Your approach might well be practically sound, but I submit to you that you take it as an axiom that you need the software prefetching. This can work well when programming… but I am trying to determine whether the software prefetching is needed at all.
Hello, Daniel.
Is AVX512PF useful? It provides prefetching for random access in gather(). I wonder the improvement of AVX512PF, but I don’t have a platform to test it.
Interesting. Do you know if Intel makes processors supporting AVX512PF at this point in time?
details from https://en.wikipedia.org/wiki/AVX-512
Xeon Phi x200 (Knights Landing):[1][9] AVX-512 F, CD, ER, PF
Xeon Phi Knights Mill:[7] AVX-512 F, CD, ER, PF, 4FMAPS, 4VNNIW, VPOPCNTDQ
Skylake-SP, Skylake-X:[10][11][12] AVX-512 F, CD, VL, DQ, BW
Cannonlake:[7] AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI
Ice Lake:[7] AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES
So it is Xeon Phi-specific. I am less interested in Xeon Phi processors at this time, even though I own one such processor.
If you have code you’d like to be tested, I can run the tests for you.
FWIW I’ve seen these kind of prefetching directives used to accelerate memcpy implementations on ARM. E.g. https://android.googlesource.com/platform/bionic/+/199f9d923804d74e021dd80e48ec75c0a96dba77/libc/arch-arm/bionic/memcpy.S#50
Hi Prof. Lemire,
allow me to share one nifty etude showing that software prefetching actually boosts a REALWORLD matrix abracadabra, done by me.
Since I am fond of benchmarking I would love to see other runs on faster machines with at least 8 cores, my only testmachine ‘Compressionette’ is with 2cores/4threads i5-7200u:
https://www.overclock.net/forum/21-benchmarking-software-discussion/1678401-mike-vs-mickey-plagiarism-128-threaded-benchmark.html#post27625690
My greediness tells me the ‘Mike vs Mickey’ benchmark should go beyond 43GB/s (L3, not uncached RAM, dominates), easy, for dual channel sustems, for some reasons current record is 28.4 GB/s.
In short, without prefetching the performance is ~12 billions cells per second, with prefetching – clear 1 billion more. The main loop of one of the threads:
[/*; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -arch:CORE-AVX2 -openmp -FAcs -DKamYMM -D_N_HIGH_PRIORITY";
.B1.118::
00a30 0f 18 89 ff 0f
00 00 prefetcht0 BYTE PTR \[4095+rcx\]
00a37 49 ff c3 inc r11
00a3a 41 0f 18 8c 30
00 10 00 00 prefetcht0 BYTE PTR \[4096+r8+rsi\]
00a43 c4 c1 1d 74 04
30 vpcmpeqb ymm0, ymm12, YMMWORD PTR \[r8+rsi\]
00a49 c5 fd db 49 ff vpand ymm1, ymm0, YMMWORD PTR \[-1+rcx\]
00a4e 48 83 c1 20 add rcx, 32
00a52 c5 dd f8 d0 vpsubb ymm2, ymm4, ymm0
00a56 c5 f5 fc da vpaddb ymm3, ymm1, ymm2
00a5a c4 81 7e 7f 1c
08 vmovdqu YMMWORD PTR \[r8+r9\], ymm3
00a60 49 83 c0 20 add r8, 32
00a64 c5 25 de db vpmaxub ymm11, ymm11, ymm3
00a68 4c 3b da cmp r11, rdx
00a6b 72 c3 jb .B1.118
*/][1]
Hope, someone helps me in boosting the etude even farther, big numbers galdden my eyes.
Hi Prof. Lemire,
allow me to share one nifty etude showing that software prefetching actually boosts a REALWORLD matrix abracadabra, done by me.
Since I am fond of benchmarking I would love to see other runs on faster machines with at least 8 cores, my only testmachine ‘Compressionette’ is with 2cores/4threads i5-7200u:
URL
My greediness tells me the ‘Mike vs Mickey’ benchmark should go beyond 43GB/s (L3, not uncached RAM, dominates), easy, for dual channel sustems, for some reasons current record is 28.4 GB/s.
In short, without prefetching the performance is ~12 billions cells per second, with prefetching – clear 1 billion more. The main loop:
/*; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -arch:CORE-AVX2 -openmp -FAcs -DKamYMM -D_N_HIGH_PRIORITY";
.B1.118::
00a30 0f 18 89 ff 0f
00 00 prefetcht0 BYTE PTR [4095+rcx]
00a37 49 ff c3 inc r11
00a3a 41 0f 18 8c 30
00 10 00 00 prefetcht0 BYTE PTR [4096+r8+rsi]
00a43 c4 c1 1d 74 04
30 vpcmpeqb ymm0, ymm12, YMMWORD PTR [r8+rsi]
00a49 c5 fd db 49 ff vpand ymm1, ymm0, YMMWORD PTR [-1+rcx]
00a4e 48 83 c1 20 add rcx, 32
00a52 c5 dd f8 d0 vpsubb ymm2, ymm4, ymm0
00a56 c5 f5 fc da vpaddb ymm3, ymm1, ymm2
00a5a c4 81 7e 7f 1c
08 vmovdqu YMMWORD PTR [r8+r9], ymm3
00a60 49 83 c0 20 add r8, 32
00a64 c5 25 de db vpmaxub ymm11, ymm11, ymm3
00a68 4c 3b da cmp r11, rdx
00a6b 72 c3 jb .B1.118
*/
One of the 8 threads:
__m256i Innerloop2YMM (uint64_t ChunkToTraverseL, uint64_t ChunkToTraverseR, uint8_t *Matrix_vectorPrev, uint8_t *Matrix_vectorCurr, uint8_t *workK, __m256i YMMclone)
{
__m256i YMMprev, YMMcurr;
__m256i YMMmax = _mm256_set1_epi8(0);
__m256i YMMzero = _mm256_set1_epi8(0);
__m256i YMMsub, YMMcmp, YMMand, YMMadd;
uint64_t j;
for(j=ChunkToTraverseL; j < ChunkToTraverseR; j+=(32/1)){
#ifdef _N_ALIGNED
YMMprev = _mm256_load_si256((__m256i*)(Matrix_vectorPrev+(j-1)));
YMMcurr = _mm256_load_si256((__m256i*)&workK[j]);
#else
YMMprev = _mm256_loadu_si256((__m256i*)(Matrix_vectorPrev+(j-1)));
YMMcurr = _mm256_loadu_si256((__m256i*)&workK[j]);
_mm_prefetch((char*)(Matrix_vectorPrev+(j-1) + 64*64), _MM_HINT_T0);
_mm_prefetch((char*)(&workK[j] + 64*64), _MM_HINT_T0);
#endif
YMMcmp = _mm256_cmpeq_epi8(YMMcurr, YMMclone);
YMMand = _mm256_and_si256(YMMprev, YMMcmp);
YMMsub = _mm256_sub_epi8(YMMzero, YMMcmp);
YMMadd = _mm256_add_epi8(YMMand, YMMsub);
_mm256_storeu_si256((__m256i*)(Matrix_vectorCurr+j), YMMadd);
YMMmax = _mm256_max_epu8(YMMmax, YMMadd);
}
return YMMmax;
}
Hope, someone helps me in boosting the etude even farther, big numbers gladden my eyes.
I work in a specific domain of computer science where prefetching is a basic tool: network traffic processing in routers that have requirement of Gigabits per second per core.
The problem with the network traffic processing is that the traffic comes from so many users, that it´s completely unpredictable, from compiler point of view, which data is going to be accessed. Basically, all data lookup is always cold.
For instance, a very basic operation is to lookup the 5-tuple of a packet. That will be always cold. If you are able to compute the 5-tuple and prefecth while you continue doing other work, then you will save one data cache miss. At Gbps, one data cache miss matters.
So at least in my field, it’s very important to handle with care the data cache misses.
it’s very important to handle with care the data cache misses
That’s true in software generally. My point in this post is not that cache misses are irrelevant, but rather that if you write your code with care, you do not need explicit prefetching.
Thanks Daniel.
I think I understood your point, but the thing is that the memory is cold in HasthTbl(IP.address) for sure. The compiler will not be able to prefecth it – how would it? The compiler is not smart enough to compute the 5tuple of a packet (IP addresses + ports + TCP/UDP) and do a flow lookup on its own. You need to do that prefetching manually. No amount of well-written code will avoid this case 🙁
The prefetch is also not magic. You need to compute the hash, access the data and bring it to cache. In effect, you need to do exactly the same thing whether you use a prefetch or not. Of course, the prefetch won’t populate a register, and it plays differently in the instruction pipeline, but it is not free. It has a cost of its own.
Yes, it has a cost. Unfortunately, I can’t share the code, but I can promise you that a set of three smartly put prefetch increased a 5% the throughput that a CPU could handle, according to our tests.
I am sorry for not sharing the code 🙁
I think 5% is credible.
a 5% per CPU, with 32 CPUs, it’s more than one extra CPU 😉
Yes, but how confident are you that if I could get my hands on your code without explicit prefetch, I would be unable to optimize its memory accesses?
That is, are you certainly that your prefetch-free code is absolutely, without question, as fast as it can be?
Maybe you are, but people frequently underestimate how a little bit of code rewrite can go a long way.
I am sure it’s not the best code by any means. I am sure it can be improved, and we are continously working on it. The critical path of the application has been rewritten many times.
However, I am also confident that the cold accesses (hash table for tuple lookup, for instance), are unavoidable and only a prefetch can help there.
It is clear why out-of-cache access is unavoidable, but what is your rationale for “only a prefetch can help”? I understand that you can’t share the code, but can you elaborate, at a high level, on the algorithm involved. What do you do with the value found in the hash table?
I am not arguing that you are wrong. I am just trying to understand your point better.