I designed something similar, but with the temporary buffers holding larger batches. Only the tail of each buffer stays in cache for appending. Once a buffer fills past a watermark, it is emptied into the main array with good locality. The hardware prefetcher helps with the sequential reads from the buffers.
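A minimal sketch of this kind of buffered scatter, assuming fixed-size 8-byte items, 256 buckets, and a flush watermark of 64 items per staging buffer (all names and parameters here are illustrative, not the commenter's actual code):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NBUCKETS 256   /* assumed number of buckets */
#define FLUSH_AT  64   /* assumed watermark: flush once 64 items are staged */

typedef struct {
    uint64_t staging[FLUSH_AT]; /* small per-bucket buffer; only its tail stays hot */
    size_t count;               /* items currently staged */
    uint64_t *dest;             /* next write position in the main array */
} bucket_t;

/* Append one item; once the staging buffer reaches the watermark,
   empty it into the main array with one long, sequential copy. */
static inline void bucket_push(bucket_t *b, uint64_t item) {
    b->staging[b->count++] = item;
    if (b->count == FLUSH_AT) {
        memcpy(b->dest, b->staging, sizeof(b->staging));
        b->dest += FLUSH_AT;
        b->count = 0;
    }
}

/* At the end, drain whatever remains in each staging buffer. */
static void bucket_drain(bucket_t *buckets) {
    for (size_t i = 0; i < NBUCKETS; i++) {
        memcpy(buckets[i].dest, buckets[i].staging,
               buckets[i].count * sizeof(uint64_t));
        buckets[i].dest += buckets[i].count;
        buckets[i].count = 0;
    }
}

With these illustrative parameters the staging buffers total 256 × 64 × 8 bytes = 128 KB, which shows how quickly the cache-capacity concern raised in the next paragraph appears.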
This doesn’t scale to very large arrays: at some point, either the bucket size or the buffer tails no longer fit in the cache. It makes me wonder whether this algorithm can be layered efficiently, something like the funnelsort approach.
I tried using non-temporal writes to the buffers, but for variable or sub-64B item sizes, it still needs somewhere to do the write combining, such as the hardware write-combining buffers (WCBs) or yet another temporary buffer in cache. This doesn’t work so well once the number of buckets gets very large.
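For items that do fill a whole 64-byte line, one way to write the flush without polluting the cache is streaming stores. A sketch using the SSE2 _mm_stream_si128 intrinsic (x86-specific, and assuming 16-byte-aligned source and destination; copy_line_nt is an illustrative name):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Copy one 64-byte item from a cached staging buffer to the main array
   using non-temporal stores, so the destination lines bypass the cache.
   Both pointers are assumed 16-byte aligned. */
static inline void copy_line_nt(void *dst, const void *src) {
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    _mm_stream_si128(d + 0, _mm_load_si128(s + 0));
    _mm_stream_si128(d + 1, _mm_load_si128(s + 1));
    _mm_stream_si128(d + 2, _mm_load_si128(s + 2));
    _mm_stream_si128(d + 3, _mm_load_si128(s + 3));
}
/* Call _mm_sfence() after the last streaming store if another thread
   will read the destination. */

Streaming stores only pay off when whole lines are written at once, which is exactly why smaller or variable-size items still need some cached staging area to assemble full lines, as the comment notes.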
I will compare your code against my method once I get back to this project.
I’m also curious to see how it compares with prefetches and a ring delay. This random-access situation might be the only valid use case for manual prefetches.
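A sketch of the prefetch-plus-delay idea (not the article's code): issue a prefetch for the destination of element i, and only perform the scattered write a fixed number of iterations later, once the line has had time to arrive. DELAY and delayed_scatter are assumed names; the delay would need tuning to cover memory latency.

#include <stdint.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch */

#define DELAY 16  /* assumed ring size */

/* Scatter input[i] to out[index[i]], prefetching each destination
   DELAY iterations before it is written. */
void delayed_scatter(uint64_t *out, const uint64_t *input,
                     const uint32_t *index, size_t n) {
    size_t i = 0;
    /* Warm-up: issue the first DELAY prefetches without writing. */
    for (; i < DELAY && i < n; i++) {
        _mm_prefetch((const char *)&out[index[i]], _MM_HINT_T0);
    }
    for (; i < n; i++) {
        _mm_prefetch((const char *)&out[index[i]], _MM_HINT_T0);
        size_t j = i - DELAY;            /* element prefetched DELAY steps ago */
        out[index[j]] = input[j];
    }
    /* Drain: write the last DELAY elements. */
    for (size_t j = (n > DELAY ? n - DELAY : 0); j < n; j++) {
        out[index[j]] = input[j];
    }
}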
Alexander Monakov says:
I am confused: if you know that the THP setting matters here, why is the rest of the text focused purely on cache misses? With such a huge array (4GB), address translation misses with 4K pages will likely be the dominating factor, and indeed a quick test with hugepages allowed shows that they bring a 2x-3x speedup (on Sandybridge).
Page faults occur as part of a cache miss.
As far as I can tell, page faults are not relevant for performance here, because they mostly occur on initial accesses to the array (i.e., in memset). Once the address mapping is populated, subsequent accesses cause TLB misses that are resolved via page-table walks, without causing faults.
TLB misses do not necessarily imply cache misses: it is possible to have data in cache that still needs a page walk because the relevant TLB entry has been evicted.
I have the same comeback. A TLB is a cache. A page walk occurs following a cache miss (at the TLB level). I’d grant you that it is not how most people would understand the term, but maybe you can believe me that this is what I had in mind. To be clear, as your first comment implies, I am aware that page size is an important factor in this benchmark.
In case there is any doubt: I am not claiming (at all) that my solution (with or without playing with the page size) is best. It is definitely the case that you can improve it with large/huge pages (thanks for pointing it out).
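For readers who want to reproduce the huge-page effect discussed above: on Linux, with transparent huge pages set to madvise (or always), the kernel can be hinted on a per-allocation basis. A sketch, not code from the post (alloc_huge_hint is an illustrative helper):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate `bytes` of anonymous memory and hint the kernel to back it
   with transparent huge pages. */
static void *alloc_huge_hint(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    madvise(p, bytes, MADV_HUGEPAGE); /* best-effort hint; safe to ignore failure */
    return p;
}

int main(void) {
    size_t bytes = (size_t)1 << 32;  /* 4 GB array, as in the benchmark (64-bit build) */
    uint64_t *array = alloc_huge_hint(bytes);
    if (!array) { perror("mmap"); return EXIT_FAILURE; }
    /* ... run the benchmark against `array` ... */
    munmap(array, bytes);
    return 0;
}

Alternatively, setting /sys/kernel/mm/transparent_hugepage/enabled to "always" applies the policy to all anonymous memory without code changes.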
gene says:
I would discuss the buddy memory system and Linux kernel caching, which may have used it at one point.