12th April 2018, 14 min read

For greater speed, try batching your out-of-cache data accesses

14 thoughts on “For greater speed, try batching your out-of-cache data accesses”

Matthew Wozniczka says:

April 13, 2018 at 1:34 am

Are you missing a ‘- 1’ in your assert?
1. Daniel Lemire says:
  
  April 13, 2018 at 3:15 pm
  
  I corrected the typo.
alecco says:

April 13, 2018 at 12:52 pm

There is an interesting recent post by ryg on this topic: A whirlwind introduction to dataflow graphs
alecco says:

April 14, 2018 at 11:29 am

There must be something off here. First I thought it could be a difference in Hardware Prefetcher or Loop Optimizer. But that didn’t make sense when debugging. So I dived into the generated code for the sumrandom function first:

uint64_t sumrandom(uint64_t *values, size_t size) { uint64_t sum = 0; for (size_t k = 0; k < size; ++k) { sum += values[murmur32(k) % size ]; // dependency! } return sum; }

Checking the memory access dependency some nasty instruction pops up:

.L25: movl %ecx, %eax shrl $16, %eax xorl %ecx, %eax addq $1, %rcx imull $-2048144789, %eax, %eax movl %eax, %edx shrl $13, %edx xorl %edx, %eax imull $-1028477387, %eax, %eax movl %eax, %edx shrl $16, %edx xorl %edx, %eax xorl %edx, %edx divq %rsi # rdx:rax /= rsi EXPENSIVE addq (%rdi,%rdx,8), %r8 # r8 += rdi[ rdx * 8 ] cmpq %rcx, %rsi jne .L25

The latency cost of the hashing will be in the modulo for murmur32(k) % size as it requires a 64 bit division. Agner reports 21-83 cycles. If we make the size of the hash table a power of 2 the compiler can get the hash bits with a mask using andl instead of the expensive divq.

If we change

void demo() { size_t N = 1024 * 1024 * 16; // 16M instead of 13M

the divq goes away:

.L25: movl %ecx, %edx shrl $16, %edx xorl %ecx, %edx addq $1, %rcx imull $-2048144789, %edx, %edx movl %edx, %esi shrl $13, %esi xorl %esi, %edx imull $-1028477387, %edx, %edx movl %edx, %esi shrl $16, %esi xorl %esi, %edx andl $16777215, %edx # and to 0xFFFFFF, no divq addq (%rdi,%rdx,8), %rax cmpq $16777216, %rcx jne .L25

Comparing 13M table size bench:

cc -march=native -O3 -o batchload batchload.c [demo] N= 13631488 populaterandom(indexes, N) : 6.064 cycles per operation (best) sumrandom(values, N) : 52.113 cycles per operation (best) 52.556 cycles per operation (avg) sumrandomandindexes(values, indexes, N) : 37.835 cycles per operation (best) 38.245 cycles per operation (avg) sumrandomfromindexes(values, indexes, N) : 30.716 cycles per operation (best) 31.593 cycles per operation (avg)

with 16M table size bench:

cc -march=native -O3 -o batchload batchload.c [demo] N= 16777216 populaterandom(indexes, N) : 1.733 cycles per operation (best) sumrandom(values, N) : 37.639 cycles per operation (best) 38.143 cycles per operation (avg) sumrandomandindexes(values, indexes, N) : 34.819 cycles per operation (best) 35.196 cycles per operation (avg) sumrandomfromindexes(values, indexes, N) : 32.992 cycles per operation (best) 33.326 cycles per operation (avg)

performance now is practically the same. So the latency for the murmur hash and the modulo operation must be 15+ cycles. It’s quite good for having a division involved. So the savings come from removing a dependency on the hash table address calculation.

Mystery solved 🙂
1. Daniel Lemire says:
  
  April 14, 2018 at 12:03 pm
  
  I enjoyed your comment, but how does it solve the mystery? Can you elaborate?
alecco says:

April 14, 2018 at 12:47 pm

Oh, sorry, forgot to make it clear. From the post you mention the cost of hashing is 7 cycles yet removing the dependency showed savings of 15+ cycles. That didn’t add up.

Another interesting thing is why is it only 15 cycles given the divq taking in theory 21+ cycles. Well, the compiler generated another version of the function with constant propagation that uses a multiplicative inverse:

sumrandom.constprop.2: .LFB50: .cfi_startproc xorl %r8d, %r8d xorl %esi, %esi movabsq $5675921253449092805, %r9 # multiplicative inverse of 13 .p2align 4,,10 .p2align 3 .L11: movl %esi, %ecx shrl $16, %ecx xorl %esi, %ecx addq $1, %rsi imull $-2048144789, %ecx, %ecx movl %ecx, %eax shrl $13, %eax xorl %eax, %ecx imull $-1028477387, %ecx, %ecx movl %ecx, %eax shrl $16, %eax xorl %eax, %ecx movq %rcx, %rax mulq %r9 # rdx:rax *= 13's mul inverse shrq $22, %rdx # rdx >>= 22 ... leaq (%rdx,%rdx,2), %rax leaq (%rdx,%rax,4), %rax salq $20, %rax subq %rax, %rcx addq (%rdi,%rcx,8), %r8 cmpq $13631488, %rsi jne .L11 movq %r8, %rax ret

For people missing this trick, here is an article on multiplicative inverse trick for modulus. Compilers take advantage of address calculation with leaq for speed. Still this is much more expensive than a simple andl mask. Probably why it’s 15 cycles instead of 21+.
1. alecco says:
  
  April 14, 2018 at 1:35 pm
  
  Still this is much more expensive than a simple andl mask. Probably why it’s 15 cycles instead of 21+.
  
  I meant to say
  
  Still this calculation is much more expensive than a simple andl mask. It probably raises the total cost of hashing to 15 cycles* instead of 21+. And it shows when there’s a direct dependency for the memory access.
alecco says:

April 14, 2018 at 5:39 pm

Forgot to say, IMHO perf reporting cache misses is misleading.
me says:

April 15, 2018 at 8:11 am

Avoid the division (in modulo).

It kills pipelining: the addition depends on the division, so it has to wait for it.

In the array-cached variant, you allow pipelining the divisions by removing the dependency; the subsequent lookups are probably much faster than the divisions. I can only speculate that the cache misses go down because of additional pipelining in the second loop.
alecco says:

April 16, 2018 at 8:14 am

There’s a comment from Saturday awaiting moderation due to it having external links. In that comment I show the version of the function benchmarked is another one with constant propagation. So it doesn’t do divq but it does the division via 13’s multiplicative inverse, which is somewhat faster.

So, to complete the analysis, here is again the original version:

cc -march=native -O3 -o batchload batchload.c [demo] N= 13631488 populaterandom(indexes, N) : 6.075 cycles per operation (best) sumrandom(values, N) : 51.834 cycles per operation (best) 51.904 cycles per operation (avg) sumrandomandindexes(values, indexes, N) : 36.707 cycles per operation (best) 36.962 cycles per operation (avg) sumrandomfromindexes(values, indexes, N) : 31.528 cycles per operation (best) 31.754 cycles per operation (avg)

And now let’s compare with bringing N from outside so it can’t use a constant:

void demo( const int m ) { size_t N = 1024 * 1024 * m;
printf("[demo] N= %zu \n", N); uint64_t *values = malloc(N * sizeof(uint64_t)); uint32_t *indexes = malloc(N * sizeof(uint32_t)); for (size_t i = 0; i < N; i++) { values[i] = i * 3 - 2; } uint64_t answer = sumrandom(values, N); #define repeat 5 BEST_TIME_NOCHECK(populaterandom(indexes, N), , repeat, N, true); BEST_TIME(sumrandom(values, N), answer, , repeat, N, true); BEST_TIME(sumrandomandindexes(values, indexes, N), answer, , repeat, N, true); BEST_TIME(sumrandomfromindexes(values, indexes, N), answer, , repeat, N, true); free(values); free(indexes); }
int main( int argc, char *argv[] ) { if ( argc < 2 ) return -1; int m = atoi( argv[ 1 ] ); demo( m ); return EXIT_SUCCESS; }

and let’s benchmark this version:

$ ./batchload 13 [demo] N= 13631488 populaterandom(indexes, N) : 26.553 cycles per operation (best) sumrandom(values, N) : 80.907 cycles per operation (best) 81.980 cycles per operation (avg) sumrandomandindexes(values, indexes, N) : 58.007 cycles per operation (best) 58.351 cycles per operation (avg) sumrandomfromindexes(values, indexes, N) : 31.712 cycles per operation (best) 32.118 cycles per operation (avg)

The implementation sumrandom.constprop.2 goes away and we see a much slower hashing for both versions and the difference when removing the hash dependency is 23 cycles.

This is an important point often neglected in papers and books. Hash Tables have a cost of converting the table key to the range. This cost is significant and much bigger than the hashing function itself. In many papers what they do is use a table size 2^x to remove this cost. This might not be the case in a production-level implementation as we can’t keep growing tables by 2x forever.

Now I think we are done. 🙂
Mathias Gaunard says:

April 16, 2018 at 9:13 am

You probably want to block to deal well with larger values of ‘size’.
alecco says:

April 16, 2018 at 12:28 pm

@mserdarsanli adapted the code to a web benchmark with C++17 (how cool, huh?)

Seems GCC 7.3 can optimize for the batched case better than clang 6.0. In fact, for the later the caching of hashes is even slower!
alecco says:

April 16, 2018 at 12:36 pm

Same benchmark for GCC 7.3 with 16M entries (andl mask instead of modulus)
alecco says:

April 16, 2018 at 2:52 pm

Last link is wrong.

For 16M entries: GCC 7.3 vs CLANG 6.0 show they are practically the same using andl mask and the original difference on this post doesn’t happen at all. (Like in my first comment)