18th March 2020, 6 min read

Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores

6 thoughts on “Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores”

degski says:

March 18, 2020 at 4:03 pm

Multi-threading will influence that though, unless I’m misunderstanding things. Changes to a cache line on one core are not immediately visible in that cache line as seen from another core [and not necessarily in that order]; the cores need to be ‘synchronized’ [the technical term is surely something else], which takes time/cycles.

In C++, the alignment of the object will be at least its size; aligning on 48 bits is UB [casting a void pointer returned from malloc to a type does not create (an) object(s) of that type, and even for ‘objects’ of type int this is technically UB; one needs to go through placement new, which imposes the alignment].

Having said that, current compilers don’t seem to have a problem with any of the above.
1. Daniel Lemire says:
  
  March 18, 2020 at 9:01 pm
  
  In this instance, I am relying on Intel intrinsics which have “unaligned” as part of their specification (look for the small “u” in the name). So my code is not relying on undefined behaviour.
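To make the idea in the title concrete, here is a minimal sketch in C. It assumes 64-byte cache lines and uses the SSE2 unaligned intrinsics `_mm_loadu_si128` and `_mm_storeu_si128` (with a `memcpy` fallback on targets without SSE2). The helper names `crosses_cache_line` and `store32_as_two_halves` are made up for illustration and do not come from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_storeu_si128 */
#endif

#define CACHE_LINE_BYTES 64 /* assumption: 64-byte cache lines */

/* Returns 1 if an access of `size` bytes starting at `p` spans two cache lines. */
static int crosses_cache_line(const void *p, size_t size) {
    assert(size > 0 && size <= CACHE_LINE_BYTES);
    uintptr_t offset = (uintptr_t)p % CACHE_LINE_BYTES;
    return offset + size > CACHE_LINE_BYTES;
}

/* Copies 32 bytes from src to dst as two 16-byte halves.
   Neither pointer needs any particular alignment. */
static void store32_as_two_halves(uint8_t *dst, const uint8_t *src) {
#if defined(__SSE2__)
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_loadu_si128((const __m128i *)(src + 16));
    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)(dst + 16), hi);
#else
    memcpy(dst, src, 16); /* portable fallback */
    memcpy(dst + 16, src + 16, 16);
#endif
}
```

With a destination 48 bytes into a cache line, a single 32-byte store would cover offsets 48 to 79 and so touch two lines, whereas the two 16-byte halves cover offsets 48 to 63 and 64 to 79, and each stays within one line.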
burak says:

April 2, 2020 at 6:22 pm

Thank you for sharing, this is a very interesting read. Still, I have to split the __m256, is that correct? I was struggling with a similar problem and got excited when I read this post, but I think it is still not the answer to my problem: I had 12% cache misses because I had tiled/vectorized a large 3-dimensional nested array, so each iteration jumps far forward. Without tiling I got 1%, but an additional second of processing time.
1. Daniel Lemire says:
  
  April 2, 2020 at 7:38 pm
  
  I don’t understand why you would get more cache misses… It should not matter how you read the data as far as cache misses go. You could read the data byte-by-byte… and still get the same number of cache misses.
  1. burak says:
    
    April 2, 2020 at 8:44 pm
    
    Sorry, I don’t have an exact answer, and sorry for the cryptic code; I am still studying/working on it. Now the code looks like this:
    
        ..another loop...
        int siz = n1 - (n1 & 7);
        int mi = dMatrixInfo[i][1];
        for (int j = 0; j < siz; j = j + 8) {
            ...
            float *d2 = &(d[mi * m++]);
            float *d22 = &(d[mi * m++]);
            float *d23 = &(d[mi * m++]);
            float *d24 = &(d[mi * m++]);
            float *d25 = &(d[mi * m++]);
            float *d26 = &(d[mi * m++]);
            float *d27 = &(d[mi * m++]);
            float *d28 = &(d[mi * m]);
            int size = n2 - (n2 & 7);
            for (int k = 0; k < size; k = k + 8) {
                _mulAddBroadcast(&d2[k], &eVal, &n[k]);
                _mulAddBroadcast(&d22[k], &eVal2, &n[k]);
                _mulAddBroadcast(&d23[k], &eVal3, &n[k]);
                _mulAddBroadcast(&d24[k], &eVal4, &n[k]);
                _mulAddBroadcast(&d25[k], &eVal5, &n[k]);
                _mulAddBroadcast(&d26[k], &eVal6, &n[k]);
                _mulAddBroadcast(&d27[k], &eVal7, &n[k]);
                _mulAddBroadcast(&d28[k], &eVal8, &n[k]);
                ....
    
    which I have tiled from
    
        int size = n2 - (n2 & 7);
        for (int d = 0; d < size; d += 8) {
            _mulAddBroadcast(&d2[d], &eVal, &n[d]);
        }
        for (int d = size; d < n2; d++) {
            d2[d] = fma(eVal, n[d], d2[d]);
        }
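The two access patterns in this thread can be compared with a scalar sketch. It assumes that `_mulAddBroadcast` computes `d[k] += e * n[k]` over eight floats, which is inferred only from the scalar tail loop above. The names `update_row_by_row`, `update_tiled`, `TILE` and `stride` (standing in for `mi`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 8 /* number of rows updated together, as in the tiled code above */

/* Untiled: finish one row before moving to the next, so memory is walked
   sequentially within each row. */
static void update_row_by_row(float *d, size_t stride, size_t rows, size_t cols,
                              const float *e, const float *n) {
    assert(cols <= stride);
    for (size_t r = 0; r < rows; r++) {
        float *row = d + r * stride;
        for (size_t k = 0; k < cols; k++) {
            row[k] += e[r] * n[k];
        }
    }
}

/* Tiled: for each column k, update TILE rows whose addresses are `stride`
   floats apart, so consecutive accesses jump between distant cache lines. */
static void update_tiled(float *d, size_t stride, size_t rows, size_t cols,
                         const float *e, const float *n) {
    assert(cols <= stride);
    size_t r0 = 0;
    for (; r0 + TILE <= rows; r0 += TILE) {
        for (size_t k = 0; k < cols; k++) {
            for (size_t t = 0; t < TILE; t++) {
                d[(r0 + t) * stride + k] += e[r0 + t] * n[k];
            }
        }
    }
    /* Leftover rows that do not fill a whole tile. */
    for (size_t r = r0; r < rows; r++) {
        for (size_t k = 0; k < cols; k++) {
            d[r * stride + k] += e[r] * n[k];
        }
    }
}
```

Both functions touch exactly the same elements and produce the same result; only the order of the accesses differs. The tiled version reuses each `n[k]` across eight rows, at the price of keeping eight separate rows live in the cache at once.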
Charles Goodwin says:

April 14, 2020 at 10:27 pm

I’ve been using headers on various data types, so a string might be stored as a pointer to the character data, with some extra information stored immediately before the first character in memory. If I access the character data, then step backwards to get an item from the header, am I likely to get a cache miss / stall that I might not get if I stored a pointer to the start of the header and accessed the character data as an offset from that?
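One way to reason about this layout is sketched below, assuming 64-byte cache lines. The type `str_header` and the functions `str_new`, `str_length`, `str_free` and `same_cache_line` are hypothetical names for illustration, not taken from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE_BYTES 64 /* assumption: 64-byte cache lines */

/* Header stored immediately before the first character. */
typedef struct {
    size_t length;
} str_header;

/* Allocates header + characters in one block and returns a pointer to the
   character data, as in the layout described above. */
static char *str_new(const char *text) {
    size_t len = strlen(text);
    str_header *h = malloc(sizeof(str_header) + len + 1);
    if (h == NULL) return NULL;
    h->length = len;
    char *data = (char *)(h + 1);
    memcpy(data, text, len + 1);
    return data;
}

/* Steps backwards from the character data to read the header. */
static size_t str_length(const char *data) {
    assert(data != NULL);
    return ((const str_header *)data - 1)->length;
}

static void str_free(char *data) {
    if (data != NULL) free((str_header *)data - 1);
}

/* Returns 1 if two addresses fall in the same cache line. */
static int same_cache_line(uintptr_t a, uintptr_t b) {
    return a / CACHE_LINE_BYTES == b / CACHE_LINE_BYTES;
}
```

The addresses touched are the same whether the stored pointer refers to the header or to the first character, so the choice of pointer alone does not change which cache lines are needed. The header and the first character land in different lines only when a line boundary falls between them, which `same_cache_line` can check for a given allocation.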