Daniel Lemire's blog

6 min read

Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores

6 thoughts on “Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores”

  1. degski says:

    Multi-threading will influence that though, unless I’m misunderstanding things. Changes to a cache line on one core are not immediately visible in that cache line as viewed from another core [and not necessarily in that order] and need to be ‘synchronized’ [the technical term is surely something else], which takes time/cycles.

    In C++, the alignment of the object will be at least its size, so aligning on 48 bits is UB [casting a void pointer returned from malloc to a type does not create (an) object(s) of that type, and even for ‘objects’ of type int this is technically UB; one needs to go through placement new, which imposes the alignment].

    Having said that, current compilers don’t seem to have a problem with any of the above.

    1. In this instance, I am relying on Intel intrinsics which have “unaligned” as part of their specification (look for the small “u” in the name). So my code is not relying on undefined behaviour.
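As a rough illustration of why the unaligned ("u") intrinsics matter here, such as `_mm256_storeu_si256` and `_mm_storeu_si128`: they accept any address, so a 32-byte store may straddle two cache lines, whereas two 16-byte stores at the same location may each stay within one line. The sketch below only does the address arithmetic. The 64-byte line size and the offset of 48 in the usage note are assumptions for illustration, and `crosses_cache_line` is a hypothetical helper, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on most current x86-64 processors. */
#define CACHE_LINE_SIZE 64

/* Returns 1 if a store of `size` bytes starting at address `addr`
   touches more than one cache line, 0 if it fits within a single line.
   `size` must be greater than zero. */
static int crosses_cache_line(uintptr_t addr, size_t size) {
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

With these assumptions, `crosses_cache_line(48, 32)` reports a split for one 256-bit store at offset 48, while `crosses_cache_line(48, 16)` and `crosses_cache_line(64, 16)` report none for the two 128-bit halves.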

  2. burak says:

    Thank you for sharing, this is a very interesting read. Still, I have to split the __m256, is that correct? I was struggling with a similar problem and got excited when I read this post, but I think it is still not the answer to my problem: I had 12% cache misses because I had tiled/vectorized a large 3-dimensional nested array, so each iteration jumps way forward. Without tiling I got 1%, but an additional second of processing time.

    1. I don’t understand why you would get more cache misses… It should not matter how you read the data as far as cache misses go. You could read the data byte-by-byte… and still get the same number of cache misses.
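One way to picture the point about read granularity is to count the distinct cache lines touched by a single sequential pass over the same bytes, for different read sizes. This is a simplified model only: `lines_touched` is a hypothetical helper, the 64-byte line size is an assumption, and real miss rates also depend on access order, cache capacity and hardware prefetching.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_SIZE 64

/* Counts the distinct cache lines touched when reading `length` bytes
   starting at address `base`, front to back, `chunk` bytes per read.
   `chunk` must be greater than zero. Because the pass is sequential,
   line numbers never decrease, so comparing against the previously
   seen line is enough to detect a new one. */
static size_t lines_touched(uintptr_t base, size_t length, size_t chunk) {
    size_t count = 0;
    uintptr_t previous_line = 0;
    int have_previous = 0;
    for (size_t offset = 0; offset < length; offset += chunk) {
        /* The final read may be shorter than `chunk`. */
        size_t n = (length - offset < chunk) ? length - offset : chunk;
        uintptr_t first = (base + offset) / CACHE_LINE_SIZE;
        uintptr_t last = (base + offset + n - 1) / CACHE_LINE_SIZE;
        for (uintptr_t line = first; line <= last; line++) {
            if (!have_previous || line != previous_line) {
                count++;
                previous_line = line;
                have_previous = 1;
            }
        }
    }
    return count;
}
```

In this model, reading 256 bytes from a line-aligned address touches four lines whether `chunk` is 1 or 32. What changes with tiling is the order in which lines are visited, which this count does not capture.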

      1. burak says:

        Sorry, I don’t have an exact answer, and sorry for the cryptic code. I am still studying/working on it. The code now looks like this:

        // ...another loop...
        int siz = n1 - (n1 & 7);          // n1 rounded down to a multiple of 8
        int mi = dMatrixInfo[i][1];
        for (int j = 0; j < siz; j = j + 8) {
            // pointers to eight consecutive rows of d
            float *d2 = &(d[mi * m++]);
            float *d22 = &(d[mi * m++]);
            float *d23 = &(d[mi * m++]);
            float *d24 = &(d[mi * m++]);
            float *d25 = &(d[mi * m++]);
            float *d26 = &(d[mi * m++]);
            float *d27 = &(d[mi * m++]);
            float *d28 = &(d[mi * m]);    // note: m is not incremented here
            int size = n2 - (n2 & 7);     // n2 rounded down to a multiple of 8
            for (int k = 0; k < size; k = k + 8) {
                _mulAddBroadcast(&d2[k], &eVal, &n[k]);
                _mulAddBroadcast(&d22[k], &eVal2, &n[k]);
                _mulAddBroadcast(&d23[k], &eVal3, &n[k]);
                _mulAddBroadcast(&d24[k], &eVal4, &n[k]);
                _mulAddBroadcast(&d25[k], &eVal5, &n[k]);
                _mulAddBroadcast(&d26[k], &eVal6, &n[k]);
                _mulAddBroadcast(&d27[k], &eVal7, &n[k]);
                _mulAddBroadcast(&d28[k], &eVal8, &n[k]);
            }
        }

        which I have tiled from

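        int size = n2 - (n2 & 7);         // n2 rounded down to a multiple of 8
        // main loop, 8 elements per step
        for (int d = 0; d < size; d += 8) {
            _mulAddBroadcast(&d2[d], &eVal, &n[d]);
        }
        // scalar loop for the remaining elements
        for (int d = size; d < n2; d++) {
            d2[d] = fma(eVal, n[d], d2[d]);
        }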

  3. I’ve been using headers on various data types, so a string might be stored as a pointer to the character data, with some extra information stored immediately before the first character in memory. If I access the character data, then step backwards to get an item from the header, am I likely to get a cache miss / stall that I might not get if I stored a pointer to the start of the header and accessed the character data as an offset from that?
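The layout described in this question can be sketched as follows, with a helper that checks whether the header and the first character fall in the same cache line. When they do, reading one after the other touches a single line regardless of which pointer is stored. All names here (`str_header`, `str_new`, `str_get_header`, `str_free`, `header_shares_line`) are hypothetical, the header holds only a length for illustration, and the 64-byte line size is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_SIZE 64

/* Hypothetical header stored immediately before the character data. */
typedef struct {
    size_t length;
} str_header;

/* Allocates header and characters in one block and returns a pointer
   to the first character, or NULL if the allocation fails. */
static char *str_new(const char *text) {
    size_t length = strlen(text);
    str_header *h = malloc(sizeof(str_header) + length + 1);
    if (h == NULL) {
        return NULL;
    }
    h->length = length;
    char *chars = (char *)(h + 1);      /* data starts right after the header */
    memcpy(chars, text, length + 1);    /* copy including the terminating NUL */
    return chars;
}

/* Steps backwards from the character pointer to reach the header. */
static str_header *str_get_header(char *chars) {
    return (str_header *)chars - 1;
}

/* Frees the whole block, which begins at the header. */
static void str_free(char *chars) {
    free(str_get_header(chars));
}

/* Returns 1 if the first byte of the header and the first character
   lie in the same cache line, 0 otherwise. */
static int header_shares_line(const char *chars) {
    uintptr_t data = (uintptr_t)chars;
    uintptr_t header = data - sizeof(str_header);
    return header / CACHE_LINE_SIZE == data / CACHE_LINE_SIZE;
}
```

Whether the two share a line depends on where the allocator places the block, so `header_shares_line` is best used as a measurement on real allocations rather than taken as a guarantee.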