, 7 min read
Can 32-byte alignment alleviate 4K aliasing?
In my previous post, I considered some performance problems that can plague simple loops that read and write data…
for (int i = 0; i < a.length; ++i) {
a[i] += s * b[i];
}
Once vectorized, such a loop can suffer from what Intel calls 4K Aliasing: if the arrays are stored in memory at locations that differ by “almost” a multiple of 4kB, then it can look to the processor like you are storing data and quickly reading it again. This confuses the processor because it only discriminates on addresses using the least significant 12 bits of addresses.
In my previous blog post, I conjectured that the problem is harder to generate if your arrays are aligned on 32-byte boundaries (that is, if the array starts at an address that is divisible by 32).
Aleksey Shipilev disagreed and stated on Twitter that the problem was easily reproducible even when arrays are 32-byte aligned.
Aleksey refers to work done on OpenJDK to prevent this problem with respect to array copies. Aleksey used an older Haswell processor, instead of the more recent Skylake processor that I am using. So I am going to go back to a Haswell processor.
What I am going to do this time is align the destination array (a) on 32-byte boundaries, always. Then I am going to set the other array 4096 bytes ahead, minus some offset in bytes. I then report the number of cycles used per array element.
$ gcc -O3 -o align32byte align32byte.c -mavx2 && ./align32byte
offset: 0 bytes
vecdaxpy(a, b, s, N) : 0.543 cycles per operation
offset: 1 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 2 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 3 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 4 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 5 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 6 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 7 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 8 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 9 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 10 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 11 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 12 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 13 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 14 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 15 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 16 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 17 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 18 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 19 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 20 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 21 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 22 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 23 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 24 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 25 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 26 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 27 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 28 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 29 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 30 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 31 bytes
vecdaxpy(a, b, s, N) : 0.707 cycles per operation
offset: 32 bytes (1 32-byte)
vecdaxpy(a, b, s, N) : 0.648 cycles per operation
offset: 64 bytes (2 32-byte)
vecdaxpy(a, b, s, N) : 0.637 cycles per operation
offset: 96 bytes (3 32-byte)
vecdaxpy(a, b, s, N) : 0.625 cycles per operation
offset: 128 bytes (4 32-byte)
vecdaxpy(a, b, s, N) : 0.613 cycles per operation
offset: 160 bytes (5 32-byte)
vecdaxpy(a, b, s, N) : 0.613 cycles per operation
offset: 192 bytes (6 32-byte)
vecdaxpy(a, b, s, N) : 0.613 cycles per operation
offset: 224 bytes (7 32-byte)
vecdaxpy(a, b, s, N) : 0.625 cycles per operation
offset: 256 bytes (8 32-byte)
vecdaxpy(a, b, s, N) : 0.613 cycles per operation
offset: 288 bytes (9 32-byte)
vecdaxpy(a, b, s, N) : 0.625 cycles per operation
offset: 320 bytes (10 32-byte)
vecdaxpy(a, b, s, N) : 0.613 cycles per operation
offset: 352 bytes (11 32-byte)
vecdaxpy(a, b, s, N) : 0.602 cycles per operation
offset: 384 bytes (12 32-byte)
vecdaxpy(a, b, s, N) : 0.590 cycles per operation
offset: 416 bytes (13 32-byte)
vecdaxpy(a, b, s, N) : 0.566 cycles per operation
offset: 448 bytes (14 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
offset: 480 bytes (15 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
offset: 512 bytes (16 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
So there is a 30% performance penalty for offsets by a small number of bytes. The penalty drops to 20% or less as soon as the second array is aligned on a 32-byte boundary. The penalty eventually goes away entirely when the offset reaches about 512 bytes.
Let us run the same experiment on my Skylake processor:
$ gcc -O3 -o align32byte align32byte.c -mavx2 && ./align32byte
offset: 0 bytes
vecdaxpy(a, b, s, N) : 0.477 cycles per operation
offset: 1 bytes
vecdaxpy(a, b, s, N) : 0.805 cycles per operation
offset: 2 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 3 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 4 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 5 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 6 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 7 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 8 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 9 bytes
vecdaxpy(a, b, s, N) : 0.805 cycles per operation
offset: 10 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 11 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 12 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 13 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 14 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 15 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 16 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 17 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 18 bytes
vecdaxpy(a, b, s, N) : 0.812 cycles per operation
offset: 19 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 20 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 21 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 22 bytes
vecdaxpy(a, b, s, N) : 0.805 cycles per operation
offset: 23 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 24 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 25 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 26 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 27 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 28 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 29 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 30 bytes
vecdaxpy(a, b, s, N) : 0.797 cycles per operation
offset: 31 bytes
vecdaxpy(a, b, s, N) : 0.789 cycles per operation
offset: 32 bytes (1 32-byte)
vecdaxpy(a, b, s, N) : 0.586 cycles per operation
offset: 64 bytes (2 32-byte)
vecdaxpy(a, b, s, N) : 0.570 cycles per operation
offset: 96 bytes (3 32-byte)
vecdaxpy(a, b, s, N) : 0.562 cycles per operation
offset: 128 bytes (4 32-byte)
vecdaxpy(a, b, s, N) : 0.562 cycles per operation
offset: 160 bytes (5 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
offset: 192 bytes (6 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
offset: 224 bytes (7 32-byte)
vecdaxpy(a, b, s, N) : 0.547 cycles per operation
offset: 256 bytes (8 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
offset: 288 bytes (9 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
offset: 320 bytes (10 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
offset: 352 bytes (11 32-byte)
vecdaxpy(a, b, s, N) : 0.555 cycles per operation
offset: 384 bytes (12 32-byte)
vecdaxpy(a, b, s, N) : 0.523 cycles per operation
offset: 416 bytes (13 32-byte)
vecdaxpy(a, b, s, N) : 0.508 cycles per operation
offset: 448 bytes (14 32-byte)
vecdaxpy(a, b, s, N) : 0.500 cycles per operation
offset: 480 bytes (15 32-byte)
vecdaxpy(a, b, s, N) : 0.484 cycles per operation
offset: 512 bytes (16 32-byte)
vecdaxpy(a, b, s, N) : 0.477 cycles per operation
Interestingly, the penalty on Skylake is higher (over 50%) for offsets by arbitrary numbers of bytes. But when the arrays differ by multiple of 32 bytes, the penalty is reduced to about 20%, and it eventually goes away entirely.
My numbers suggest the following observations:
- It does not look like 4K aliasing improved between Haswell and Skylake. I have a recent laptop with an even more recent Kaby Lake processor, and the problem is still there. I am not including my laptop numbers because I do not trust benchmarking on a laptop… but I have enough confidence to state that the 4k aliasing problem is still a thing on the very latest Intel processors. What about AMD processors?
- Aligning arrays on 32-byte boundaries does seem to alleviate the problem. In my particular experiments, I only got a 20% penalty due to aliasing which seems more acceptable than a 50% penalty.
To me, it is an annoying problem because it means that when doing rather common computations over loops, your performance will differ quite a bit depending on where in the memory your arrays are located. Memory addresses is not something you have a lot of control over in the normal course of software. It is also not obvious to me how to make the problem go away. You can try to detect the problem and make the loop run in reverse… but it is not entirely trivial.