
Data alignment for speed: myth or reality?

53 thoughts on “Data alignment for speed: myth or reality?”

  1. Itman says:

    Hi Daniel,

    You are testing this using new hardware. I asked the very same question in 2009 and in 2011.

    The results speak for themselves. This is not just a fluke: Intel did change the architecture.

    Then in 2009:
    time 33837 us -> 0%
    time 47012 us -> 38%
    time 47065 us -> 39%
    time 47001 us -> 38%
    time 33788 us -> 0%
    time 47018 us -> 39%
    time 47049 us -> 39%
    time 47014 us -> 39%

    Now in 2011:

    time 89400 us -> 0%
    time 90374 us -> 1%
    time 90299 us -> 1%
    time 90365 us -> 1%
    time 89348 us -> 0%
    time 90672 us -> 1%
    time 90372 us -> 1%
    time 90318 us -> 1%

  2. Thomas says:

    Interesting.
    The version of the myth that I heard said the slowdown was because the processor will have to do two aligned reads and then construct the unaligned read from that. If I read your code correctly, you’re accessing memory sequentially. In that case, the extra reads might not hit memory. If the compiler figures it out, there might not even be extra reads. Well, actually, I don’t really know what I’m talking about with these low-level things. Still. I’m not saying your test is wrong, but it is always important to be careful about what you’re actually testing and how that generalises.

    That said, these low-level tests you’re posting are really cool :). Thanks.

    1. Henry says:

      Maybe that’s the point though. Maybe cache behavior and cache-line alignment (rather than word alignment) is the dominant factor here.

      It raises the question whether, on some processors, you could get good results by scrapping word alignment altogether and just worrying about the cache.

      1. Maynard+Handley says:

        What people commenting on this are missing is that the “addressable” units of a cache may be different from what you think.

        It’s natural to believe that the basic unit of a cache is the cache line (say 64B) and that reading from that is like reading from an array. But in fact most designs split a cache line into multiple smaller pieces that are somewhat independently addressable, either for power reasons or for banking reasons (ie to allow multiple independent accesses into the same line in a cycle).
        Intel, up till (I think) Sandy Bridge, banked by 8B units (ie imagine the stack of cache lines 64B wide, split them at 8B boundaries, and these 8 sub-arrays form 8 independently addressable banks). This was subject to frequent (ie noticeable) bank collisions (two accesses want to hit the same bank, perhaps in the same row, perhaps in different rows). This was somehow improved in Sandy Bridge and later — though apparently it’s still based on 8B wide units, and details remain uncertain/unknown.

        It’s not clear, in such a design, that a load that crosses two banks would automatically cost no extra cycles (perhaps the logic to shift and stitch two loads together might require an extra cycle). This is even more true in really old (by our standards) designs which can sustain only one access (or perhaps one load and one store) per cycle, so may have to run the load twice to get the two pieces before stitching them.

        In the case of Apple (this is probably just as true of other manufacturers) we can see from the patent record how they do this.

        Apple has two attempts:
        The first is 2005 https://patents.google.com/patent/US8117404B2.
        Notable features are
        – use of a predictor to track loads or stores that will cross cache line boundaries
        – if the predictor fires then crack the instruction into three parts that load low, load high, and stitch.
        In other words
        – already at this stage (2005) the problem is when crossing a line. Within line is not a problem — we’ll see why.
        – crossing a line is handled slowly but by reusing existing CPU functionality.

        The 2011 patent, https://patents.google.com/patent/US9131899B2 , is much more slick. Here the important parts are:
        – the cache is defined as two banks, one holding even lines, one holding odd lines. Regardless of their interior sub-banking, these two banks can be accessed in parallel.
        – the array holding addresses for the store queue has a structure that can also describe both an even line and an odd line, with the store data held in a separate array.

        These mean that
        – a load that straddles lines (and in the usual case of both lines in cache) can access both banks in parallel and pick up the data, and stitch it within cycle time. ie you don’t even pay a penalty for crossing cache lines.
        – a load that hits in store queue can, likewise, do the equivalent of probing for address matches in the even and odd halves of line address space to detect presence of matching non-aligned stores.
        – there is also a piece of auxiliary storage (the “sidecar”) to hold parts of the load for the truly horror story cases like part of the load from a line in cache, part in DRAM, or a short store that crosses a cache boundary and is within a large load that also crosses a cache boundary (so some data from store queue, some data from L1D).

        So almost all cases pay no latency cost, though they will pay a bandwidth cost (since the loads will use two “units” of the limited bandwidth to the cache). I believe that a transaction that crosses a page will pay a one cycle latency cost because the TLB will have to be hit twice, but I have not validated that.

        The reason I started (for Apple) with cache line crossings is that the Apple sub-arrays within the two banks are remarkably short!
        (2014) https://patents.google.com/patent/US9448936B2
        describes them as being one byte wide. This means (among other things) that stores can happen concurrently with loads to the same line, as long as they touch different bytes, and that is the focus of the patent.
        As far as I can tell experimentally (though the full picture still remains murky) this is true as of the M1, except that the minimal width is the double-byte not the single byte.

        In other words, whereas the Intel history was something like
        – extract all the data from an 8B unit — that will cover an aligned 8B load and many misaligned shorter loads, then
        – expand that to, in one cycle, extract from two 8B units and stitch

        the Apple history was more like
        – from the beginning, figure out which sub-banks (ie which bytes within a line) to activate and
        – how to route the bytes collected from those sub-banks to the load store unit

        What seems like a reasonable compromise, or an easy extension, in each case depends on your starting point — in this case “aggregating bytes” vs “de-aggregating oct-bytes”.

  3. Thomas says:

    Ah, from the page you linked to: “On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.”
    Sorry for commenting before reading the linked material ;).

  4. Daniel,

    I do not agree with your comment saying that it is a cache issue. All the memory used in this test is most likely in the cache.

    It really is a case of two 256-bit wide reads instead of one when the word you are reading crosses the boundary.

    (Note: I am guessing the reads are 256-bit wide; they might really be 128-bit wide, it’s just that 256-bit wide reads seem more likely.)

  5. @Thomas

    Thanks for the good words.

    I agree that you have to be careful, but that is why I post all my source code. If I’m wrong, then someone can hopefully show it by tweaking my experiment.

  6. Testing this is going to be tricky. Last I looked closely, CPUs fetch and cache chunks of words from memory. So unaligned sequential accesses are going to make no extra trips to memory (for the most part). If the cache handles misaligned references efficiently, then you might see little or no cost to misaligned access.

    If a misaligned reference crosses a chunk boundary, so that two chunks are pulled from memory, and only a single word is used of the two chunks, then you might see a considerable cost.

    Without building in knowledge of specific CPUs, you could construct a test. Allocate a buffer much larger than the CPU cache. Access a single word, bump the pointer by a power of two, increasing the exponent on each sweep of the buffer (until the stride is bigger than the cache line). Repeat the set of sweeps, bumping the starting index by one byte, until the starting index exceeds a cache line. (You are going to have to look up the biggest cache line for any CPU you test.)
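    A minimal sketch of that sweep (hypothetical; the buffer size, strides and timing mechanism are placeholders, not tuned values) might look like this in C:

        /* Sweep a large buffer, reading one 32-bit word per stride, for every
           starting byte offset up to a generous maximum cache-line size. */
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <time.h>

        #define BUF_SIZE (256 * 1024 * 1024) /* much larger than any cache */
        #define MAX_LINE 128                 /* largest cache line we care about */

        int main(void) {
            unsigned char *buf = malloc(BUF_SIZE + MAX_LINE + sizeof(uint32_t));
            if (!buf) return 1;
            memset(buf, 1, BUF_SIZE + MAX_LINE + sizeof(uint32_t));
            for (size_t start = 0; start <= MAX_LINE; start++) {           /* byte offset */
                for (size_t stride = sizeof(uint32_t); stride <= MAX_LINE; stride *= 2) {
                    uint32_t sum = 0;
                    clock_t t0 = clock();
                    for (size_t i = start; i + sizeof(uint32_t) <= BUF_SIZE; i += stride) {
                        uint32_t word;
                        memcpy(&word, buf + i, sizeof(word));               /* one word per step */
                        sum += word;
                    }
                    clock_t t1 = clock();
                    printf("start=%zu stride=%zu sum=%u ticks=%ld\n",
                           start, stride, sum, (long)(t1 - t0));
                }
            }
            free(buf);
            return 0;
        }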

    What I expect you to see is that most sweeps are fairly uniform, with spikes where a misaligned access crosses a cache line boundary.

    What that means (if true) is that most misaligned access will cost you very little, with the occasional optimally cache misaligned joker. (Requires the right starting offset and stride.)

    Still worth some attention, but much less likely you will get bit.

  7. Thomas says:

    This cache-chunk business sounds reasonable, but luckily it also sounds like it might be relatively rare. And then you would care about cache-chunk alignment, not something like word alignment.

    I just had a look at the disassembly of an optimised build by Visual Studio 10. It looks to me like it is indeed doing unaligned reads.

    p.s.: It seems to have unrolled the Rabin-Karp-like loop 5(?!) times.

  8. @Thomas

    5 times?

  9. Thomas says:

    Right, I guess “unrolled 5 times” doesn’t mean what I wanted to say. I mean: it does 5 iterations, then jumps back.

  10. Itman says:

    Thomas,

    BTW, unrolling loops does not get you a performance boost either on new hardware, in most situations.

  11. For most general random logic, this is unlikely to bite. The special cases are a *little* less unlikely than they appear. Power-of-two sized structures are not at all unlikely for some sorts of large problems. Accessing only the first or last word of a block is a common pattern. If the block start is misaligned, and the block is a multiple of a cache line size … you could get bit.

  12. @Bannister

    There is now an example closely related to your analysis in github. I have also updated my blog post accordingly.

  13. @Laurent

    I have removed the misleading comment.

  14. A. Non says:

    The speed of unaligned access is architecture-dependent. On the DEC Alphas, it would slow down your program by immense amounts because each unaligned access generated an interrupt. Since we don’t know what the future holds, it’s best to design your programs so they have aligned accesses when possible. After all, unaligned access has NEVER been faster than aligned access.

  15. @A. Non

    Certainly, avoiding unaligned accesses makes your code more portable, but if you are programming in C/C++ with performance in mind, you are probably making many portability trade-offs anyhow (e.g., using SSE instructions).

  16. Yifei says:

    If other operations in your code are slower than memory access, then most of the time is spent on those other operations, so you can’t see the difference between aligned and unaligned access.

    In your source code, the other operation is the multiply.

    You can try memory copy instead — use a for loop to copy an array manually.
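    For instance, a minimal sketch of such a copy loop (hypothetical; the offset parameter is only there to shift the source off its natural alignment):

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* Copy n 32-bit words from src + offset into dst; when offset is not a
           multiple of 4, every read from src is unaligned. */
        void copy_words(uint32_t *dst, const unsigned char *src, size_t n, size_t offset) {
            for (size_t i = 0; i < n; i++) {
                uint32_t w;
                memcpy(&w, src + offset + i * sizeof(uint32_t), sizeof(w));
                dst[i] = w;
            }
        }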

  17. @Yifei

    After disabling the multiplication, I get the same sort of result.

    Multiplication over integers is not expensive on modern processors due to pipelining.

  18. Interesting article thanks!

    I used to do a lot of ARM coding, and from what I remember exactly what the ARM does on a unaligned access depends on how the supporting hardware has been set up.

    You can either get an abort, which then gives the kernel an opportunity to fix up the non-aligned access in software (very slow!).

    Or you can read a byte-rotated word, so if you read a word at offset 1 you would read the 32-bit word at offset 0 but rotated by 8 bits. That was actually useful sometimes!

    I’m not sure about newer ARMs though.

  19. Alecco says:

    DarkShikari has some great insight on this issue, 4 years ago now.

    Cacheline splits, aka Intel hell (Feb 2008)
    http://x264dev.multimedia.cx/archives/8

    Nehalem optimizations: the powerful new Core i7 (Nov 2008)
    http://x264dev.multimedia.cx/archives/51

  20. Richard says:

    You need to include stdint.h before using uintptr_t.

  21. This is an interesting post. I did a related analysis a little while ago looking at the specific case of speeding up convolution (very much a real world example):
    https://hgomersall.wordpress.com/2012/11/02/speedy-fast-1d-convolution-with-sse/
    and code here:
    https://github.com/hgomersall/SSE-convolution

    Even with modern hardware (an i7-5600), there is a substantial improvement (~20%) when aligned loads are used in the inner loop, at least for SSE instructions, even when additional necessary array copies are factored in.

    In my example, I compared SSE initially but extended it to AVX (in the code), without repeating the aligned vs unaligned experiments; it turns out the benefit is not so apparent when going to 256-bit alignment from 128-bit alignment (the half alignment is good enough for AVX).

    1. Interesting. Can you point me directly to two functions where the only difference is that one uses aligned load SSE instructions and the other one uses unaligned load instructions? What I see in convolve.c is a function using unaligned loads, but it seemingly relies on a different (more sophisticated) algorithm.

      1. I extended my code a bit to test some additional cases.

        It seems that the results show that aligned SSE is important (compare `convolve_sse_partial_unroll` and `convolve_sse_in_aligned`) but aligned AVX makes very little difference (`convolve_avx_unrolled_vector` versus `convolve_avx_unrolled_vector_unaligned`).

        The fastest I can achieve is pretty close to the maximum theoretical throughput (about 90% of the peak clock speed * 8 flops), and that is with the unaligned load (the `convolve_avx_unrolled_vector_unaligned` case), which agrees with your assessment that alignment is pretty unimportant.

        It’s interesting that the SSE operations don’t benefit from the better AVX unaligned loads.
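        To make the comparison concrete, here is a minimal sketch (not the convolve.c code; just two dot-product kernels that differ only in the SSE load instruction used, assuming n is a multiple of 4 and, for the aligned version, 16-byte-aligned inputs):

            #include <stddef.h>
            #include <xmmintrin.h>

            float dot_aligned(const float *a, const float *b, size_t n) {
                __m128 acc = _mm_setzero_ps();
                for (size_t i = 0; i < n; i += 4)   /* a and b must be 16-byte aligned */
                    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
                float out[4];
                _mm_storeu_ps(out, acc);
                return out[0] + out[1] + out[2] + out[3];
            }

            float dot_unaligned(const float *a, const float *b, size_t n) {
                __m128 acc = _mm_setzero_ps();
                for (size_t i = 0; i < n; i += 4)   /* no alignment requirement */
                    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
                float out[4];
                _mm_storeu_ps(out, acc);
                return out[0] + out[1] + out[2] + out[3];
            }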

  22. Olumide says:

    I too did not really believe that misaligned data could significantly affect runtimes until I fixed the alignment of my data structures and the performance jumped by about 750% (from about 3.9 seconds to 0.04 seconds).

    What’s more interesting is that I fixed the alignment by changing from floats to doubles on an Intel i7 (64-bit), which made all my data structures and data (doubles) have the same 8-byte alignment (sometimes more is more). Using floats (4-byte aligned) meant that my data was unaligned with the other data structures. Alignment matters. Intel says so. Please stop suggesting otherwise.

    1. Can you share a code sample that shows that by changing data alignment, you are able to multiply by 100 the speed (from 4 s to 0.04 s)?

      Please keep the data types the same. Alignment and data types are distinct concerns.

  23. Stan Chang says:

    Coming from the embedded side with older processors such as MIPS32 and PPC, properly aligned data structures were the de facto standard.
    Seeing is believing: both test results (https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2012/05/31/test.cpp and https://www.ibm.com/developerworks/library/pa-dalign/) showed no noticeable differences on PPC e500v2 and Xeon E5-2660 v2, whether or not the data was aligned.
    However, per https://software.intel.com/en-us/articles/data-alignment-when-migrating-to-64-bit-intel-architecture, ‘The fundamental rule of data alignment is that the safest (and most widely supported) approach relies on what Intel terms “the natural boundaries.”‘
    Very interesting! Thanks for sharing.

  24. David Bailey says:

    Daniel,

    Thanks for that! I have long felt that alignment requirements spoiled computing because they aren’t just issues for compiler writers; they complicate general programming as well. For example, a structure containing 32-bit integers and doubles is no longer just a collection of adjacent items; gaps are inserted to maintain alignment.

    To me (I am a bit long in the tooth) they remind me of the way many architectures used to divide memory into segments. This structure was supposedly useful, but it gradually became clear that the disruption as you crossed a segment boundary was definitely not useful!

    I hope computer architectures soon evolve into being totally alignment free – including operations such as MULPD, which actually faults if the data isn’t 16-byte aligned, even though the individual data items are 8 bytes long!

  25. Ras says:
  26. Matthieu M. says:

    Note: unaligned access has an impact beyond performance, though.

    Specifically, the C and C++ Standards specify that unaligned access is Undefined Behavior. This, in turn, means that a C++ compiler can reasonably expect that if `int` is 4-byte aligned, then an `int*` has an address divisible by 4.

    At the very least, I remember an instance of gcc optimizing away the “misalignment checks” performed on an `int*`, which in turn resulted in accessing an array out-of-bounds (because it did not have a number of bytes divisible by 4).

    I would be very wary of using unaligned access directly in C or C++, unless specifically supported by the compiler (packed structures). Assembly can get away with it, but in C and C++ it’s dangerous.

    1. @Matthieu

      A very good point: it is potentially unsafe in C even if the underlying hardware is happy with unaligned loads and stores. But you can make it safe by calling memcpy which gcc will translate into a load on an x64 machine, without performance penalty.
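      For instance, a minimal sketch of that idiom:

          #include <stdint.h>
          #include <string.h>

          /* Well-defined even if p is misaligned; gcc and clang typically compile
             this to a single mov on x64, with no call to memcpy. */
          static inline uint64_t load_u64(const void *p) {
              uint64_t v;
              memcpy(&v, p, sizeof(v));
              return v;
          }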

  27. Steve says:

    Cross cache line locked instructions are a nightmare, particularly with multiple cores hitting the same cache line.

    1. @Steve

      Can you provide a benchmark?

      1. Steve says:

        Pastebin isn’t my favorite, but it’s late… Updated test.cpp: http://pastebin.com/EqtvpZeS

        Speaking of it being late, this may have, uh, errors in it. Inline asm isn’t really meant to be written half-asleep.

        Anyway, the results:
        “””
        processing word of size 4

        average time for offset 60 is 1.4

        average time for offset 61 is 102
        “””

        Superficially, I’d call this a factor of 70, but I distrust my lack of samples. Certainly it is at least an order of magnitude slowdown, and this is without inter-CPU conflicts.
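        A minimal sketch of the kind of test described above (hypothetical; GCC inline asm on x86-64, 64-byte cache lines assumed). Offset 60 keeps the 4-byte operand inside one cache line; offset 61 makes the locked read-modify-write straddle two lines:

            #include <stdalign.h>
            #include <stdint.h>
            #include <stdio.h>
            #include <time.h>

            int main(void) {
                static alignas(64) unsigned char buf[128];
                const long n = 10 * 1000 * 1000L;
                for (size_t offset = 60; offset <= 61; offset++) {
                    volatile uint32_t *p = (volatile uint32_t *)(buf + offset);
                    clock_t t0 = clock();
                    for (long i = 0; i < n; i++) {
                        uint32_t inc = 1;
                        /* atomic add; the lock prefix is what makes a split expensive */
                        __asm__ __volatile__("lock xaddl %0, %1" : "+r"(inc), "+m"(*p));
                    }
                    clock_t t1 = clock();
                    printf("offset %zu: %.3f s for %ld ops\n",
                           offset, (double)(t1 - t0) / CLOCKS_PER_SEC, n);
                }
                return 0;
            }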

        1. Steve says:

          Forgot to specify: CPUID says my CPU has 64-byte cache lines. Not sure this really changes on Intel CPUs often, but in case it has, it’s worth checking /proc/cpuinfo or whatever your platform provides.

          1. All x64 processors seem to have a 64-byte cache line.

        2. I had no trouble reproducing your results, thanks.

  28. Cev Ing says:

    This is a really silly test, because alignment cannot be tested with a C or C++ compiler: the compiler is free to align the data in any way it wants. The compiler produces only aligned instructions. This test can only be done in assembly language. Processor manufacturers explain in detail the penalty of unaligned accesses:

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDIJAFG.html

    Unaligned word or halfword loads or stores add penalty cycles. A byte aligned halfword load or store adds one extra cycle to perform the operation as two bytes. A halfword aligned word load or store adds one extra cycle to perform the operation as two halfwords. A byte-aligned word load or store adds two extra cycles to perform the operation as a byte, a halfword, and a byte.

      You are linking to the documentation of a 7-year-old ARM microcontroller processor, whereas my blog post explicitly addresses x86 processors.

      There are certainly microcontroller processors made today where unaligned loads are not an option. That’s not what my blog post is about.

      The compiler produces only aligned instructions. This test can only be done in assembly language.

      Keeping in mind that my blog post addresses x86 processors and not, say, microcontrollers, which instruction would that be? The compiler generates mov, which is what I’d use in assembly. What else would you use?

  29. doubleday says:

    >> because alignment can not be tested with a C or C++ compiler

    you might want to google `alignas` …
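    For instance (a minimal C11 sketch; alignas is also a C++11 keyword):

        #include <stdalign.h>
        #include <stdint.h>

        /* Request 64-byte alignment explicitly, regardless of the member types. */
        struct padded {
            alignas(64) uint64_t counter;
        };
        _Static_assert(alignof(struct padded) == 64, "64-byte alignment requested");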

  30. archimede says:

    For correct measurements:

    1) Taking the time has a cost. You need to repeat the experiment multiple times without stopping the timer, then divide by the number of repetitions.
    2) The vector size should fit in L1 cache. This seems OK in your code.
    3) The loop should be carried out with intrinsics that explicitly use aligned or unaligned load instructions that cross the cache line.

    1. archimede says:

      To be clearer: you need to use SIMD instructions!

  31. John Campbell says:

    Interesting to read your comments on the alignment “myth”. I have also not been able to reproduce performance delays due to alignment on i5 & i7 processors.
    I do wonder if 8-byte reals that span memory pages could have problems, but they are low probability events.
    Problems of 8-byte, 16-byte or 32-byte alignment are also posed for real*8 AVX calculations, although I also can’t demonstrate these.
    My conclusion is: by far the most significant issue for AVX performance is having the values in cache (which cache?), rather than their memory alignment.

    1. by far the most significant issue for AVX performance is having the values in cache (which cache?), rather than their memory alignment.

      Right. With AVX you can produce scenarios where alignment matters… but in real code, I think it can be safely ignored as an issue. It is unlikely that alignment is ever going to be a bottleneck.

  32. Marc Lehmann says:

    Indeed, newer Intel chips do not suffer from noteworthy alignment penalties.

    However, a lot of people who come here or refer to this article use it to make the (wrong) claim that you can make unaligned accesses in C or C++, and indeed, the C++ program presented here might crash due to illegal unaligned accesses, even on X86 or similar architectures.

    The reason is that compilers for those architectures can (and regularly do) assume proper alignment for types, and can take advantage of instructions that require alignment (which exist even on X86, mostly in the form of SIMD insns). The reason it almost always works is that compilers in the past rarely took advantage of these instructions, but this is rapidly changing, causing more and more unaligned accesses to crash.

    One correct way to make an unaligned access in C or C++ is using memcpy, i.e. instead of:

    val = *unaligned_uint64_t_ptr;

    You do:

    memcpy (&val, unaligned_uint64_t_ptr, sizeof (val));

    This works on any architecture (that has uint64_t :), and is usually optimized into an instruction guaranteed to support unaligned accesses on X86/X86_64, so it does not typically incur a speed penalty.

    1. This is correct, one should use memcpy to avoid undefined behavior and possible bugs.

      My code presented here is not technically correct though it does get the job done.

      1. Marc Lehmann says:

        That’s my point – your code only gets the job done with compiler extensions; specifically, the compiler must not optimize too much and must not take future CPUs into account. It is a bad example of how to take advantage of unaligned accesses, even if it is immaterial to the point you are making.

        Or to put it differently, since your code invokes undefined behaviour, it’s not actually proving your point; your measurement is flawed even if it happens to give correct results accidentally – but who knows what the code really does…

        1. I agree with you.

          In this instance, we have looked at the assembly code, so we know that it does what we think it does.

          If I had to redo this post and the code, I would definitely use memcpy.

  33. Guilherme says:

    Hi, I’ve run your tests and it seems that the unaligned version is actually faster than the aligned one (?). I am on a Mid 2015 MacBook Pro.

    ➜ data-alignment-test gcc -Ofast -o bench_alignment_unaligned bench_alignment_unaligned.c
    bench_alignment_unaligned.c:75:45: warning: format specifies type 'int' but the argument has type
    'unsigned long' [-Wformat]
    printf("result = %d, time = %d\n", result, ((end-start)/CLOCKS_PER_SEC));
    ~~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    %lu
    1 warning generated.
    ➜ data-alignment-test gcc -Ofast -o bench_alignment bench_alignment.c
    bench_alignment.c:73:45: warning: format specifies type 'int' but the argument has type
    'unsigned long' [-Wformat]
    printf("result = %d, time = %d\n", result, ((end-start)/CLOCKS_PER_SEC));
    ~~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    %lu
    1 warning generated.
    ➜ data-alignment-test ./bench_alignment
    result = -886424448, time = 38
    ➜ data-alignment-test ./bench_alignment_unaligned
    result = -889155136, time = 27

  34. Frank says:

    Generally speaking, I’ve always written C to be as portable as possible. It’s not hard to align data so I do so. Maybe my current microprocessor doesn’t care, but the next one to run this code might.

    Without writing an example program, I think I can sketch out a potential performance problem of unaligned data that suffers from “false sharing.” This is when two cores are trying to read and write the same cache line. Once one writes, it invalidates any other reader’s cache, forcing them to access main memory.

    For instance, let us make an array of cache-line-sized (64-byte) data objects, with a number of entries equal to the number of cores. Have each core read its object, and, treating it as a 64-byte int, increment it. Then write it back out. Have each thread do a billion increments.

    I imagine such a program might be 20-30x faster should the data be aligned on 64-byte boundaries, totally eliminating false sharing, than with any other alignment, which would guarantee it.

    I do understand this isn’t exactly the issue you’re discussing, but it is a fairly closely-associated one.
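    A minimal sketch of that experiment (hypothetical; pthreads, x86-64, 64-byte cache lines assumed; compile with -pthread). With an offset of 0, each 64-byte object coincides with a cache line; with an offset of one word, each object straddles two lines, so neighbouring threads fight over the shared line:

        #include <pthread.h>
        #include <stdalign.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define NTHREADS 4
        #define ITERS (10 * 1000 * 1000L)

        static alignas(64) uint64_t storage[8 * (NTHREADS + 1)]; /* slack for the offset */

        static void *worker(void *arg) {
            volatile uint64_t *obj = arg;           /* 8 x uint64_t = one 64-byte object */
            for (long i = 0; i < ITERS; i++)
                for (int w = 0; w < 8; w++)
                    obj[w]++;                       /* rewrite the whole object */
            return NULL;
        }

        int main(int argc, char **argv) {
            size_t offset_words = (argc > 1) ? (size_t)atoi(argv[1]) : 0; /* 0 or 1 */
            pthread_t threads[NTHREADS];
            for (int t = 0; t < NTHREADS; t++)
                pthread_create(&threads[t], NULL, worker, &storage[offset_words + 8 * t]);
            for (int t = 0; t < NTHREADS; t++)
                pthread_join(threads[t], NULL);
            puts("done");
            return 0;
        }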

  35. Jon Ross says:

    I ran Laurent Gauthier’s counter-example, and the results no longer show a difference. At least not on this specific CPU, with GCC 11.


    #> head /proc/cpuinfo | grep "model name"
    model name : Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

    #> ./bench_alignment
    result = -1223964992, time = 33
    #> ./bench_alignment_unaligned
    result = 490453024, time = 30

  36. Demindiro says:

    (Apologies if I posted this comment twice, it wasn’t obvious to me whether it actually got submitted without Javascript enabled)

    I have a Ryzen 2700X, and while I only sometimes observe penalties when reading from an unaligned address (be it SIMD or not), there is a potentially severe penalty when writing unaligned data.

    I use the following test program (compiled with gcc -nostartfiles main.s):

    .intel_syntax noprefix
    .globl _start

    .section .text._start
    _start:

    mov rcx, 1000 * 1000 * 1000
    lea rax, [rip + p + 0]

    2:
    #vmovdqu xmm0, [rax]
    #vmovdqu [rax], xmm0
    #movdqu xmm0, [rax]
    #movdqu [rax], xmm0
    #mov edi, [rax]
    mov [rax], edi

    dec rcx
    jnz 2b

    mov eax, 60
    xor edi, edi
    syscall
    ud2

    .section .bss
    .p2align 12
    .zero 4096 - 128
    p: .zero 64 # cache boundary
    q: .zero 64 # page boundary
    .zero 64

    These are some of the results I get:

    – mov [rax], edi & p + 0: 264.23 msec
    – mov [rax], edi & p + 1: 265.35 msec
    – mov [rax], edi & p + 5: 272.88 msec
    – mov [rax], edi & p + 62: 1225.18 msec
    – mov [rax], edi & p + 63: 1227.44 msec
    – mov [rax], edi & q + 62: 5758.66 msec
    – mov [rax], edi & q + 63: 5735.20 msec

    So a non-SIMD store within a cache line does not seem to have any penalty, but crossing a cache boundary imposes a heavy penalty (~5x). Crossing a page boundary imposes a devastating penalty (~21x).

    When using SIMD instructions alignment matters even within a cache line:

    – movdqu [rax], xmm0 & p + 0: 271.28 msec
    – movdqu [rax], xmm0 & p + 8: 517.36 msec
    – movdqu [rax], xmm0 & p + 4: 500.24 msec
    – movdqu [rax], xmm0 & p + 2: 1230.46 msec
    – movdqu [rax], xmm0 & p + 1: 1233.34 msec
    – movdqu [rax], xmm0 & p + 63: 1249.21 msec
    – movdqu [rax], xmm0 & q + 63: 5692.39 msec

    When reading data a penalty may be observed when loading many times per iteration:

    – vmovdqu xmm0, [rax] & p + 0: 272.77 msec
    – vmovdqu xmm0, [rax] & q + 63: 275.50 msec
    – 3x vmovdqu xmm0, [rax] & p + 0: 385.73 msec
    – 3x vmovdqu xmm0, [rax] & p + 1: 389.73 msec
    – 3x vmovdqu xmm0, [rax] & p + 63: 745.43 msec
    – 3x vmovdqu xmm0, [rax] & q + 63: 750.51 msec

    So while alignment may be disregarded when reading data on modern CPUs, it is still very important to align data when writing.