Daniel Lemire's blog


How fast can you bit-interleave 32-bit integers?

20 thoughts on “How fast can you bit-interleave 32-bit integers?”

  1. Evan says:

    Your post makes it sound like PDEP and PEXT are available on all AMD64 CPUs, but that’s not the case; they’re in BMI2, which is an optional extension to AMD64. IIRC they appear in ≥ Haswell (2013) on the Intel side, and Excavator (2015) on AMD, so you need a pretty new CPU.

    Reading this post, people may think to use the `__amd64` macro (or `_M_AMD64` for VS), but `__BMI2__` is the right one on GCC-like compilers (I don’t know that VS has one). Run-time detection is a bit more complicated if you want to do it portably, but if all you care about is GCC 5.1+: `__builtin_cpu_init()` and `__builtin_cpu_supports("bmi2")`.
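
    For illustration, here is a minimal run-time dispatch sketch along those lines, assuming GCC 5.1+; the two interleave_* helpers are hypothetical placeholders for a BMI2 path and a portable fallback:

    #include <stdint.h>

    uint64_t interleave_pdep(uint32_t x, uint32_t y);     // hypothetical BMI2 (pdep) path
    uint64_t interleave_portable(uint32_t x, uint32_t y); // hypothetical portable fallback

    uint64_t interleave_dispatch(uint32_t x, uint32_t y) {
        // __builtin_cpu_init() must run before __builtin_cpu_supports();
        // calling it on every invocation is wasteful but correct.
        __builtin_cpu_init();
        if (__builtin_cpu_supports("bmi2"))
            return interleave_pdep(x, y);
        return interleave_portable(x, y);
    }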

    1. Everything you wrote is correct, I believe.

    2. Kendall Willets says:

      IIRC they’re also excruciatingly slow on Ryzen.

      1. I see this. 18 cycles of latency?

        Probably AMD has figured out that these instructions only rarely occur in live binaries.

        1. Travis Downs says:

          Yes, I bet it has a microscopic presence in binaries today because (a) it was introduced relatively recently in the BMI2 extension (Haswell) and (b) the operation is pretty obscure: it is very unlikely a compiler would ever generate this instruction when compiling – use will pretty much be entirely from hand-written asm or intrinsic calls as in your example.

          That said, this (and pext) are my two favorite newly introduced scalar instructions: they are very powerful in that they are hard to emulate (that’s why it takes 18 cycles!), unlike many of the rest of the BMI instructions, which are often fairly trivial combinations of a few primitive operations (popcnt, lzcnt and tzcnt are the other “hard to emulate” exceptions, but the latter two are just slight tweaks of the existing bsf and bsr). Once you have them in mind, you find that they have a lot of use in various bit-manipulation operations, as you’ve found here with interleaving.
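
          For reference, the pdep-based interleave presumably looks something like this minimal sketch (my reconstruction, not necessarily the exact code from the post; it requires BMI2 and -mbmi2):

          #include <stdint.h>
          #include <immintrin.h>

          // Spread the bits of x into the even positions and the bits of y into
          // the odd positions of a 64-bit result.
          uint64_t interleave_pdep(uint32_t x, uint32_t y) {
              return _pdep_u64(x, 0x5555555555555555ULL)
                   | _pdep_u64(y, 0xAAAAAAAAAAAAAAAAULL);
          }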

          AMD is taking a bit of a risk here, IMO – it is possible that at least some interesting application will make use of these instructions in a critical loop, and the 18x worse performance (throughput) will come back to bite them. I can understand not implementing them though: it requires a novel and still somewhat expensive circuit to implement quickly (see here for example http://ieeexplore.ieee.org/abstract/document/4019493 ). In particular, it won’t have much overlap with other ALU circuits.

          AMD actually does really well on most of the other interesting BMI instructions: they can do 4 popcnt or lzcnt instructions per cycle, with a latency of 1 cycle, versus Intel’s 1 per cycle and 3 cycle latency. This is probably a reflection of the general observation that AMD’s ALUs are more homogeneous.

          Developers are in a bit of a catch-22 now: should they use pdep/pext at all? At least before Ryzen, no AMD machines supported it, so if it was present at all, it was fast. Now you have Ryzen machines with BMI2 which nominally support it, but it is terribly slow. Do you now make three code paths, one for non-BMI, one for Intel-BMI (fast pdep) and one for AMD-BMI (slow pdep)? Maybe you just force AMD to always use the slow non-BMI path that you probably already have? Some options have good outcomes for AMD strategically, and some bad. No matter what, I think this will further slow the use of these interesting instructions, which is unfortunate.

          1. they can do 4 popcnt or lzcnt instructions per cycle, with a latency of 1 cycle, versus Intel’s 1 per cycle and 3 cycle latency

            That’s amazing!

            1. Travis Downs says:

              Yes, it’s quite nice: you usually don’t need “so much” throughput for these operations, but sometimes actually being able to count bits very quickly helps.

              In general people have noted that the Ryzen execution units are very homogeneous: many or most operations execute on all 4 integer ALUs (exceptions include multiplication, which is expensive in area). On Intel, common operations like add, sub, and bitwise operations execute on all 4 ALUs, but many other operations are restricted to 1 or 2 ALUs. This is probably an effort to optimize chip area. Usually it’s not a bottleneck, but it can be.
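
              As an illustration of where that popcnt throughput pays off, here is a minimal sketch (my example, not from the discussion): counting the set bits across a large buffer, where every popcount is independent and only throughput matters.

              #include <stddef.h>
              #include <stdint.h>

              // Each iteration's popcount is independent of the others, so the loop is
              // limited by popcnt throughput, not latency. __builtin_popcountll compiles
              // to the hardware popcnt instruction when popcnt support is enabled.
              uint64_t count_bits(const uint64_t *data, size_t n) {
                  uint64_t total = 0;
                  for (size_t i = 0; i < n; i++)
                      total += (uint64_t)__builtin_popcountll(data[i]);
                  return total;
              }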

  2. Travis Downs says:

    When benchmarking very short operations like this, I think it would be good to clarify whether the numbers you report are the operation *latency* (the time a single operation takes from input to output) or something else, like inverse throughput.

    I think the lay definition of “how long something takes” corresponds more to latency, since we usually have a mental model of starting a clock, running an operation, stopping the clock, and observing the elapsed time. It’s why we say it takes 9 months to make a baby, even though the “throughput” is higher if you have multiple expectant mothers, or are having twins (it’s still 9 months, not 4.5 months/twin).

    When analyzing the raw instructions that make up a function, that distinction is very important: pdep, for example, has a latency of 3 cycles: it takes 3 cycles to complete. However it has an “inverse throughput” of 1 cycle: you can start a new pdep each cycle and each will complete 3 cycles later, but the total throughput is 1 per cycle.

    Your numbers here are “inverse throughput” – indeed you can’t use pdep to produce a result faster than 3 cycles, so it’s not possible you are measuring latency. I’m guessing the pdep technique will have a latency of 5 cycles: the critical path flows through whichever pdep instruction starts second (remember, only one starts per cycle), and then the final OR of the results.
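
    To make the 5-cycle estimate concrete, here is a two-pdep interleave sketch (my reconstruction, not necessarily the post’s exact code) annotated with that timing, assuming a 3-cycle pdep latency, one pdep issued per cycle, and a 1-cycle OR:

    #include <stdint.h>
    #include <immintrin.h>

    uint64_t interleave_pdep(uint32_t x, uint32_t y) {
        uint64_t even = _pdep_u64(x, 0x5555555555555555ULL); // issues cycle 0, result ready cycle 3
        uint64_t odd  = _pdep_u64(y, 0xAAAAAAAAAAAAAAAAULL); // issues cycle 1, result ready cycle 4
        return even | odd;                                   // OR issues cycle 4, ready cycle 5
    }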

    So what is actually important, latency or throughput? Well it depends on the surrounding code, of course! If you’re just trying to find the interleaving of a large array of numbers, each operation is independent, and throughput will matter (and the SIMD solutions will make sense). If the operation is part of some critical path (e.g., you use this value to do a lookup in a hash map and then proceed to do further calculations on the result) then latency matters.

    Conventionally, compiler and performance analysis people have seemed to think latency is “usually” the one that matters for general purpose code, based on the observation that most code is bottlenecked on one or more dependency chains rather than by throughput limitations like the processor width or port utilization. Personally, I don’t usually make a call either way: it’s best to report both numbers and “let the user choose”. Often the best “latency optimized” function is different from the best “throughput optimized” function!

    Finally, note that the distinction between latency and (inverse) throughput largely goes away when describing a “large” function (usually one that has a longish loop inside). As the total latency grows the relative amount of overlap between parallel executions of the function goes to zero and so latency and throughput become the same thing (but conceptually, I think, people are usually reporting “latency” for such functions: the time one call takes).

    1. When benchmarking very short operations like this, I think it would be good to clarify whether the numbers you report are the operation *latency* (…) or something else, like inverse throughput.

      You are correct, but I think my post did hint that I was measuring throughput (here: “Suppose that you have a bunch of data points, is it worth it to use the fancy x64 instructions?”). That is, I refer to a batch interleave of many data points.

      You can compute the minimal latency from instruction latencies, port availability and data dependencies… (sadly, it is getting harder to do, not easier) but then the latency won’t be the same when the function gets inlined within a tight loop as when it gets called once. And if it is inlined in a tight loop, then you probably don’t care about the latency.

      So you have to refer to the latency of the function when it gets called once in a blue moon without inlining… That’s clearly dozens of CPU cycles… I am not sure how to measure it.

      1. Travis Downs says:

        I don’t agree about inlining: whether latency matters depends basically entirely on how you are using it (i.e., how the interleaving interacts with the surrounding code).

        Even if you are doing it “many” times, the latency is what matters if there is a critical dependency chain in your code that forms the critical path. This is the case *more often than not*, and is the main reason why you’ll usually see an IPC of ~1 on most code when the CPU can in principle sustain an IPC of 4.

        For example, let’s say you use the result of the interleave operation to look up in a structure the next value to interleave. In this case each interleave is dependent on the prior one (admittedly, I just made this up, but such things are common in practice, even if not with bit-interleaving), so the latency matters. This is all independent of inlining: inlining doesn’t break the fundamental data dependency that exists.

        In fact, inlining is totally orthogonal to the latency vs throughput discussion: the CPU has no problem running multiple functions in parallel, even if not inlined, so if you have a fundamentally parallel operation, it will be parallelized even when it is wrapped in function calls (the mechanics of call/ret do add a serial dependency on the top of the stack though, so if your functions are shorter than about 4 cycles, you’ll hit this limit).

        We don’t even need to be talking about “functions” – that’s a convenient term, but they can also just be snippets of assembly placed end-to-end in the binary or whatever. Conceptually it’s a function f(a, b) -> c, but I don’t necessarily mean a “call” instruction is involved.

        The good news though is that it is very easy to measure the latency of a “function” (or operation) – use the same framework you used to measure throughput, but ensure the output of one interleave operation is fed into the input(s) of the next. Then you’ll get a latency measurement.

        This also raises an interesting observation for functions with more than one input: the latency can be different depending on which input(s) the dependency “flows” through. Here the arguments are symmetric, so the time will be similar, but in some cases some arguments might be part of a longer dependency chain. A reasonable, simple approach is to make *all* inputs dependent on the output.
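
        A sketch of that latency-measurement idea, assuming some interleave(x, y) function under test (name hypothetical): the previous result is split and fed back into both inputs, so the measured time per iteration reflects latency rather than inverse throughput.

        #include <stdint.h>

        uint64_t interleave(uint32_t x, uint32_t y); // function under test (assumed)

        uint64_t latency_chain(uint64_t iters) {
            uint64_t z = 1;
            for (uint64_t k = 0; k < iters; k++) {
                // both arguments depend on the previous iteration's output
                z = interleave((uint32_t)z, (uint32_t)(z >> 32));
            }
            return z; // time the loop and divide by iters to get the per-call latency
        }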

        1. is the main reason why you’ll usually see an IPC of ~1 on most code

          Your statement is likely correct… but I submit to you that hot code runs at more than 1 instruction per cycle, typically.

          So if you take, say, a matrix multiplication routine, then it should run at much more than 1 instruction per cycle.

          Often, I get low instruction counts per cycle when there are hard to predict branches… or cache misses…

          One exception to this is integer division…

          The good news though is that it is very easy to measure the latency of a “function” (or operation) – use the same framework you used to measure throughput, but ensure the output of one interleave operation is fed into the input(s) of the next. Then you’ll get a latency measurement.

          So you are thinking about a function that is in a critical path, that is, it will receive its data late… and the processor won’t be able to do anything until the function completes its work.

          In a lot of the work I do, the processor can reorder the work so that this does not happen.

          Optimizing compilers can also do crazy things. I am confident that you will agree that the story changes dramatically in the code below when I remove the noinline attribute:

          __attribute__ ((noinline))
          uint64_t somefunction(uint64_t a) {
              a /= 3;
              return a;
          }
          

          uint64_t somefunctionloop(uint64_t a) {
              uint64_t z = a;
              for (int k = 0; k < 32; k++)
                  z = somefunction(z);
              return z;
          }

          I’m interested in continuing this discussion, but I think it would be most interesting with concrete code examples, so that I can better understand what you have in mind.

  3. Travis Downs says:

    Well when I say “most code runs at 1 IPC” I mean that, when measured, the average IPC over the entire run of many interesting programs tends to hover around 1. So that metric is inherently “execution time weighted” – i.e., the numerator is the total dynamic instruction count and the denominator is the dynamic cycle count.

    That inherently already weights the measurement to hot code. I wasn’t claiming for example that most functions have an IPC of 1 or that weighted by code size typical IPC is 1 – that’s not interesting as you point out.

    I have no doubt matrix multiply has a much higher IPC than 1 – that’s a classic mostly-embarrassingly parallel problem, after all. In practice it pretty much scales directly with FMADD throughput (as long as there are enough registers to hide the latency).

    In general, HPC codes probably tend to have high IPC (at least when they aren’t bandwidth limited) and also to see a bigger uplift from SIMD.

    I’m not thinking in terms of functions at all really: functions are a higher-level language concept – they mostly “disappear” in the instruction stream seen by the CPU after all – perhaps there is a trace in terms of a call/ret pair, or a jmp, or nothing at all if inlined, but the CPU doesn’t work “a function at a time” – it can overlap the work of many functions, and even when one function depends on an earlier one, it could start work on parts of the second function that don’t depend on the first.

    You are right that IPC below the “max” has a lot of possible causes, such as cache misses and branch mispredicts (there is a great paper, which I haven’t been able to find again, that breaks down exactly all the factors that go into less-than-max IPC for a variety of bottlenecks, using Intel’s TopDown methodology IIRC) – but some of those can still be characterized as latency problems!

    For example, if your application is heavily impacted by branch mispredicts, a huge lever on performance will be how fast you can get to the “next” branch, so that it can be resolved and you can get back on the right path. The penalty for even dense branch mispredicts can be zero if you get to them fast enough and resolve them early, such that the front-end stays ahead of the backend-limited critical path (this isn’t theoretical, you can easily show this in a benchmark).

    …. all that said, yes your example is exactly how you’d test the latency of some function. Yes, optimizing compilers can do crazy things, but they won’t be able to break *true* data dependencies. Fake benchmarks often have the problem that they are written in a way such that there aren’t true dependencies, so optimization may indeed change the nature of the problem. Overall though I think optimization is orthogonal to this discussion.

    In the specific example you gave, I agree that optimization _could_ change the behavior entirely, because a smart optimizer could entirely unroll the loop and inline the call so you end up with z = a / 3 / 3 / … / 3, which in turn is simplified to z = a / (3 * 3 * … * 3), i.e., a single division by 1853020188851841 (that is, 3^32), and indeed clang does more or less that. So the whole complexity of the problem was more or less brought down to zero via mathematical transformation – but how does that inform the discussion regarding latency and throughput? The example is “too fake” unfortunately. Another way to look at it is that my claims are based on the final “after optimization” stage of the compiler…

    I can put some code examples, but how do you format code on this blog?

    1. The example is too fake unfortunately.

      I was teasing you.

      I can put some code examples, but how do you format code on this blog?

      I use tohtml.com myself to convert code to HTML.

  4. Travis Downs says:

    I see, HTML is allowed.

    Let’s take a quick look at a very slightly tweaked example, where the compiler can’t remove the substance of the benchmark, and compare “latency”- and “throughput”-measuring versions:

    static inline uint64_t somefunction(uint64_t a) {
        return 10000 / a;
    }

    uint64_t div64_lat(uint64_t iters, void *arg) {
        uint64_t z = (uintptr_t)arg;
        for (uint64_t k = 0; k < iters; k++)
            z = somefunction(z);
        return z;
    }

    uint64_t div64_tput_templ(uint64_t iters, void *arg) {
        uint64_t z = 0;
        for (uint64_t k = 0; k < iters; k++)
            z += somefunction(k + 1);
        return z;
    }

    The main difference is that somefunction has been flipped so the non-constant is now the denominator. This prevents optimization (in practice, today) because compilers don’t perform strength reduction to multiplication when the denominator is variable (and the strength reduction was the starting point for clang to remove the loop entirely). Also, the number of iterations is passed as a parameter, to make it more convenient for me to benchmark.

    There are now two variants of the looping function which repeatedly calls somefunction: a “lat[ency]”-bound version and a “t[hrough]put”-bound version. They are very similar except that the latency version feeds the result from the prior iteration into the argument for the next (as in your original example), while the throughput version sums the result from each iteration, but uses the loop counter as input.

    At the assembly level they are even more similar! The latency version looks like:

    .L4:
        mov rax, r8
        xor edx, edx
        add ecx, 1
        div rdi
        mov rdi, rax
        cmp esi, ecx
        jne .L4

    … while the throughput version is:

    .L13:
        mov rax, r8
        xor edx, edx
        div rcx
        add rcx, 1
        add rdi, rax
        cmp rsi, rcx
        jne .L13

    The same number of instructions (7), almost exactly the same instructions (the latency version has one mov where the throughput version has an add), and the same general structure. Of course, the effect is quite different since the source functions are quite different. The key difference is that the output of the div in the latency version feeds back into the input of the next div, while in the throughput version it doesn’t.

    Let’s benchmark them, with both an inlined version (as shown above) and a non-inlined version. In the latter case there will be a call instruction in the loop, calling the out-of-line somefunction like this:

    <somefunction_noinline(unsigned long)>:
        xor edx,edx
        mov eax,0x2710
        div rdi
        ret

    ; latency main loop
        mov rdi,rax
        call 88 <somefunction_noinline(unsigned long)>
        add rcx,0x1
        cmp r8,rcx
        jne 80 <div64_lat_noinline(unsigned long, void*)+0x10>

    ; throughput main loop
        add rcx,0x1
        mov rdi,rcx
        call bc <somefunction_noinline(unsigned long)>
        add rsi,rax
        cmp r8,rcx
        jne b0 <div64_tput_noinline(unsigned long, void*)+0x10>

    Again, the complexity is almost the same – the throughput version suffers a single extra add in the main loop in this case. Both loops call the same function, which does the division.

    Let’s time these 4 variants:


    Benchmark                       Cycles   UOPS_I
    Dependent inline divisions       37.23    41.38
    Dependent 64-bit divisions       37.27    44.38
    Independent inline divisions     25.54    41.38
    Independent divisions            26.39    45.39

    First, note that inlining or not made almost no difference – the latency version was virtually identical, and the throughput version was slower by less than a cycle (about 3%). The second column shows the micro-ops executed, and you can see that the no-inline versions executed 3 or 4 more uops compared to the inline ones, about as expected.

    The latency version is considerably slower than the throughput version, despite nearly identical instructions, because div is only partially pipelined: it has a latency larger than its inverse throughput. A 64-bit div is actually quite complicated since it is broken down into a long stream of micro-ops, so it’s hard to analyze exactly, but if you replaced it with a single-uop multiply instead, you’d expect a 3-to-1 ratio in the times of the two mechanisms.

    This is all down to the CPU executing things in parallel when it can, and also “seeing through” the function call in the non-inlined case: for certain types of code it doesn’t have to wait for one function to finish before starting the next (indeed, you might have parts of 10 or more functions all running at once).

    The tests themselves can be found here:

    https://github.com/travisdowns/uarch-bench/commit/69b28c0d2c26ca17b33fa823a48f41b2d672688e#diff-a78c8aeb3e451bfb766f448d0529d873

    A godbolt link to play with:

    https://godbolt.org/g/ztqjTM

    1. Travis Downs says:

      Somehow all my carefully constructed HTML blocks (starting with pre tags) just show up as plain text. Is there an edit function (or a preview function)?

      Oddly the first time I submitted I got an internal server error, but I was able to save my work via the back button and post. I’m not sure if that was related.

    2. I think it is a fantastic example.

  5. Travis Downs says:

    test (this can be deleted)

    Benchmark                       Cycles   UOPS_I
    Dependent inline divisions       37.23    41.38
    Dependent 64-bit divisions       37.27    44.38
    Independent inline divisions     25.54    41.38
    Independent divisions            26.39    45.39

  6. Travis Downs says:

    It doesn’t seem like the blog accepts/processes <pre> tags? tohtml uses them for all of its output, but based on the test they don’t work (for me). Feel free to delete the posts above.

    1. My blog runs on a stock wordpress engine. You are correct, it seems, that pre tags are not allowed in comments.

      I once tried WYSIWYG plugins for the comments, and it ended up being a nightmare because people would complain that it does not work in their browser, and so forth.

      I’ll try again.

      1. So I enabled a MarkDown-based syntax thingy for comments. It seems to support code samples:

        for (int i = 0; i < 10; i++) {
          printf("hello");
        }