Daniel Lemire's blog


The cost of runtime dispatch

11 thoughts on “The cost of runtime dispatch”

  1. Evan says:

    FWIW, on ARM you generally can’t use the equivalent of the Intel CPUID instruction in unprivileged code. On Linux you’re supposed to use getauxval(AT_HWCAP) and/or getauxval(AT_HWCAP2) (depending on which feature(s) you want to check for), which is obviously going to be a lot slower. I have no idea what you’re supposed to do on Windows.
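
    A minimal sketch of the getauxval approach, assuming Linux with a C library that ships <sys/auxv.h>. The helper names cached_hwcap and has_asimd are hypothetical. On AArch64 the HWCAP_ASIMD macro is expected to come from that header; elsewhere the sketch simply reports false. Caching the value in a function-local static means the lookup cost is paid once.

    ```cpp
    #if defined(__linux__)
    #include <sys/auxv.h>  // getauxval, AT_HWCAP
    #endif

    // Hypothetical helper: read AT_HWCAP once and cache the result.
    // Returns 0 on platforms where getauxval is not available.
    inline unsigned long cached_hwcap() {
    #if defined(__linux__)
      static const unsigned long hwcap = getauxval(AT_HWCAP);
      return hwcap;
    #else
      return 0;
    #endif
    }

    // Hypothetical helper: true if the kernel reports Advanced SIMD (NEON)
    // on AArch64; false on other targets or if the macro is missing.
    inline bool has_asimd() {
    #if defined(__aarch64__) && defined(HWCAP_ASIMD)
      return (cached_hwcap() & HWCAP_ASIMD) != 0;
    #else
      return false;
    #endif
    }
    ```

    Other feature bits would be tested the same way, with AT_HWCAP2 read through a second cached helper.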

    I have some code in portable-snippets for both x86 and ARM (on Linux); it’s not my best work, but it is functional. If you want something a bit beefier, Google’s cpu_features library is probably your best bet right now, but integrating it is a bit of a pain unless you are already using CMake (and it’s a bit of a pain if you are already using CMake because, well, you’re using CMake ;)).

    If you don’t have to worry about supporting multiple compilers, there are lots of interesting options out there. GCC has the target_clones attribute. Clang has cpu_dispatch (I think ICC does too, but I’m not certain). Unfortunately, stuff like that doesn’t work if you’re using preprocessor directives to switch between different implementations, and AFAIK MSVC doesn’t have anything similar.
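
    As a rough illustration of target_clones, assuming GCC on x86-64 Linux with glibc (the attribute relies on ifunc support): the function name sum_bytes is hypothetical, and on any other toolchain the guard below falls back to a single ordinary definition.

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // Assumption: GCC (not Clang), x86-64, glibc. Elsewhere the attribute is
    // skipped and only one version of the function is compiled.
    #if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
    __attribute__((target_clones("avx2", "default")))
    #endif
    std::uint64_t sum_bytes(const std::uint8_t *data, std::size_t n) {
      // The same source is compiled once per listed target; the loader picks
      // the clone that matches the running CPU.
      std::uint64_t total = 0;
      for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
      }
      return total;
    }
    ```

    Callers just call sum_bytes; there is no explicit CPUID check in user code.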

    I think the much more interesting, and important, question is where in the code to do the runtime dispatching. Doing it at too low of a level means you’re performing a lot of extra checks and hurting the compiler’s ability to optimize. Doing it at too high a level means you end up with a lot of bloat. In my experience, if the cost of the check is a concern you should probably move it up a bit.

    For example, one question I get about SIMDe pretty often is whether it does dynamic dispatch. It would be very convenient, but it would also be absolutely devastating for performance. I’d be interested to hear about your experience with where to put the dynamic dispatch code in simdjson and why.

    1. I think that simdjson has a different design issue with respect to runtime dispatching than SIMDe because we can easily hide away the runtime dispatching without effort. Our user-facing API has few entry points.
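
       To make the "few entry points" idea concrete, here is a minimal sketch, not simdjson's actual code. All names are hypothetical, and count_spaces_wide is a plain C++ stand-in for a real SIMD kernel. The check happens once, at the entry point, and the kernel then runs over the whole buffer without further checks.

       ```cpp
       #include <cstddef>

       using count_fn = std::size_t (*)(const char *, std::size_t);

       // Portable baseline kernel.
       static std::size_t count_spaces_scalar(const char *s, std::size_t n) {
         std::size_t count = 0;
         for (std::size_t i = 0; i < n; ++i) count += (s[i] == ' ');
         return count;
       }

       // Stand-in for a SIMD kernel (here just unrolled by four).
       static std::size_t count_spaces_wide(const char *s, std::size_t n) {
         std::size_t count = 0, i = 0;
         for (; i + 4 <= n; i += 4) {
           count += (s[i] == ' ') + (s[i + 1] == ' ') +
                    (s[i + 2] == ' ') + (s[i + 3] == ' ');
         }
         for (; i < n; ++i) count += (s[i] == ' ');
         return count;
       }

       // Hypothetical probe: uses the GCC/Clang builtin on x86, false elsewhere.
       static bool cpu_has_avx2() {
       #if (defined(__GNUC__) || defined(__clang__)) && \
           (defined(__x86_64__) || defined(__i386__))
         return __builtin_cpu_supports("avx2") != 0;
       #else
         return false;
       #endif
       }

       // Single user-facing entry point: the kernel is chosen on the first call
       // and reused afterwards, so the per-call cost is one indirect call.
       std::size_t count_spaces(const char *s, std::size_t n) {
         static const count_fn kernel =
             cpu_has_avx2() ? &count_spaces_wide : &count_spaces_scalar;
         return kernel(s, n);
       }
       ```

       Dispatching inside the per-byte loop instead would repeat the check for every element and keep the compiler from optimizing across it.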

  2. Stelian Ionescu says:

    GCC has support for function multiversioning, which allows the runtime linker to select the best function and avoids all the calls to CPUID.

    1. It is unlikely to avoid “all calls to CPUID”. Or do you mean that it removes them from your own code?

      1. Stelian Ionescu says:

        Yes, I was referring to the CPUID calls required by runtime dispatch. If the load-time CPU dispatch afforded by the toolchain does the job, it seems like a more maintainable solution.

        1. I agree!!! But I don’t think you take it far enough.

          Support should be in the programming language itself.

          1. Stefan Brüns says:

            C++ is specified for an abstract virtual machine.

            CPU architectures are an implementation detail of the compiler.

  3. Dmitrii Vedenko says:

    CPUID is a serializing instruction, which makes benchmarking it rather useless and uninformative. I.e., it waits until all previous instructions have fully completed, which can be a major performance issue on modern out-of-order CPUs.

  4. When using C++11 or higher, it’s sufficient to do something like

    static unsigned int cpuid = getcpuid();

    Doing this inside any function (as well as in global scope, although that can be prone to issues with static ctor order) is guaranteed to be thread-safe, so there’s no need to roll your own atomic-based construct for this.
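
    A small sketch of that guarantee, under the assumption that fake_getcpuid stands in for a real getcpuid (all names here are hypothetical): several threads race on the first call, and the counter shows that the initializer ran only once.

    ```cpp
    #include <atomic>
    #include <thread>
    #include <vector>

    // Counts how many times the (fake) CPU probe actually executes.
    static std::atomic<int> probe_calls{0};

    // Hypothetical stand-in for getcpuid(); returns an arbitrary constant.
    static unsigned int fake_getcpuid() {
      probe_calls.fetch_add(1);
      return 0x2au;
    }

    // The function-local static is initialized exactly once, even when
    // several threads reach this line at the same time (C++11 and later).
    unsigned int cached_cpuid() {
      static const unsigned int cpuid = fake_getcpuid();
      return cpuid;
    }

    // Calls cached_cpuid() from several threads, then reports how often
    // the probe ran.
    int probe_calls_after_race(int thread_count) {
      std::vector<std::thread> threads;
      for (int i = 0; i < thread_count; ++i) {
        threads.emplace_back([] { (void)cached_cpuid(); });
      }
      for (auto &t : threads) t.join();
      return probe_calls.load();
    }
    ```

    On older toolchains this needs to be built with -pthread.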

    1. Certainly, there is no need to synchronize getcpuid itself, but whatever work you do following getcpuid also needs to be made thread-safe.

      1. Stefan Brüns says:

        You can put the whole initialization code into a lambda, and use it to initialize a static variable once. That *is* thread safe in C++11.

        ```
        const implementation *detect_best_supported() noexcept {
          // The lambda runs exactly once: C++11 guarantees that the
          // initialization of a function-local static is thread-safe.
          static const implementation *best = []() -> const implementation * {
            uint32_t supported_instruction_sets = detect_supported_architectures();
            for (const implementation *impl : available_implementation_pointers) {
              uint32_t required_instruction_sets = impl->required_instruction_sets();
              if ((supported_instruction_sets & required_instruction_sets) ==
                  required_instruction_sets) {
                return impl;
              }
            }
            return &legacy_singleton;  // fallback when nothing else matches
          }();  // invoke the lambda immediately
          return best;
        }
        ```