Daniel Lemire's blog


Avoid exception throwing in performance-sensitive code

18 thoughts on “Avoid exception throwing in performance-sensitive code”

  1. E.S. says:

    To make the cost of exceptions more explicit, consider Java. When an exception is thrown, a full backtrace is stored. This involves quite some memory allocation.

    Now if you are in a language with elegant and efficient optionals such as Rust, you can get some very nice code to handle nonstandard cases.
    But things get trickier when you need to, e.g., parse numbers from a stream. What if you expect a number, but its syntax is invalid? In Java, the method can then no longer return a native int; you would need to trust the HotSpot compiler’s escape analysis to eliminate all overhead from returning a doubly boxed Optional.
    Some tools (e.g., ELKI https://github.com/elki-project/elki/blob/146bcb9fbc428e9c4bccdfde3e6f17ae18a38ebd/elki-core-util/src/main/java/elki/utilities/io/ParseUtil.java#L162 ) use preallocated exceptions without a stack trace in performance-critical paths, where exceptions are somewhat needed to avoid other overheads (boxing) and a lot of boilerplate code to unbox and handle the errors yourself.
    I don’t know if something similar would be possible in C?
    The Rust solution here seems very nice and elegant to me.

    1. do NOT use Java or any other interpreted language.
      do NOT implement on any architecture which does not have hardware support for integer underflows or overflows.
      do NOT use IEEE floating point EVER when the number can underflow or overflow the IEEE values, unless you REALLY, REALLY (really) understand the implications. 99% of folks do not understand said implications.

      1. steve says:

        Java is not an interpreted language

        1. Josh Enders says:

          I’m admittedly not deep in the Java ecosystem, but my understanding is that the Java runtime (Corretto/Oracle JRE) is still an interpreter with a JIT.

      2. Carlos says:

        Stop living in the 90s.

  2. Champok Das says:

    I think that optimizing for the most common path and then adjusting for the extraordinary path is the best way to do things.

    Exceptions should be rare, building the context for an exception is done dynamically as far as I can remember. There is a fair amount of discussion about exactly this topic here on StackOverflow: https://stackoverflow.com/questions/13835817/are-exceptions-in-c-really-slow

    To build onto the comment of E.S., here is more historical context as to why C++ does not store the context except precisely where an exception occurs: https://stackoverflow.com/questions/3311641/c-exception-throw-catch-optimizations

    The current accepted answer as of May 16, 2022 links to the clang-dev mailing list describing why exceptions are done the way they are: https://lists.llvm.org/pipermail/cfe-dev/2015-March/042035.html

    Lemire does make an excellent point and I wholly agree.

    I have a rather harsh opinion of error codes, mostly for the meatspace aspect: the context may not be easily understood, or there is a cascade of missing context when codes are returned up a deep call stack. But that is probably a discussion for a later time.

  3. Joern Engel says:

    The result of “0.05 ns/value” is highly dubious. If your test machine runs at 5 GHz and you manage to execute 4 instructions per cycle, you still consume 0.05 ns/instruction. The code looks like it would take more than 1 instruction per value.

    You use the result, so the compiler probably doesn’t remove the entire loop. But the core loop looks simple enough for auto-vectorization. Depending on your view, that may make the comparison unfair.

    Then again, even translated to scalar code I would expect throughput around 1ns/value, still 500x faster than the exception variant.

    1. I am not sure why you are saying that the result is dubious. You would expect most compilers to autovectorize the exception-free routine. That’s the desired output.

      1. me says:

        I understand his argument that you end up measuring two effects at the same time: the (in-)ability to auto-vectorize this simple code and the overhead of exception handling. And that it would be better to test an example where no auto-vectorization happens when you’re interested in the overhead of using exception traps itself.

        1. Please propose a benchmark which measures solely “the overhead of exception handling”.

          1. me says:

            I am not an expert on microbenchmarking C. Maybe some compiler hint to prevent auto-vectorization, plus a more expensive computation inside the loop (say, sqrt or cos), would help. And a third version with no if statement, so we can measure the overhead of if/exception vs. branchless. Though I would expect branchless not to make much of a difference because of CPU pipelining. If we get the 500 ns/value again, we know it’s reliable. Then try with positive-only and negative-only input data to see if the cost is asymmetric. Because if it’s highly asymmetric, it may still be fine to use for low-frequency case handling.

      2. Joern Engel says:

        Maybe “surprising” would have been a better term. Typically when I see performance numbers far outside my expectation, there is a bug somewhere. In this case the answer is auto-vectorization, so not a bug.

        Whether it is a fair comparison is a matter of debate. Both answers seem defensible. Anyway, thank you for the interesting benchmark.

  4. Luke Peterson says:

    I randomly stumbled across this article and decided to measure it using Google’s benchmark library. I did it locally and on Quickbench and did not observe the same results as you. You can take a look here:


    Click on the “Quick-bench” link above the source code, and then “Run benchmark” on the Quick Bench website. Alternatively, you could build this locally and test, though you do need to ensure you turn off CPU throttling to get accurate numbers.

    If you haven’t already, Niall Douglas’ outcome and status-code are worth checking out (as well as std::expected), Niall has done some interesting benchmarks:



    1. Your benchmark measures the time needed to sum empty arrays. Unsurprisingly, it is a flat tiny cost.

      Try with the following routine.

      for (size_t i = 0; i != 10000; i++) {
        data.push_back((rand() % 1000) - 500);
      }
      Google Benchmark is convenient, but it does not do anything magical. This being said, tools like Google Benchmark do report the effective CPU frequency when they can so any throttling would be detected. In this instance, it is highly unlikely you will experience throttling on a reasonable platform. In any case, the result is so plain that no sophistication is needed to measure it.

      Thanks for the links.

      1. Luke Peterson says:

        Oops! You are correct, I’ve updated the example here:


        On this updated benchmark on Quick Bench, the with-exception version is 450 times slower than the without-exception version.

        I agree that Google benchmark doesn’t do anything magical, but it does make sharing benchmarks much more concise and does offer quite a few nice utility functions/macros that are annoying to re-implement.

        1. The next blog post chronologically reports on benchmarks written with Google Benchmark…


  5. Geert Bosch says:

    Some time ago, I optimized a very performance sensitive data validation routine. While the original code was good, it didn’t pay a lot of attention to optimizing the common path and separating the more difficult cases. In addition there were some places where we could replace branchy code by small table lookups. I expected gains there.

    The code also returned a status that was either OK or some error code. Mostly for code readability and I-cache footprint, I decided to remove the explicit status handling in favor of throwing exceptions in the (very rare) cases where the input fails validation. To my surprise, the performance benefits were large (like 2x), even for the Google Benchmark loops that have great cache locality.

    The knock-on benefits of telling the compiler “this will not actually happen” (which implicitly confers a will-not-return attribute and puts all code dominating the basic block in question in a cold segment) are tremendous. It gives the compiler information that it can propagate forward and backward through the code: it serves either as a challenge to prove an assertion (and then use that for optimization purposes), or to leave a dynamic check with a “known” outcome, resulting in great code on the happy path. In all, validation speed improved by about an order of magnitude, something inconceivable without taking advantage of never-taken exception paths.

    (The code in question is in the validateBSON routine of the MongoDB server.)

  6. Geert Bosch says:

    To add to my previous comment, a corollary to your post would be: “Avoid returning error codes from performance-sensitive code.” There’s great benefit in focusing on the happy no-error path, and in entirely separating out error-handling code.

    For the input validation part, we expect it never to fail in production code. However, we can’t afford to omit validation. While true zero-cost error handling may be a holy grail that we’ll never find, current C++ exception handling comes pretty close. Without a doubt, it’s a significant improvement over explicit error code handling, both for performance as well as code clarity.