Daniel Lemire's blog


Reusing a thread in C++ for better performance

15 thoughts on “Reusing a thread in C++ for better performance”

  1. KasOb. says:

    Hi Daniel,
    (Been following this blog for years, yet this is my first comment.)

    Every now and then, and for a very long time, this subject has intrigued me. I get results similar to yours, but that is not the question. The question is why the CPU industry, mostly driven by the need for more speed and more cores, does not focus on (or even ignores) this exact point: switching between threads. GPUs have hundreds of cores and CPUs already have tens, yet there is no dedicated instruction, similar to HLT (halt), that can be woken by another instruction, nor a dedicated instruction set for timing very short sleeps to save power. That could be very useful: it would boost speed in some cases and save power in others.
    Why does switching between threads in an efficient way seem to be unimportant, or not a priority?

    To me it looks like it has been decided that this is a software issue, to solve or to live with. Yet CPU technologies do evolve to speed up specific software problems, so maybe it is hard, or wrong, to do in hardware. On the other hand, what was considered hard or impossible 15 or 20 years ago (or even more) now fits in a device you can hold in one hand, and that means one thing: hard and impossible are relative matters, not absolute ones.
    Was it wrong to begin with? Or is it just wrong relative to our time, and might it be seen differently in a few years?

    Daniel, I would love to read your opinion and thoughts about that, maybe in a blog post.

    1. Jorge says:

      switching between threads in an efficient way seems to be unimportant, or not a priority

      It is very application-dependent. In HPC (scientific computing), programs typically pin one thread to each core so threads don’t disturb each other, while operating systems are optimized to minimize the noise introduced by other applications taking CPU time.

      yet there is no dedicated instruction, similar to HLT (halt), that can be woken by another instruction, nor a dedicated instruction set for timing very short sleeps to save power

      Intel processors already have something like that: the monitor and mwait instructions track a memory location and put the core in a low-power state. The problem is that they are processor-specific and not portable to other platforms.

    2. nicoo says:

      Hi KasOb,

      There’s definitely a lot going on in CPU technology to reduce the cost of concurrency and context switching:

      Hyperthreads are definitely the best known: the CPU exposes a single core (with a single set of execution ports) as a pair of “logical” cores to the OS, which can schedule two different tasks on it; the CPU executes both tasks interleaved, and whenever one task stalls (for instance, on a cache miss or an atomic memory operation, or because it is spinning on a lock and signals this with _mm_pause), the other task can run.
      In a more traditional system (no HT, software scheduler), the cycles that the task spent stalled would simply be “lost” (no useful work happening).
      New concurrency-related hardware features (lock elision, hardware transactional memory, …) enable faster implementations of locks/semaphores, work queues, etc…
      Those hardware features are rarely used directly by software engineers, as they require very specialised knowledge to use effectively, but libraries of high-performance concurrency primitives tend to leverage them.
      On Arm Cortex-M microcontrollers (e.g. ARMv8-M), the NVIC (Nested Vectored Interrupt Controller) supports fairly complex and flexible task configurations.
      For instance, the RTIC (Real-Time Interrupt-driven Concurrency) framework reduces a program’s scheduling policy (i.e. the relative priorities of various tasks) to an NVIC configuration at compile time, meaning that all context switching and task management is handled by the hardware rather than by a software scheduler. As a cherry on top, RTIC extracts information about which resources each task uses, both to avoid unnecessary locks (if a task uses a given shared resource but no higher-priority task does, it can safely skip taking and releasing the lock) and to avoid unnecessary blocking (when a task A is in a critical section, only tasks that use some of the same resources are blocked; higher-priority tasks that do not interact with A can still preempt it as needed).
      I’m not aware of any general-purpose OS doing this, though. 🙁

      1. KasOb. says:

        Thank you Nicolas,

        What you described about the ARM interrupt controller is in fact very interesting (I didn’t know that). Also, reading that Apple will release Macs with ARM processors in 2021 indicates that the processor technology race is not slowing down; on the contrary, it is picking up pace.

        The cherry you mentioned: IMHO, it makes sense to use it to simplify a multi-reader single-writer implementation (maybe even multi-writer with atomic behaviour!), providing higher efficiency with lower power consumption.

        Thank you again for replying with this information.

  2. Vk3y says:

    Hope you will do optimization research on JavaScript 😭 plz

  3. Rozenberg, Eyal says:

    Why implement your own 1-thread thread pool? Just use an existing library; a DuckDuckGo search turns up several.

    1. Why implement your own 1-thread thread pool

      To run benchmarks so that we can understand what the trade-offs are.

  4. In our case, since the operating system closes a thread down in its own time, we quickly ran out of threads using the first approach. Re-using the thread was the only workable solution.

    1. It is intriguing. Did you join your threads and still get the problem? I am hoping that once the call to join succeeds, the thread is gone. Calling detach would be something else… but I hope that “join” actually cleans the thread up…

  5. Rudi says:

    The spinlock approach should be avoided by all means. Especially on single-core machines, it will effectively kill the performance of the whole system. I would never ever do that!

  6. Ryan Olson says:

    I like to use a ThreadPool for such circumstances.

    This is a nice implementation.


  7. Ryan Olson says:

    PS – subscribing without comment is broken.

    1. PS – subscribing without comment is broken.

      I am not sure what this means. Can you elaborate?

  8. nicoo says:

    Some quick observations you might not be aware of:

    when spinning on a lock, it’s usually a good idea to emit an instruction signalling this to the CPU (_mm_pause on x86/amd64, yield on Arm): it enables optimisations such as switching to the other hyperthread on the same core while waiting for the lock, or going low-power (modern CPUs are often bottlenecked by heat management, so going low-power can let other, useful work happen at a higher clock frequency);
    good mutex and work-queue implementations already spin for a short while (to optimise away the context switch when the duty cycle is high) before parking the thread (typically with a futex, so the OS scheduler knows exactly when to wake a thread as work becomes available); I wasn’t quite able to figure out what GNU libstdc++ does from reading the relevant code, but it seems not to do spin-then-futex for some reason;
    in more general work-queue use cases, a spinlock alone is susceptible to priority inversion: if a thread gets interrupted in the critical section, the OS might schedule the other threads (which are spinning uselessly) instead of the one holding the lock.

  9. Ryan Olson says:

    I couldn’t get it to work. The validation logic required a value in the message.