tokio: One bad task can halt all executor progress forever

Version: tested on 1.16, 1.17, and 1.18.2

Platform: confirmed on Linux amd64 and M1 Mac

Description

If one task gets stuck in a busy loop, or doing blocking IO, it can prevent all other tasks from ever being polled again, even with the multi_thread executor. All that seems to be required is for the bad task to jump threads once, which appears to happen fairly randomly.

This is a serious issue because:

  1. it can run for months without triggering, and then your whole program freezes with no obvious cause
  2. in any non-trivial program, you'll have enough tasks and dependencies that you can't reasonably guarantee none of them will ever block or busy-loop

I tried this code:

use std::thread;
use std::time::Duration;

//#[tokio::main]
#[tokio::main(flavor = "multi_thread", worker_threads = 32)]
async fn main() {
    let mut handles = Vec::new();

    handles.push(tokio::spawn({
        async {
            loop {
                println!("{:?}: good still alive", thread::current().id());
                tokio::time::sleep(Duration::from_secs(10)).await;
            }
        }
    }));
    handles.push(tokio::spawn({
        async {
            let orig_thread_id = format!("{:?}", thread::current().id());
            loop {
                println!("{:?}: bad still alive", thread::current().id());
                thread::sleep(Duration::from_secs(10));
                loop {
                    // here we loop and sleep until we switch threads, once we do, we never call await again
                    // blocking all progress on all other tasks forever
                    let thread_id = format!("{:?}", thread::current().id());
                    if thread_id == orig_thread_id {
                        tokio::time::sleep(Duration::from_secs(1)).await;
                    } else {
                        break;
                    }
                }
            }
        }
    }));

    for handle in handles {
        handle.await.expect("handle await");
    }
}

I expected to see this happen:

With 32 worker threads available, you'd expect the bad task to block one thread while the good task keeps printing messages and making progress on another.

Instead, this happened:

ThreadId(27): good still alive
ThreadId(23): bad still alive
ThreadId(2): good still alive
ThreadId(2): bad still alive
ThreadId(2): bad still alive
ThreadId(2): bad still alive
*snip*

"good still alive" will never print again

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 19
  • Comments: 30 (16 by maintainers)

Most upvoted comments

One thought: We don’t necessarily need a separate thread for monitoring this. We could make sure that whenever at least one thread is idle, one of the idle threads without the IO driver has a monitor timeout on its park call to allow it to steal the IO driver if the other thread is blocked.

Along the lines of @Darksonn’s suggestion here…if a worker is currently holding the IO driver and it’s transitioning to start polling its own tasks, shouldn’t it try to wake a parked worker to steal the IO driver? It seems like, even if none of the tasks on the worker that’s holding the IO driver will block, there’s still potentially an entire scheduler tick of latency until IO is polled again, and we could avoid that by eagerly giving the IO driver to a parked thread…

Well, one challenge with per-worker solutions is that if we tie each resource to a specific core, then that resource will not receive wakeups while that core is blocked.

Also, for anyone who wants a solution now, you can spawn your own monitor thread. See here for an example.
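
A minimal sketch of that kind of monitor, for reference (the function name, the one-second heartbeat, and the ten-second threshold are illustrative assumptions, not taken from the linked example): a heartbeat task increments a counter on the runtime, and a plain OS thread, which the runtime cannot starve, checks that the counter keeps advancing.

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

fn spawn_blocking_detector(handle: tokio::runtime::Handle) {
    let beats = Arc::new(AtomicU64::new(0));

    // Heartbeat task: ticks once per second on the runtime being monitored.
    let beats_task = beats.clone();
    handle.spawn(async move {
        loop {
            beats_task.fetch_add(1, Ordering::Relaxed);
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });

    // Monitor thread: runs outside the runtime, so it keeps running even if
    // every worker thread is blocked.
    std::thread::spawn(move || {
        let mut last = beats.load(Ordering::Relaxed);
        let mut last_change = Instant::now();
        loop {
            std::thread::sleep(Duration::from_secs(1));
            let now = beats.load(Ordering::Relaxed);
            if now != last {
                last = now;
                last_change = Instant::now();
            } else if last_change.elapsed() > Duration::from_secs(10) {
                eprintln!("warning: tokio runtime has made no progress for 10s");
                // Reset the timer so the warning is not printed every second.
                last_change = Instant::now();
            }
        }
    });
}

Call it from inside the runtime, e.g. spawn_blocking_detector(tokio::runtime::Handle::current()), before spawning application tasks.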

is tokio going to fix this or what?

My issues, as mentioned in January/February in this thread (related to mixing synchronous and asynchronous code), were resolved by https://github.com/cvkem/async_bridge, a generic solution I developed to make mixing the two styles easier. It helped me resolve my issues back then and has served me well during further development. I have added documentation and learnings, and I hope it helps others move forward too.

This feels somewhat similar to Spectre and Meltdown where Intel was “caught speeding” – their optimizations were too aggressive for the happy-path (i.e. CPU only loaded with trusted, well-behaved programs) and that opened the door to other issues. Similarly, everything is great with tokio, until it’s not.

As is common in software engineering, it's not so much the performance or operation of a well-behaved program that matters; it's dealing with the exceptional cases or nefarious actors that requires so much thought. That might be where we're at here.

It would be interesting to know how much of a performance hit the potential solutions have and if it would be possible to feature-flag it. We could benchmark our own app, though it’s probably not representative of other workloads, and we wouldn’t be able to take this work item until the new year.

Look, it isn’t so simple. All potential solutions have disadvantages to them. They hurt the performance of well-behaved programs, and also raise questions about forwards-compatibility.

If you want to help, you could investigate how other runtimes handle this problem. For example, what does (did?) Go do if the thread with the epoll instance gets blocked by a tight loop without yield points, and all other threads are asleep? (Runtimes that poll IO from a dedicated thread are not relevant. We are not going to do that.)

As far as I can tell, there is no bug in tokio. The behavior in the example code could have happened before as well; it was just far less likely, due to spurious thread wake-ups that existed prior to 1.16 and were removed in that release.

The issue here, fundamentally, is that when one accidentally blocks the runtime, it would be nice to know that something went wrong and to tolerate it gracefully. Currently, Tokio does not offer a way to do this reliably; it just “happened” to do so in some cases earlier.

The only reliable way to tolerate accidental blocking that I can think of is a dedicated thread that monitors the state of each worker thread. If a worker gets stuck for too long, this monitor thread could warn and move all tasks currently on the stuck thread to a new thread.

As @Darksonn mentioned, there is no obvious solution to this that comes without additional overhead. Fundamentally, Tokio is a cooperative multi-tasking system, which puts some responsibility on the user. As such, I will close this issue, as there is no “bug” in Tokio.

That said, there would be value in detecting and possibly mitigating poorly behaved tasks on an opt-in basis. I opened #6315 as a tracking issue and referenced this issue. This will help reframe the conversation.
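
For context, the usual way to keep known-blocking work off the async workers today is tokio::task::spawn_blocking (or block_in_place). A minimal sketch, where the sleep stands in for arbitrary blocking work:

use std::time::Duration;

#[tokio::main]
async fn main() {
    // Blocking or CPU-heavy work runs on the dedicated blocking pool,
    // so the async worker threads keep polling other tasks.
    let result = tokio::task::spawn_blocking(|| {
        std::thread::sleep(Duration::from_secs(10)); // stand-in for blocking IO
        42
    })
    .await
    .expect("blocking task panicked");

    println!("blocking work finished: {result}");
}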

I’ll +1 this.

This is a pretty ugly performance wart. Our team has independently hit this issue twice, where some singular task takes longer than the developer expects and this results in 100ms+ of latency for API requests that we otherwise expect to complete very quickly.

The fact that one pathological task is able to harm a bunch of otherwise good API requests is alarming behavior.
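
When the offending task is CPU-bound rather than truly blocking, one mitigation is to insert explicit yield points so other tasks on the same worker get polled between chunks. A minimal sketch, with an arbitrarily chosen chunk size:

// Hypothetical CPU-heavy job, broken up with explicit yield points so that
// other tasks on the same worker are not starved for the whole run.
async fn heavy_job(items: Vec<u64>) -> u64 {
    let mut sum = 0;
    for (i, item) in items.into_iter().enumerate() {
        sum += item * item; // stand-in for real per-item work
        if i % 1024 == 0 {
            // Give the scheduler a chance to poll other tasks.
            tokio::task::yield_now().await;
        }
    }
    sum
}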

if a worker is currently holding the IO driver and it’s transitioning to start polling its own tasks, shouldn’t it try to wake a parked worker to steal the IO driver?

I wonder why we don’t create a dedicated thread to run the IO driver?

Our current approach results in very little synchronization overhead and good tail latency under load. The background thread approach is much worse on both metrics, and is ultimately far slower and less effective for production workloads.

Generally, the cause of the halting is a task that blocks the thread. You should be able to detect those using tokio-console. Tasks that block the thread show up as “busy duration” increasing while the number of polls remains constant.
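
A minimal setup for that, assuming the console-subscriber crate as a dependency and a build with RUSTFLAGS="--cfg tokio_unstable" so tokio emits task instrumentation:

#[tokio::main]
async fn main() {
    // Registers the tracing layer that the `tokio-console` CLI connects to.
    console_subscriber::init();

    // ... spawn application tasks here; a task that blocks its thread shows a
    // growing busy duration with a constant poll count in the console UI.
}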

I think it would be a good idea to try either of the solutions from @hawkw and @Darksonn and hide them behind builder flags on tokio_unstable kinda like with the LIFO slot. We could then have a way of looking at the performance impact on other types of applications without needing to first stabilize this behavior.
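
For illustration, the existing LIFO-slot flag looks like this on the builder (a sketch assuming a recent tokio built with --cfg tokio_unstable; any new monitoring flag would presumably follow the same pattern):

fn main() {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        // Unstable builder flag; the existing precedent for gating scheduler behavior.
        .disable_lifo_slot()
        .build()
        .expect("failed to build runtime");

    runtime.block_on(async {
        println!("runtime running with the LIFO slot disabled");
    });
}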

There haven’t been any updates.

Here's my theory for how the deadlock occurred in this example. The first thing to note is that even if multiple worker threads are idle, only one of them is waiting for IO/timer events. The remaining threads are just sleeping using park. The runtime decides who gets the IO driver by putting it behind a mutex and having each thread call try_lock before going to sleep: the thread that acquires the lock sleeps inside the driver, and threads that fail the try_lock call just sleep using park.

Now, given the above, here’s what I think happened:

  1. All tasks are idle. Thread 2 has the IO driver.
  2. Thread 2 is woken by the second task’s timer. The other threads continue to block on park. The first task is still idle.
  3. Thread 2 starts polling the second task.
  4. When the first timer expires, nobody notices because no thread is sleeping on the IO driver. No work stealing happens because the first task is idle, and will continue to be idle until the IO driver runs.
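
To make the handoff described above concrete, here is a simplified model of the park/try_lock pattern (an illustration of the idea only, not tokio's actual internals):

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Stand-in for the single IO/timer driver that idle workers race for.
struct IoDriver;

fn idle_worker(id: usize, driver: Arc<Mutex<IoDriver>>) {
    match driver.try_lock() {
        Ok(_guard) => {
            // This worker "sleeps inside the IO driver" (think epoll_wait),
            // so IO and timer events can wake it.
            println!("worker {id}: sleeping on the IO driver");
            thread::sleep(Duration::from_secs(1));
        }
        Err(_) => {
            // Another worker holds the driver: this thread just parks and will
            // not notice timer expirations until something unparks it.
            println!("worker {id}: parked without the IO driver");
            thread::park_timeout(Duration::from_secs(1));
        }
    }
}

fn main() {
    let driver = Arc::new(Mutex::new(IoDriver));
    let handles: Vec<_> = (0..3)
        .map(|id| {
            let driver = driver.clone();
            thread::spawn(move || idle_worker(id, driver))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}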