tokio: OOM/stack overflow crash observed when porting tokio 0.1 app to tokio 1.14 (uses tokio::process and grpcio)

Version: Tokio 1.14.0

Platform: Linux […] SMP Thu Dec 9 04:33:29 UTC 2021 armv7l GNU/Linux

Rust: 1.55

Description

I’m going to do my best to report an issue I’ve been trying to track down for about the last month. I’ve been porting a medium-sized application from tokio 0.1 (rustc 1.46) to tokio 1.14 (rustc 1.55), and am now seeing rare but recurring OOM crashes. It looks like a stack overflow, but I have little to go on from the core dumps (they come from a running armv7 device). The port itself is not very substantial, since most of the API is the same, but this crash is new. It happens infrequently, yet often enough to suggest something is genuinely wrong, so I’m inclined to believe a race condition is involved.

I have been able to identify that the crash happens during tokio::process::Child::wait_with_output(), which I invoke like this:

let mut cmd = tokio::process::Command::new(...)...;
futures::future::lazy(move |_| cmd.spawn())
  .and_then(|mut child| {
    // Take stdin so it can be written to before waiting on the child.
    let mut stdin = child.stdin.take().unwrap();
    stdin
      .write_all(...)
      .and_then(move |_| child.wait_with_output())
  })
  .then(|result| ...);
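
For readability, the same control flow in plain async/await form (a sketch only; run, input, and the error handling are illustrative stand-ins, not the actual code):

use tokio::io::AsyncWriteExt;
use tokio::process::Command;

async fn run(mut cmd: Command, input: &[u8]) -> std::io::Result<std::process::Output> {
    let mut child = cmd.spawn()?;
    let mut stdin = child.stdin.take().unwrap();
    stdin.write_all(input).await?;
    // stdin stays alive past the write here, as in the combinator version.
    child.wait_with_output().await
}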

I’ve observed that if I reimplement wait_with_output() so that the stdout/stderr handles are dropped only after wait() has finished, the crash stops. Users on the Rust community Discord have suggested that this is most likely just a symptom of the underlying problem, though. A sketch of that workaround follows.
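
The workaround, roughly (illustrative, not my exact code; tokio::try_join! stands in for however the concurrent reads and wait are actually combined):

use std::process::Output;
use tokio::io::AsyncReadExt;
use tokio::process::Child;

async fn wait_with_output_workaround(mut child: Child) -> std::io::Result<Output> {
    let mut stdout_pipe = child.stdout.take();
    let mut stderr_pipe = child.stderr.take();

    let mut stdout = Vec::new();
    let mut stderr = Vec::new();

    let status = {
        // Drive both pipe reads and wait() concurrently, as the real
        // wait_with_output() does...
        let read_stdout = async {
            if let Some(pipe) = stdout_pipe.as_mut() {
                pipe.read_to_end(&mut stdout).await?;
            }
            Ok::<_, std::io::Error>(())
        };
        let read_stderr = async {
            if let Some(pipe) = stderr_pipe.as_mut() {
                pipe.read_to_end(&mut stderr).await?;
            }
            Ok::<_, std::io::Error>(())
        };
        tokio::try_join!(read_stdout, read_stderr, child.wait())?.2
    };

    // ...but only drop the pipe handles here, after wait() has completed.
    // With this ordering the crash no longer occurs.
    drop(stdout_pipe);
    drop(stderr_pipe);

    Ok(Output { status, stdout, stderr })
}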

One factor that may be contributing to this crash is that I am using grpcio alongside tokio in the application. I’ve tried to take care not to mix executors, yet I am still seeing this issue. Let me explain how grpcio is used, for some extra context in diagnosing the problem.

I instantiate the tokio runtime with tokio::runtime::Runtime::new() and pass its Handle to the gRPC service, which handles requests like this (simplified):

struct FooService {
  app: Arc<Mutex<Application>>,
  handle: tokio::runtime::Handle,
}

impl FooRpc for FooService {
  fn foo(&mut self, ctx: RpcContext, req: FooRequest, sink: UnarySink<FooResponse>) {
    let (resp_tx, resp_rx) = futures::channel::oneshot::channel();
    let resp_future = self.app.lock().unwrap().do_foo(req);
    // Run the application logic on the tokio runtime...
    self.handle.spawn(
      resp_future.map(|resp| resp_tx.send(resp))
    );
    // ...and complete the RPC on grpcio's own executor.
    ctx.spawn(
      resp_rx.then(|resp| sink.success(resp.unwrap()).map(|_| ()))
    );
  }
}

impl Application {
  fn do_foo(&mut self, req: FooRequest) -> impl Future<Output = FooResponse> { ... }
}

The gRPC server is started by calling grpcio::Server::start() in main() (not using #[tokio::main]). main() then blocks to run the event loop by passing a futures::channel::oneshot receiver to Runtime::block_on(); the corresponding sender is only ever completed after a Ctrl-C signal is caught.
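
A sketch of that startup wiring (build_grpc_server, Application::new, and the ctrlc crate are assumptions for illustration; service registration and ports are elided):

fn main() {
    let runtime = tokio::runtime::Runtime::new().unwrap();

    let service = FooService {
        app: std::sync::Arc::new(std::sync::Mutex::new(Application::new())),
        // Only this cloned Handle is shared with the grpcio side.
        handle: runtime.handle().clone(),
    };

    // grpcio drives the service on its own completion-queue threads; tokio
    // only runs the futures explicitly spawned through `handle`.
    let mut server = build_grpc_server(service);
    server.start();

    // Block the main thread on the tokio runtime until Ctrl-C completes the
    // oneshot.
    let (shutdown_tx, shutdown_rx) = futures::channel::oneshot::channel::<()>();
    let shutdown_tx = std::sync::Mutex::new(Some(shutdown_tx));
    ctrlc::set_handler(move || {
        if let Some(tx) = shutdown_tx.lock().unwrap().take() {
            let _ = tx.send(());
        }
    })
    .expect("failed to set Ctrl-C handler");
    let _ = runtime.block_on(shutdown_rx);
}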

Let me know if you need any other details and I will try my best to provide them.

Misc. notes:

  • There is no unsafe code.
  • I have unfortunately not been able to reproduce this crash on my local machine, nor without running a sophisticated stress-testing routine to provoke it.
  • While the crash looks like a stack overflow, the program uses very little memory under normal operating conditions, then crashes suddenly. I tried to monitor stack usage with stacker, but the logging statements I inserted made the crash significantly less frequent, which further suggests a race condition.
  • Users on the Rust community Discord suggested that this behavior could be related to tokio’s “coop budgeting” interacting badly with how grpcio polls futures; I’m not an expert on executor internals, so the implications here are beyond me. (See the sketch after this list.)
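
For what it’s worth, one way to probe the coop-budgeting hypothesis (an experiment, not a fix) would be to opt the problematic future out of the budget with tokio::task::unconstrained and see whether the crash profile changes:

use tokio::process::Child;

// Run wait_with_output() with tokio's cooperative budgeting disabled for
// this future. unconstrained() only affects coop-forced yields; this is
// purely diagnostic.
async fn wait_without_coop(child: Child) -> std::io::Result<std::process::Output> {
    tokio::task::unconstrained(child.wait_with_output()).await
}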


Most upvoted comments

A version of communicate() which drops the stdout/stderr handles inside the read_*_fut blocks also causes the crash to happen (and at that point it is nearly identical to wait_with_output()). Roughly:
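
(A sketch of that crashing variant; names are illustrative and tokio::try_join! stands in for the actual combination. Each pipe handle is moved into its read future and dropped as soon as read_to_end() finishes, i.e. potentially before wait() returns.)

use std::process::Output;
use tokio::io::AsyncReadExt;
use tokio::process::Child;

async fn communicate(mut child: Child) -> std::io::Result<Output> {
    let stdout_pipe = child.stdout.take();
    let stderr_pipe = child.stderr.take();

    let read_stdout_fut = async move {
        let mut buf = Vec::new();
        if let Some(mut pipe) = stdout_pipe {
            pipe.read_to_end(&mut buf).await?;
        } // pipe handle dropped here, inside the read future
        Ok::<_, std::io::Error>(buf)
    };
    let read_stderr_fut = async move {
        let mut buf = Vec::new();
        if let Some(mut pipe) = stderr_pipe {
            pipe.read_to_end(&mut buf).await?;
        } // pipe handle dropped here as well
        Ok::<_, std::io::Error>(buf)
    };

    let (stdout, stderr, status) =
        tokio::try_join!(read_stdout_fut, read_stderr_fut, child.wait())?;

    Ok(Output { status, stdout, stderr })
}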