ockam: Diagnose `Cannot drop a runtime ...` error in Rust nodes

This error shows up often:

thread 'tokio-runtime-worker' panicked at 'Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.', /Users/mrinal/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/blocking/shutdown.rs:51:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

We love helping new contributors! If you have questions or need help as you work on your first Ockam contribution, please leave a comment on this discussion. _If you’re looking for other issues to contribute to, check out the issues labeled https://github.com/build-trust/ockam/labels/good first issue or https://github.com/build-trust/ockam/labels/help wanted_

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 43 (43 by maintainers)

Most upvoted comments

@antoinevg and @mrinalwadhwa thank you so much! especially many thanks to @SanjoDeundiak who helped me in the internal details of how ockam works, and @twittner who actually fixed the issue!

Awesome work @pro465! 🏅

Thank you @pro465 👏

@pro465 Context is a pretty basic building block in our Worker system. Workers and Processors are entities that run in the background. Each has a corresponding WorkerRelay or ProcessorRelay that actually runs it. Context is the entity responsible for interacting with the runtime and sending messages. Each Context instance has its own reference to the runtime and an address. Each Worker and Processor owns its own instance of Context, and you can also create detached Context instances that are not attached to any Worker or Processor. That being said, because Context is such a basic block, it’s probably created many times during even the simplest scenario (like ockam node list), and many of those instances are dropped during node shutdown, which is probably where the issue is. Is that helpful?

Hey @pro465, I think the bug is in ockam_node and sometimes happens during node shutdown. It doesn’t depend on which command you run, because almost all commands:

  1. Spawn a node
  2. Do something
  3. Shut down the node

Also, I think it’s connected to how we implement Drop for the Context type. It uses some tricks, because Drop is sync and we needed to do async work there. Could you please take a look there?

im not sure if it will work, but i’ll share a potential solution:

    /// Build and spawn a new worker relay, returning a send handle to it
    pub(crate) fn init(rt: &Runtime, worker: W, ctx: Context, ctrl_rx: SmallReceiver<CtrlSignal>) {
        let relay = WorkerRelay::<W, M>::new(worker, ctx);
-       rt.spawn(relay.run(ctrl_rx));
    +       rt.spawn(async move {
    +           relay.run(ctrl_rx).await;
    +           // send stop ACK here (**after** the `relay` is consumed and dropped)
    +       });
    }

(and removing the send_stop_ack call in the run fn, of course)
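The ordering this proposal relies on can be sketched with the standard library alone. This is only an illustration under stated assumptions: `Ctx`, `Relay`, and `shutdown_order` are hypothetical stand-ins (not ockam types), and a plain thread replaces the spawned async task. Because `run` takes the relay by value, the relay and the context it owns are dropped before the acknowledgement is sent.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical stand-in for a Context; reports when it is dropped.
struct Ctx {
    log: Sender<&'static str>,
}

impl Drop for Ctx {
    fn drop(&mut self) {
        // Ignore send errors: the receiver may already be gone.
        let _ = self.log.send("ctx dropped");
    }
}

/// Hypothetical stand-in for a WorkerRelay; owns its Ctx.
struct Relay {
    ctx: Ctx,
}

impl Relay {
    /// Takes `self` by value, so the relay and its Ctx are dropped on return.
    fn run(self) {
        let _ = self.ctx.log.send("relay running");
    }
}

/// Runs the relay on a thread and returns the observed order of events.
fn shutdown_order() -> Vec<&'static str> {
    let (tx, rx) = channel();
    let relay = Relay { ctx: Ctx { log: tx.clone() } };
    let handle = thread::spawn(move || {
        relay.run();
        // The relay has been consumed and dropped by now; acknowledge the stop.
        let _ = tx.send("stop ack");
    });
    handle.join().unwrap();
    // All senders are gone, so this drains every queued event.
    rx.try_iter().collect()
}
```

Calling `shutdown_order()` yields the events in the order running, dropped, acknowledged, which is the guarantee the diff above is trying to get from the spawned task.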

@mrinalwadhwa i think i’ve found the problem

@SanjoDeundiak thank you for the detailed documentation! and yes its helpful

WorkerRelay::run is also sometimes involved: https://github.com/pro465/ockam/runs/7539189372?check_suite_focus=true#step:6:527

(note: both are from the same GA log)

https://github.com/pro465/ockam/runs/7539189372?check_suite_focus=true#step:6:558

ProcessorRelay::run is (sometimes?) involved, although i could not find where it is called while following the functions called by ockam node list. perhaps the issue is in the background-running ockam node create?

@mrinalwadhwa @SanjoDeundiak i think i have found the issue: the block_future call that Executor::execute uses to “join” user code does not seem to be working, so the user code can continue executing even after the executor is dropped. when a Context is dropped in that user code, it is in an async context AND the executor is already dropped (i.e., the Context is the last one holding a reference to the Arc<Runtime>), so it panics.
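The "last one to hold the Arc" part of this diagnosis can be shown with the standard library. This is a minimal sketch under stated assumptions: `FakeRuntime` and `last_owner_runs_drop` are hypothetical names, and the fake destructor only sets a flag. With a real tokio runtime, a destructor that runs inside async code is what produces the panic message quoted at the top of the issue.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a runtime; flags when its destructor runs.
struct FakeRuntime {
    dropped: Arc<AtomicBool>,
}

impl Drop for FakeRuntime {
    fn drop(&mut self) {
        self.dropped.store(true, Ordering::SeqCst);
    }
}

/// Returns whether the runtime was destroyed after the executor released its
/// handle, and whether it was destroyed after the context released its handle.
fn last_owner_runs_drop() -> (bool, bool) {
    let dropped = Arc::new(AtomicBool::new(false));
    let executor_handle = Arc::new(FakeRuntime { dropped: dropped.clone() });
    let context_handle = Arc::clone(&executor_handle);

    // The executor goes away first, but the context still keeps the runtime alive.
    drop(executor_handle);
    let after_executor = dropped.load(Ordering::SeqCst);

    // The context held the last reference, so its drop runs the destructor.
    drop(context_handle);
    let after_context = dropped.load(Ordering::SeqCst);

    (after_executor, after_context)
}
```

Here `last_owner_runs_drop()` returns `(false, true)`: the destructor runs wherever the final reference is released, not where the executor was dropped.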

evidence: https://github.com/pro465/ockam/runs/7518405635?check_suite_focus=true#step:6:359

hi I’m back! you probably don’t remember me but i contributed before at least 1 time 😃

anyways, can i get a backtrace? (cuz compiling this will take a looong time)