rclcpp: ActionClient Crash: Taking data from action client but nothing is ready

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • iron
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Start action server

  • server shall accept goal, but not terminate it

Start action client

  • send goal

wait

Expected behavior

no crash

Actual behavior

After some time, the process aborts with the exception: terminate called after throwing an instance of 'std::runtime_error' what(): Taking data from action client but nothing is ready

Additional information

Using a multi-threaded spinner with the default callback group, which I believe is mutually exclusive.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 25 (14 by maintainers)

Most upvoted comments

This problem is hard to reproduce: you need a certain workload and actions running with a multithreaded spinner.

As for the explanation of what is going on, this is a long one…

First we need to look at the multithreaded executor: https://github.com/ros2/rclcpp/blob/e2965831d51e9be03470cb07f8721012afcade9b/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp#L82-L100

The important point here is that get_next_executable is called while holding a lock, whereas execute_any_executable(any_exec) is called without it.

Now we need to look at the function https://github.com/ros2/rclcpp/blob/e2965831d51e9be03470cb07f8721012afcade9b/rclcpp_action/src/client.cpp#L266-L274 This function is called by get_next_executable deep under the hood. The important point here is that the kind of the next executable is stored in the global pimpl.

Next is the function https://github.com/ros2/rclcpp/blob/e2965831d51e9be03470cb07f8721012afcade9b/rclcpp_action/src/client.cpp#L622 This function is called by execute_any_executable. As pointed out above, this call is no longer protected by the lock, and any other thread may have called ClientBase::is_ready in between.

As one can see, the execute function uses the values in the global pimpl object to determine the kind of action that needs processing. This is the point where the race occurs: the global pimpl may have been altered by another thread, leading to the "Executing action client but nothing is ready" exception.

  if (pimpl_->is_feedback_ready) {
    // ... handle feedback ...
  } else if (pimpl_->is_status_ready) {
    // ... handle status ...
  } else if (pimpl_->is_goal_response_ready) {
    // ... handle goal response ...
  } else if (pimpl_->is_result_response_ready) {
    // ... handle result response ...
  } else if (pimpl_->is_cancel_response_ready) {
    // ... handle cancel response ...
  } else {
    throw std::runtime_error("Executing action client but nothing is ready");
  }

Note, I also saw races between is_ready and take_data, which are a mystery to me, as this path should be protected by the mutex in the spinner. The patch also addresses these.

@jmachowinski, I was thinking about clearing waitable_triggered_handles_ along with the other handles in clear_handles(), since it's called by the executor at the beginning of the wait_for_work() routine. However, I am not sure about the contract here, as the following is potentially possible: collect_entities() adds a waitable to the list; it is then moved to waitable_triggered_handles_ but does not get executed; on the next iteration the same waitable is not added to the list because it is no longer in a group, so it is never executed even though it was marked ready for execution on the previous iteration. I am not familiar enough with the internals to say for sure whether such a scenario is even valid. Something tells me it is not, but I didn't want to introduce more issues, so I went with a check before adding a waitable instead of clearing the list. If someone familiar with the matter can confirm that, then clearing the triggered list might be the better approach.

Thank you, @jmachowinski, for the additional details. After getting the race condition caused by execute() fixed, I continued to experience the "Taking data from action client…" runtime error, and at this point it could not be caused by any race condition, as is_ready() and take_data() could no longer be called by different threads of the multithreaded executor. So I kept poking around and found the actual cause: take_data() can get called multiple times for a single message in the queue.

Quick overview of involved functions:

  • get_next_executable() first checks whether there is a ready executable by calling get_next_ready_executable(); if there is none, it calls wait_for_work() and then runs get_next_ready_executable() again.
  • get_next_ready_executable() gets the next waitable from the memory strategy and calls take_data() on it. The memory strategy pops the next waitable from the triggered waitables list that has an execution slot in its callback group - in the case of a mutually exclusive group, this means no other callback from that group is being executed at the moment.
  • wait_for_work() clears all handles in the memory strategy, collects all handles from the registered nodes (this is when our action client is added), then updates the wait set and finally calls remove_null_handles() on the memory strategy. The latter checks which waitables are ready by calling is_ready() on each of them and moves the ready ones to the triggered waitables list.

Now, what sequence leads to the multiple calls of take_data():

  1. The first call to get_next_ready_executable() returns nothing, and wait_for_work() is called.
  2. It finds that an action client (a waitable) has some data to process: its is_ready() returned true, so it was moved to the triggered waitables list.
  3. The second call to get_next_ready_executable() doesn't return that waitable, either because something else is running in its callback group (which is set to mutually exclusive mode) or because some other entity was returned first.
  4. The loop continues, either in the same thread or in a different one - it doesn't matter.
  5. Again the first get_next_ready_executable() is called and returns nothing - our waitable is still blocked by the callback group and is kept in the triggered list.
  6. wait_for_work() is called again and finds that our waitable is still ready, so it adds it to the triggered list again. Now we have two records for the same waitable.
  7. get_next_ready_executable() finally returns the first record of the waitable from the triggered list, which has already had take_data() invoked.
  8. In the same or another thread, either step 5 or step 7 picks up the second record of the waitable from the triggered list, but this time take_data() fails, as it doesn't know what message type to read.

So the actual problem must be in AllocatorMemoryStrategy as it blindly adds ready waitables to the triggered list. https://github.com/ros2/rclcpp/blob/e2965831d51e9be03470cb07f8721012afcade9b/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp#L123-L127

I ended up with a simple change: add a waitable only if it's not already present in the triggered list, and it fixed the problem for me. This is in addition to a change in the execute() method to not depend on class members for the message type.

I also tried your fix and it works as well, since it makes is_ready() read the data in place rather than in take_data(), so duplicates in the triggered list are no longer possible.

I am not sure what else might be impacted by the flow described above besides the action client base - I don't know what else uses the Waitable interface in rclcpp.