rclpy: Race condition in `ActionClient.*_async` functions which results in a future that never completes
Bug report
Hi all, thank you for maintaining this wonderful library!
I’ve been encountering an intermittent bug that has left me scratching my head. Unfortunately, it recently started occurring more often, so I sat down and spent some time tracking down what was going on. I now have a good idea of the problem, but I need some more expert eyes to decide on a good solution. I’ll lay out all my findings and some perceived paths forward in this bug report.
Required Info:
- Operating System:
- Ubuntu 22.04, within a docker image that is also running Ubuntu 22.04
- Installation type:
- Binaries as installed from ros2 apt.
- Version or commit hash:
- rclpy==3.3.7, python3.10
- DDS implementation:
- CycloneDDS
- Client library (if applicable):
- rclpy==3.3.7, python3.10
Steps to reproduce issue
There are multiple places where this race condition manifests itself; I will demonstrate how to reproduce it when calling `send_goal_async`. Take `ActionClient.send_goal_async` and add a `sleep()` call (imported from `time`) directly below the line `sequence_number = self._client_handle.send_goal_request(request)`. For example:
```python
def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
    """
    ...
    """
    if not isinstance(goal, self._action_type.Goal):
        raise TypeError()

    request = self._action_type.Impl.SendGoalService.Request()
    request.goal_id = (
        self._generate_random_uuid() if goal_uuid is None else goal_uuid
    )
    request.goal = goal
    sequence_number = self._client_handle.send_goal_request(request)
    sleep(5)  # <--- this line causes this race condition nearly 100% of the time
    if sequence_number in self._pending_goal_requests:
        raise RuntimeError(
            "Sequence ({}) conflicts with pending goal request".format(
                sequence_number
            )
        )

    if feedback_callback is not None:
        # TODO(jacobperron): Move conversion function to a general-use package
        goal_uuid = bytes(request.goal_id.uuid)
        self._feedback_callbacks[goal_uuid] = feedback_callback

    future = Future()
    self._pending_goal_requests[sequence_number] = future
    self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
    future.add_done_callback(self._remove_pending_goal_request)
    # Add future so executor is aware
    self.add_future(future)

    return future
```
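A caller along these lines is enough to observe the hang (a sketch only; the `Fibonacci` action type and `'fibonacci'` action name are placeholders for whatever action server you have running). The node spins in a background thread while the goal is sent from the main thread, matching the two-thread interleaving described under Additional information below:

```python
import threading
import time

import rclpy
from rclpy.action import ActionClient
from example_interfaces.action import Fibonacci  # placeholder action type

rclpy.init()
node = rclpy.create_node('repro_client')
client = ActionClient(node, Fibonacci, 'fibonacci')  # placeholder action name

# Spin in a background thread so goal responses are handled there,
# while send_goal_async() runs (and sleeps) in the main thread.
spin_thread = threading.Thread(target=rclpy.spin, args=(node,), daemon=True)
spin_thread.start()

client.wait_for_server()
future = client.send_goal_async(Fibonacci.Goal(order=5))  # patched method sleeps for 5 s

# With the sleep() in place, the goal response arrives while the bookkeeping
# dictionaries are still empty, the "Ignoring unexpected goal response" warning
# is logged, and this loop never exits.
while not future.done():
    node.get_logger().info('still waiting for the goal response future...')
    time.sleep(1.0)
```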
Expected behavior
I would expect `send_goal_async` to run a bit slower than usual and then return a future. That future might complete quite quickly (or already be complete).
Actual behavior
The future never completes, and I see the following line in the logs:
`Ignoring unexpected goal response. There may be more than one action server for the action '{ACTION_NAME}'`
Additional information
By my sleuthing, it appears that this error is caused by the following interleaving of calls across threads:
**client:Thread1**
1. `send_goal_async` is called, and within it:
2. `sequence_number = self._client_handle.send_goal_request(request)`
3. `self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id`

**client:Thread2**
1. A response to a goal request is received in the client C bindings, causing it to return True to `is_ready(...)`
2. `execute(...)` is called
3. `if sequence_number in self._goal_sequence_number_to_goal_id:` evaluates to False, because client:Thread1 has not reached its step 3
4. It ignores the `taken_data`, and thus the future will never have `set_result` called.

**server:Thread1**
1. A goal request is received and responded to
Cause Summary
There is a race condition when sending goal requests, because the goal response can arrive before the `sequence_number` is tracked in the `self._goal_sequence_number_to_goal_id` dictionary. Thus `execute` is called before the sequence number is tracked, and the `taken_data` is ignored on the assumption that it is a response from a duplicate action server.
I believe this same race condition affects all usages of the client API, such as `_cancel_goal_async` or `_get_result_async`.
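For illustration, the same pattern can be reproduced outside of rclpy with a toy sketch (all names here are made up): the bookkeeping dictionary is populated only after the request has been "sent", so a response that arrives quickly is checked against a dictionary that does not yet contain its sequence number.

```python
import threading
import time

# Stands in for _goal_sequence_number_to_goal_id / _pending_goal_requests.
goal_id_by_sequence = {}

def send_goal_request():
    sequence_number = 1                  # the C layer hands back a sequence number
    time.sleep(0.1)                      # window in which the response can arrive
    goal_id_by_sequence[sequence_number] = 'goal-id'  # bookkeeping happens too late

def handle_goal_response():
    time.sleep(0.05)                     # the response lands inside the window
    if 1 in goal_id_by_sequence:
        print('response matched; the pending future would be completed')
    else:
        print('Ignoring unexpected goal response ...')  # the future is never resolved

threads = [threading.Thread(target=send_goal_request),
           threading.Thread(target=handle_goal_response)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```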
Recommendations
- Put a lock around every usage of `self._client_handle` so that the Python-side objects can be edited and kept in sync. That way, if `execute` is called, it will block until `self._client_handle` isn't being used and `_goal_sequence_number_to_goal_id` has been filled in (see the sketch after this list).
- Alternatively, create locks for the various in-memory dictionaries that are used in `execute`:
  - `_goal_handles`
  - `_pending_goal_requests`
  - `_goal_sequence_number_to_goal_id`
  - `_pending_cancel_requests`
  - `_pending_result_requests`
  - `_result_sequence_number_to_goal_id`
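A minimal sketch of the first recommendation, assuming a subclass is an acceptable place to hold the lock (the `LockedActionClient` name is mine; a real fix would take the lock inside `ActionClient` itself and also cover `_cancel_goal_async` and `_get_result_async`):

```python
import threading

from rclpy.action import ActionClient


class LockedActionClient(ActionClient):
    """Sketch: serialize the goal-sending path against execute()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One lock guarding _client_handle and the tracking dictionaries.
        self._handle_lock = threading.Lock()

    def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
        # The bookkeeping in _pending_goal_requests and
        # _goal_sequence_number_to_goal_id completes before execute() can run.
        with self._handle_lock:
            return super().send_goal_async(goal, feedback_callback, goal_uuid)

    async def execute(self, taken_data):
        # Block until any in-flight send_goal_async() has finished its bookkeeping.
        with self._handle_lock:
            return await super().execute(taken_data)
```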
Commits related to this issue
- use the same callback group for ActionClient execution and send goal process to avoid the issue https://github.com/ros2/rclpy/issues/1123 Signed-off-by: Daisuke Sato <daisukes@cmu.edu> — committed to CMU-cabot/cabot by daisukes 10 months ago
- use the same callback group for ActionClient execution and send goal process to avoid the issue https://github.com/ros2/rclpy/issues/1123 Signed-off-by: Daisuke Sato <daisukes@cmu.edu> — committed to CMU-cabot/cabot-navigation by daisukes 10 months ago
To others who think they might be running into this issue, here’s a quick hack you can use that doesn’t require building rclpy:
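One such workaround, reflected in the commits listed above, is to give the ActionClient the same mutually exclusive callback group as the code that sends the goal, so the executor cannot process the goal response while `send_goal_async` is still in flight. A minimal sketch, assuming an `example_interfaces` Fibonacci action server is available (the node and action names are placeholders):

```python
import rclpy
from rclpy.action import ActionClient
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node

from example_interfaces.action import Fibonacci  # placeholder action type


class SameGroupClient(Node):

    def __init__(self):
        super().__init__('same_group_client')
        # The action client and the callback that sends goals share one
        # mutually exclusive group, so the executor never runs the client's
        # response handling while send_goal_async() is still in flight.
        self._group = MutuallyExclusiveCallbackGroup()
        self._client = ActionClient(
            self, Fibonacci, 'fibonacci', callback_group=self._group)
        self._timer = self.create_timer(
            1.0, self._send_goal, callback_group=self._group)

    def _send_goal(self):
        self._timer.cancel()  # send a single goal
        if not self._client.wait_for_server(timeout_sec=5.0):
            self.get_logger().error('action server not available')
            return
        future = self._client.send_goal_async(Fibonacci.Goal(order=5))
        future.add_done_callback(
            lambda f: self.get_logger().info('goal accepted: %s' % f.result().accepted))


def main():
    rclpy.init()
    rclpy.spin(SameGroupClient(), executor=MultiThreadedExecutor())


if __name__ == '__main__':
    main()
```

The key point is that the goal is sent from a callback managed by the executor rather than from a free-running user thread, and that this callback shares a mutually exclusive group with the client, so the two can never overlap.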
Yeah, that very well may be the case. If there is a simple example I can use to try to reproduce it, I’m certainly happy to take a look and review a PR.
I was looking at the code for service clients, and it also uses a lock to sync the C++ state (`self.__client` calls) with the Python state (`self.__pending_requests`). I think this highlights that the issue on the action client side is just an oversight, and that using locks is the way to go. I'll submit a PR!
You are right, the lock does seem to solve the issue. Once I fixed the issue caused by the race condition, I had another unrelated bug in my setup which was misleading me into believing that the issue was not fixed.
@adityapande-1995 Adding another viewpoint here: dictionaries like `_goal_sequence_number_to_goal_id` are being edited and read from two threads: the thread from which the user sends the goal request, and the thread in which the node spins, which calls `async def execute`. So the race condition here is evident from the code itself.