rclpy: Race condition in `ActionClient.*_async` functions that results in a future that never completes

Bug report

Hi all, thank you for maintaining this wonderful library!

I’ve been encountering an intermittent bug that has left me scratching my head. Unfortunately, it has recently started occurring more often, so I sat down and spent some time tracking down what was going on. I now have a good idea of the problem, but I need some more expert eyes to decide on a good solution. I’ll lay out my findings and some possible paths forward in this bug report.

Required Info:

  • Operating System:
    • Ubuntu 22.04, within a docker image that is also running Ubuntu 22.04
  • Installation type:
    • Binaries as installed from ros2 apt.
  • Version or commit hash:
    • rclpy==3.3.7, python3.10
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclpy==3.3.7, python3.10

Steps to reproduce issue

This race condition manifests itself in multiple places; I will demonstrate how to reproduce it when calling send_goal_async. Take ActionClient.send_goal_async and add a sleep(...) directly below the line sequence_number = self._client_handle.send_goal_request(request). For example:

    def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
        """
        ...
        """

        if not isinstance(goal, self._action_type.Goal):
            raise TypeError()

        request = self._action_type.Impl.SendGoalService.Request()
        request.goal_id = (
            self._generate_random_uuid() if goal_uuid is None else goal_uuid
        )
        request.goal = goal
        sequence_number = self._client_handle.send_goal_request(request)
        sleep(5)   # <--- requires `from time import sleep`; this line triggers the race condition nearly 100% of the time

        if sequence_number in self._pending_goal_requests:
            raise RuntimeError(
                "Sequence ({}) conflicts with pending goal request".format(
                    sequence_number
                )
            )

        if feedback_callback is not None:
            # TODO(jacobperron): Move conversion function to a general-use package
            goal_uuid = bytes(request.goal_id.uuid)
            self._feedback_callbacks[goal_uuid] = feedback_callback

        future = Future()
        self._pending_goal_requests[sequence_number] = future
        self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
        future.add_done_callback(self._remove_pending_goal_request)
        # Add future so executor is aware
        self.add_future(future)

        return future
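
For completeness, here is a minimal sketch of a driver that exercises the modified send_goal_async with the node spun from a background thread. Everything in it is illustrative: the Fibonacci action from example_interfaces, the action name 'fibonacci', and the node name are assumptions, and it presumes a matching action server is already running.

    # Minimal repro sketch (assumptions: a Fibonacci action server is running
    # under the name 'fibonacci'; node and action names are placeholders).
    import threading
    import time

    import rclpy
    from rclpy.action import ActionClient
    from rclpy.executors import MultiThreadedExecutor

    from example_interfaces.action import Fibonacci


    def main():
        rclpy.init()
        node = rclpy.create_node('repro_client')
        client = ActionClient(node, Fibonacci, 'fibonacci')

        # Spin in a background thread so the goal response can be processed
        # while send_goal_async() is still sleeping in this thread.
        executor = MultiThreadedExecutor()
        executor.add_node(node)
        threading.Thread(target=executor.spin, daemon=True).start()

        client.wait_for_server()
        goal = Fibonacci.Goal()
        goal.order = 5

        # With the sleep() patched in above, the response is handled by
        # execute() before the sequence number is recorded, so this future
        # never completes and the "Ignoring unexpected goal response" warning
        # is logged.
        future = client.send_goal_async(goal)

        deadline = time.monotonic() + 15.0
        while not future.done() and time.monotonic() < deadline:
            time.sleep(0.1)
        print('goal response received' if future.done() else 'future never completed')

        rclpy.shutdown()


    if __name__ == '__main__':
        main()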

Expected behavior

I would expect ‘send_goal_async’ to run a bit slower than usual, then return a future. That future might complete quite quickly (or already be complete).

Actual behavior

The future never completes, and I see the following line in the logs:

Ignoring unexpected goal response. There may be more than one action server for the action '{ACTION_NAME}'

Additional information

From my sleuthing, it appears that this error is caused by the following interleaving of calls across threads (a paraphrased sketch of the execute branch involved follows the walkthrough):

client:Thread1

  1. send_goal_async is called, and within it:
  2. sequence_number = self._client_handle.send_goal_request(request)
  3. self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id

client:Thread2

  1. A response to a goal request is received in the client C bindings, causing is_ready(...) to return True
  2. execute(...) is called
  3. if sequence_number in self._goal_sequence_number_to_goal_id: evaluates to False, because client:Thread1 has not reached its step 3
  4. The taken_data is ignored, and thus set_result is never called on the future.

server:Thread1

  1. A goal request is received, and responded to
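
To make the client:Thread2 path concrete, here is a paraphrased sketch (not verbatim rclpy source) of the goal-response branch that execute takes; variable names follow the walkthrough above.

    # Paraphrased sketch of the goal-response branch in ActionClient.execute
    # (not verbatim rclpy source; included only to illustrate the race).
    async def execute(self, taken_data):
        if 'goal' in taken_data:
            sequence_number, goal_response = taken_data['goal']
            if sequence_number in self._goal_sequence_number_to_goal_id:
                # Normal path: the pending future is looked up and completed.
                ...
            else:
                # Race path: Thread1 has not yet stored the sequence number,
                # so the response is dropped and the future is never resolved.
                self._node.get_logger().warning(
                    'Ignoring unexpected goal response. There may be more than '
                    "one action server for the action '{}'".format(self._action_name))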

Cause Summary

There is a race condition when sending goal requests: the response to a goal request can arrive and be processed before the sequence_number is tracked in the self._goal_sequence_number_to_goal_id dictionary. As a result, execute runs before the entry exists, and the taken_data is ignored on the assumption that it is a response from a duplicate action server.

I believe this same race condition affects all usages of the client API, such as _cancel_goal_async and _get_result_async.

Recommendations

  1. Put a lock around every usage of self._client_handle so that the Python-side bookkeeping can be updated and kept in sync (see the sketch after this list). That way, if execute is called it will block until self._client_handle is no longer in use, and _goal_sequence_number_to_goal_id will already be filled in.
  2. Alternatively, create locks for the various in-memory dictionaries that are used in execute:
  • _goal_handles
  • _pending_goal_requests
  • _goal_sequence_number_to_goal_id
  • _pending_cancel_requests
  • _pending_result_requests
  • _result_sequence_number_to_goal_id
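
A rough sketch of what recommendation 1 might look like inside ActionClient follows; the self._handle_lock attribute is an assumption and would need to be created once in __init__ as a threading.Lock().

    # Sketch of recommendation 1. self._handle_lock is assumed to be created
    # in __init__ (e.g. self._handle_lock = threading.Lock()); it is not an
    # existing rclpy attribute.
    def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
        ...
        with self._handle_lock:
            sequence_number = self._client_handle.send_goal_request(request)
            ...
            future = Future()
            self._pending_goal_requests[sequence_number] = future
            self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
            future.add_done_callback(self._remove_pending_goal_request)
            # Add future so executor is aware
            self.add_future(future)
        return future

    async def execute(self, taken_data):
        with self._handle_lock:
            ...  # existing handling of taken_data, now serialized with the senders

Holding a single lock across both the C-handle call and the dictionary updates guarantees that execute can never observe a sequence number before it has been recorded.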

Most upvoted comments

To others who think they might be running into this issue, here’s a quick hack you can use that doesn’t require building rclpy:

from threading import RLock
from typing import Any

from rclpy.action import ActionClient
from rclpy.task import Future


class PatchRclpyIssue1123(ActionClient):

    _lock: RLock = None  # type: ignore

    # Lazily create a single re-entrant lock that serializes every call into
    # the underlying C client handle.
    @property
    def _cpp_client_handle_lock(self) -> RLock:
        if self._lock is None:
            self._lock = RLock()
        return self._lock

    async def execute(self, *args: Any, **kwargs: Any) -> None:
        with self._cpp_client_handle_lock:
            return await super().execute(*args, **kwargs)  # type: ignore

    def send_goal_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super().send_goal_async(*args, **kwargs)

    def _cancel_goal_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super()._cancel_goal_async(*args, **kwargs)

    def _get_result_async(self, *args: Any, **kwargs: Any) -> Future:
        with self._cpp_client_handle_lock:
            return super()._get_result_async(*args, **kwargs)
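
The subclass is a drop-in replacement for ActionClient, for example (the Fibonacci action type, node, and action name below are placeholders):

    # Hypothetical usage: construct the patched client just like ActionClient.
    client = PatchRclpyIssue1123(node, Fibonacci, 'fibonacci')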

Do you think that the difference might be in the executor? Come to think of it, I’m guessing this would only occur in a MultiThreadedExecutor case, since a SingleThreadedExecutor wouldn’t be able to have a race condition like this.

Yeah, that very well may be the case. If there is a simple example I can use to try to reproduce it, I’m certainly happy to take a look and review a PR.

I was looking at the code for service clients, and it also uses a lock to sync the C++ state (the self.__client calls) with the Python state (self.__pending_requests). I think this highlights that the issue on the action client side is just an oversight, and that using locks is the way to go.

I’ll submit a PR!

I don’t fully understand how a lock doesn’t fix the issue in this case. Are you certain you were running the code with the lock when you observed this issue? 😅. It guarantees the dictionaries are updated at least in the send_goal_async case, because the execute function can’t receive a sequence number until the client has been used in send_goal_async (which is inside a lock). Unless I am misunderstanding this race condition?

I have a similar fix on my end (I subclass ActionClient and lock the entirety of send_goal_async and execute) and it has solved the problem for me.

You are right, the lock does seem to solve the issue. Once I fixed the race condition, another unrelated bug in my setup misled me into believing that the issue was not fixed.

@adityapande-1995 Adding another viewpoint here: dictionaries like _goal_sequence_number_to_goal_id are edited and read from two threads: the thread the user sends the goal request from, and the thread in which the node spins, which calls async def execute. So the race condition is evident from the code itself.