ray: [release] many_ppo failing: "There was timeout in removing the placement group "

What is the problem?

Weekly long-running tests:

2021-09-12 04:13:06,836 INFO log_timer.py:27 -- NodeUpdater: ins_UBpk1QT7cF9Xpx681xfUQFYV: Got IP  [LogTimer=63ms]
Traceback (most recent call last):
  File "workloads/many_ppo.py", line 41, in <module>
    callbacks=[ProgressCallback()])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 711, in run_experiments
    callbacks=callbacks).trials
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 588, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 646, in step
    self._run_and_catch(self.trial_executor.on_step_end)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 396, in _run_and_catch
    func(self.get_trials())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 988, in on_step_end
    self._pg_manager.cleanup()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/placement_groups.py", line 296, in cleanup
    remove_placement_group(pg)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 122, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/placement_group.py", line 253, in remove_placement_group
    worker.core_worker.remove_placement_group(placement_group.id)
  File "python/ray/_raylet.pyx", line 1488, in ray._raylet.CoreWorker.remove_placement_group
  File "python/ray/_raylet.pyx", line 158, in ray._raylet.check_status
ray.exceptions.GetTimeoutError: There was timeout in removing the placement group of id eb068a103a8de03cd6ec0544849331b4. It is probably because GCS server is dead or there's a high load there.
(pid=44646, ip=172.31.89.10) 2021-09-12 04:14:16,933    WARNING trainer_template.py:186 -- `execution_plan` functions should accept `trainer`, `workers`, and `config` as args!
(pid=44646, ip=172.31.89.10) 2021-09-12 04:14:16,933    INFO trainable.py:111 -- Trainable.setup took 73.185 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=44646, ip=172.31.89.10) 2021-09-12 04:14:16,934    WARNING util.py:57 -- Install gputil for GPU system monitoring.
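
The failing call is ray.util.placement_group.remove_placement_group, invoked by Tune's placement group manager during cleanup (ray_trial_executor.py -> placement_groups.py in the traceback above). As a rough sketch only (not taken from the release test; the bundle shape and loop count here are made up), the same code path can be exercised directly to check whether removal alone times out when placement groups are created and destroyed in a loop:

import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()  # or ray.init(address="auto") on an existing cluster

# Illustrative loop, not part of many_ppo.py: create a small placement group,
# wait for it to be scheduled, then remove it again.
for i in range(1000):
    pg = placement_group([{"CPU": 1}])
    ray.get(pg.ready())
    # This is the call that raises GetTimeoutError in the traceback above.
    remove_placement_group(pg)
    if (i + 1) % 100 == 0:
        print(f"removed {i + 1} placement groups")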

Failing after 1 hour (we see this sometimes in the nightly tests, too):

  • https://buildkite.com/ray-project/periodic-ci/builds/963#532e36d2-7d77-42a1-aa67-b6837f5a64f6
  • https://buildkite.com/ray-project/periodic-ci/builds/967#49b3cc43-6c31-4eed-9922-55744fe03ddb

cc @rkooo567 @ericl

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
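
There is no standalone snippet yet; the failure comes from workloads/many_ppo.py in the release test suite. As a rough approximation of what that workload does (see the description in the comments below: trainer plus 7 workers, one training iteration, repeated many times), something along these lines exercises the same pattern. The env and config values are illustrative, not taken from the actual script, and this still depends on Tune/RLlib rather than mock data:

from ray import tune

# Launch many short-lived PPO trials back to back. Each trial brings up a
# trainer with several rollout workers, runs a single training iteration,
# and tears everything down, creating and removing one placement group
# per trial.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",  # illustrative env; the real test may differ
        "num_workers": 7,
    },
    stop={"training_iteration": 1},
    num_samples=10000,  # the release test repeats on this order of magnitude
    verbose=1,
)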

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

Sorry about the delay, Sangbin. There were a lot of fires to put out lately. I took a look at the test, and the application code is extremely simple: it basically creates a trainer and 7 workers, runs one training iteration, and tears everything down. It then does this 10000 times.

The testing script itself doesn’t have a leak, but the real application code here is the entire Tune and RLlib codebase, so this is pretty hard to judge. We recently had a report that RLlib will crash after maybe 12 hours of running, and we are wondering if there is a memory leak problem.
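
One way to check the leak hypothesis (just a sketch; psutil and the log_rss helper below are my additions, not something the test currently does) would be to log the driver's resident memory after each trial and see whether it trends upward over thousands of trials:

import os
import time

import psutil

# Hypothetical helper for diagnosing the suspected leak; not part of many_ppo.py.
proc = psutil.Process(os.getpid())

def log_rss(tag: str) -> None:
    # Resident set size of this process in MiB.
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print(f"[{time.strftime('%H:%M:%S')}] {tag}: driver RSS = {rss_mb:.1f} MiB")

# Call log_rss("trial N finished") from the driver loop (or a Tune callback)
# after each trial; a steady upward trend would point at a leak in the driver
# process rather than in the workers.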

I can keep you updated.