ray: [rllib] Frequent "the actor died unexpectedly before finishing this task" errors with execution ops in Ray/RLlib 0.8.7+
This is not a contribution.
Versions: Python 3.6.8, Ray 1.0, PyTorch 1.6, TensorFlow 1.15, Ubuntu 18.04 (Docker)
Since upgrading to 0.8.7 and 1.0, we have been experiencing multiple stability issues that result in jobs crashing with "The actor died unexpectedly before finishing this task" errors. These issues are quite difficult to reproduce with the default environments provided by RLlib (it often takes over 40 hours with QBert), but with our custom environment they happen much earlier in the run, sometimes as early as 4 minutes in, and they happen very consistently. We never experienced anything like this with 0.8.5 or earlier.
Memory/resources should not be the bottleneck. Even though our custom environments use more memory, we also use nodes with much larger memory capacity for their rollouts, and we monitor them closely via Grafana to make sure all usage stays well below what is available (overall memory usage is usually far below 50%). On every node we assign 30% of the node's memory to the object store, which should be far more than enough given the experience/model sizes.
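For reference, the per-node object store sizing is roughly equivalent to the sketch below. The psutil-based computation is only illustrative; in the actual cluster the computed value is passed to ray start --object-store-memory when each node is launched.
import psutil
import ray

# ~30% of this node's RAM for the Ray object store (illustrative numbers).
object_store_bytes = int(0.30 * psutil.virtual_memory().total)

# On a single node this can be passed directly to ray.init(); in a cluster
# the equivalent value goes to `ray start --object-store-memory` on every node.
ray.init(object_store_memory=object_store_bytes)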
Here’s an example of the errors (produced by the script provided later):
2020-10-05 01:55:09,393 ERROR trial_runner.py:567 -- Trial PPO_QbertNoFrameskip-v4_b43b9_00027: Error processing event.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 515, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 488, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1428, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::PPO.train() (pid=4251, ip=172.30.96.106)
File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 516, in train
raise e
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 505, in train
result = Trainable.train(self)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
result = self.step()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 134, in step
res = next(self.train_exec_impl)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 756, in __next__
return next(self.built_iterator)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 791, in apply_foreach
result = fn(item)
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/execution/metric_ops.py", line 79, in __call__
timeout_seconds=self.timeout_seconds)
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/metrics.py", line 75, in collect_episodes
metric_lists = ray.get(collected)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Here’s another variant of the error when running our own custom environment:
Failure # 1 (occurred at 2020-10-03_02-10-38)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 515, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 488, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1428, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::PPO.train() (pid=524, ip=172.30.58.198)
File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 516, in train
raise e
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 505, in train
result = Trainable.train(self)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
result = self.step()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 134, in step
res = next(self.train_exec_impl)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 756, in __next__
return next(self.built_iterator)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 876, in apply_flatten
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 828, in add_wait_hooks
item = next(it)
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 471, in base_iterator
yield ray.get(futures, timeout=timeout)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Here is the example script that produced the first error, training QBert with PPO. Note that it might take over 40 hours for the error to occur. The setup is a p3.2xlarge instance for the trainer, with the rollout workers on a c5.18xlarge instance; 30% of the memory on each instance is dedicated to the object store.
import copy
import gym
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo
if __name__ == '__main__':
    ray.init(address="auto")
    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 8192,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 256,
        "num_gpus": 1,
        "num_cpus_per_worker": 0.25,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })
    trainer_cls = ppo.PPOTrainer
    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=100,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                     time_attr='training_iteration',
                     metric='episode_reward_mean',
                     mode='max',
                     max_t=2000,
                     grace_period=100,
                     reduction_factor=3,
                     brackets=3),
                 )
One of the things we tried while debugging is storing references to all execution ops in memory, and somehow it helps. We discovered this mitigation almost by accident while debugging our own execution plan. For instance, if we modify the PPO execution plan to also return all execution ops in a list that is held in memory, the time it takes for the job to crash increases significantly and we no longer get the same error. Instead, the error becomes ray.exceptions.ObjectLostError: Object XXXXX is lost due to node failure, which seems to be caused by a node failing its heartbeat check. It's unclear whether our mitigation is just a fluke, whether it points in the right direction for fixing the underlying problem, or whether these errors share the same root cause. Here's the modified script. Note that the new error is no longer guaranteed to be reproducible even when running for a long time, but with our environment it's quite consistent:
import copy
import gym
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo
from ray.rllib.agents.ppo.ppo import UpdateKL, warn_about_bad_reward_scales
from ray.rllib.execution.common import STEPS_SAMPLED_COUNTER, _get_shared_metrics
from ray.rllib.execution.rollout_ops import ParallelRollouts, ConcatBatches, \
StandardizeFields, SelectExperiences
from ray.rllib.execution.train_ops import TrainOneStep
from ray.rllib.execution.metric_ops import StandardMetricsReporting
from ray.rllib.policy.policy import Policy
from ray.rllib.policy.sample_batch import SampleBatch
from ray.util.iter import from_actors
def custom_ppo_execution_plan(workers, config):
    """Copy of PPO's execution plan, except we store all ops in a list and return them."""
    # Modified from ParallelRollouts' bulk_sync mode.
    workers.sync_weights()

    def report_timesteps(batch):
        metrics = _get_shared_metrics()
        metrics.counters[STEPS_SAMPLED_COUNTER] += batch.count
        return batch

    ops = [from_actors(workers.remote_workers())]
    ops.append(ops[-1].batch_across_shards())
    ops.append(ops[-1].for_each(lambda batches: SampleBatch.concat_samples(batches)))
    ops.append(ops[-1].for_each(report_timesteps))
    # Collect batches for the trainable policies.
    ops.append(ops[-1].for_each(
        SelectExperiences(workers.trainable_policies())))
    # Concatenate the SampleBatches into one.
    ops.append(ops[-1].combine(
        ConcatBatches(min_batch_size=config["train_batch_size"])))
    # Standardize advantages.
    ops.append(ops[-1].for_each(StandardizeFields(["advantages"])))
    # Perform one training step on the combined + standardized batch.
    ops.append(ops[-1].for_each(
        TrainOneStep(
            workers,
            num_sgd_iter=config["num_sgd_iter"],
            sgd_minibatch_size=config["sgd_minibatch_size"])))
    # Update KL after each round of training.
    ops.append(ops[-1].for_each(lambda t: t[1]).for_each(UpdateKL(workers)))
    # Warn about bad reward scales and return training metrics.
    return (StandardMetricsReporting(ops[-1], workers, config)
            .for_each(lambda result: warn_about_bad_reward_scales(config, result)),
            ops)


class ExecutionPlanWrapper:
    """A wrapper for custom_ppo_execution_plan that stores all ops in the object."""

    def __init__(self, workers, config):
        self.execution_plan, self.ops = custom_ppo_execution_plan(workers, config)

    def __next__(self):
        # The trainer template only ever calls next() on the execution plan,
        # so implementing __next__ is sufficient here.
        return next(self.execution_plan)


if __name__ == '__main__':
    ray.init(address="auto")
    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 8192,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 256,
        "num_gpus": 1,
        "num_cpus_per_worker": 0.25,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })
    trainer_cls = ppo.PPOTrainer.with_updates(
        name="CustomPPO",
        execution_plan=ExecutionPlanWrapper)
    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=100,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                     time_attr='training_iteration',
                     metric='episode_reward_mean',
                     mode='max',
                     max_t=2000,
                     grace_period=100,
                     reduction_factor=3,
                     brackets=3),
                 )
In the worker logs, we would find the following message around the time we get the object lost error:
2020-10-04 00:19:40,710 WARNING worker.py:1072 -- The node with node id f7c78d2999929f603ebdf4d2c4508f949f6dafb0 has been marked dead because the detector has missed too many heartbeats from it.
Furthermore, sometimes (not always) the node that timed out shows a sharp 2-3x increase in memory usage in Grafana within the last several seconds, which is far more memory than it should ever use. We attempted to mitigate this second error by increasing the num_heartbeats_timeout setting in --system_config, but it doesn't seem to make much difference. None of these issues exist with the old optimizer scheme in 0.8.5 or earlier, where we can train with our custom environment for days without any problem.
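For completeness, the heartbeat-timeout change we attempted looks roughly like the sketch below. Treat it as an approximation: whether the option is spelled --system_config on ray start or _system_config in ray.init depends on the Ray version, and the value shown is arbitrary.
# Sketch of raising the heartbeat timeout. It has to be set when the cluster /
# head node is started, not when connecting with address="auto".
import ray

ray.init(
    _system_config={
        # Number of missed heartbeats before a node is marked dead; the
        # default is much lower, and 300 here is just an example value.
        "num_heartbeats_timeout": 300,
    },
)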
We also encounter a problem where, after a trial terminates, a new trial sometimes does not get started (this can only be reproduced with our environments). It's unclear whether that is related to the issue above, and it has been hard to debug amid these other instability issues. We will likely file a separate, more detailed bug report once this one is addressed.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 89 (25 by maintainers)
I upped my file descriptors to ~16k, but it's still crashing. I'm getting a more specific error now though, not something I've seen before:
Ray worker pid: 24628
WARNING:tensorflow:From /home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-10-21 00:25:36,762 ERROR worker.py:372 -- SystemExit was raised from the worker
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2328, in get_attr
    pywrap_tf_session.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2332, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.
That's the gist of it. It's late here; I'll post a more complete set of logs and make sure I'm not making any mistakes, but I figured I'd pass that along.
With 0.8.5, memory usage is stable for long-running jobs with many actors/nodes.
Yep, 0.8.5 doesn’t centralize actor management, so the probability of collision is much lower (you would have to have collisions within individual trials, which is extremely unlikely since each trial only had ~200 actors). However, the central GCS tracks all actors over all time, so collisions there become inevitable once you cycle through enough actors in your app.
That said it does seem likely your app has other issues. Let’s see if this PR fixes them. If not, we should create a new issue thread since this one is quite overloaded by now.
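To make the collision point concrete, here is a back-of-the-envelope birthday-problem estimate. The ID width below is a made-up placeholder, not Ray's actual actor ID layout; the point is only how fast the probability grows once enough actors have been created over the lifetime of the GCS.
import math

def collision_probability(num_actors: int, id_bits: int) -> float:
    """Approximate P(at least one collision) among num_actors random IDs."""
    id_space = 2.0 ** id_bits
    return 1.0 - math.exp(-num_actors * (num_actors - 1) / (2.0 * id_space))

# Assuming (hypothetically) 32 bits of effective randomness per actor ID:
# ~200 actors (one trial) is negligible, but cycling through a million
# actors over the app's lifetime makes a collision essentially certain.
for n in (200, 50_000, 1_000_000):
    print(n, round(collision_probability(n, id_bits=32), 6))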
@soundway good news, I managed to repro. Filtering the logs with
cat raylet.out | grep -v "soft limit" | grep -v "Unpinning object" | grep -v "Failed to send local" | grep -v "Connected to" | grep -v "Sending local GC" | grep -v "Last heartbeat was sent" | grep -v "took " | grep -v "Failed to kill"
dmesg:
It could be that triggering GC somehow caused a segfault in the worker. I’ll look into trying to reproduce this scenario.
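For what it's worth, the kind of stress test I have in mind is sketched below: a hypothetical toy script (not a confirmed repro) that forces frequent gc.collect() calls inside Ray tasks while churning allocations, mimicking the "Sending local GC" behaviour visible in the raylet logs.
# Hypothetical GC stress sketch, not a confirmed repro.
import gc
import numpy as np
import ray

ray.init()

@ray.remote
def gc_stress(rounds: int) -> int:
    for _ in range(rounds):
        junk = [np.zeros(10_000) for _ in range(100)]  # churn some allocations
        del junk
        gc.collect()  # force a full collection inside the worker
    return rounds

# Run several of these concurrently and watch for dying workers.
print(ray.get([gc_stress.remote(1_000) for _ in range(8)]))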
@ericl In my setup, as soon as I start the Tune cluster I immediately observe issues, like trials not starting even though they are marked as RUNNING. After a while (it can be ~1 day) they are marked as FAILED. Please see the following output (from ~1 day after the beginning of the Tune run). For context, this is a CPU-only cluster with 32 cores and ~128GB RAM per node, with enough nodes/cores to run all trials simultaneously.
Errors:
I tried various configurations today, and I increased my soft/hard ulimit -n to ~63k on head & workers…
Probably the only thing I can say is that, for the most part, the same Trial using fairly limited resources (32 workers) ran fine on a small cluster of roughly 32-150 CPUs, but as the number of nodes increased the Trial would start breaking, mostly right after starting and before completing a full iteration. Yet a couple of times I was able to get the Trial to run for a long time on a larger cluster.
At no point can I sustain large resource usage on a large cluster; everything breaks. Yet I can't come up with an example that replicates it, since my MockEnv runs fine, so clearly something about the custom env/model relative to the basic MockEnv is creating the problems.
For (1), I have tried something like that (varying obs size, model size, and environment memory occupancy), but unfortunately it still takes roughly the same amount of time to crash, and I could not get it to crash within minutes. I haven't done this too rigorously though, and I can revisit it (a sketch of the kind of padded mock environment I tried is at the end of this comment).
I haven't tried running long jobs in the cloud with just one instance yet, but I can try that.
I never use TF; everything I've done here is with Torch (1.6 specifically, but we also see problems with 1.4). I could try TF as well.
I started (4) with the current setup and Q-bert; it hasn't crashed yet, but I will keep you updated.
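For reference, the padded mock environment I used for (1) looks roughly like the sketch below; the class and parameter names are made up for illustration, and our real environment is considerably more complex.
import gym
import numpy as np
from gym import spaces


class PaddedMockEnv(gym.Env):
    """Toy env with tunable observation size and per-actor memory footprint."""

    def __init__(self, config=None):
        config = config or {}
        obs_dim = config.get("obs_dim", 1024)   # vary observation size
        mem_mb = config.get("mem_mb", 512)      # vary resident memory per env
        # Ballast array that just sits in memory to imitate a heavy environment.
        self._ballast = np.ones((mem_mb * 1024 * 1024) // 8, dtype=np.float64)
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(obs_dim,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self._t = 0

    def reset(self):
        self._t = 0
        return self.observation_space.sample()

    def step(self, action):
        self._t += 1
        done = self._t >= 100
        return self.observation_space.sample(), 0.0, done, {}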