ray: [rllib] Frequent “the actor died unexpectedly before finishing this task” errors with execution ops in Ray/RLlib 0.8.7+

This is not a contribution.

Versions: Python 3.6.8, Ray 1.0, PyTorch 1.6, TensorFlow 1.15, OS: Ubuntu 18.04 (Docker)

Since upgrading to 0.8.7 and then 1.0, we have been experiencing multiple stability issues that cause jobs to crash with “The actor died unexpectedly before finishing this task” errors. These issues are quite difficult to reproduce with the default environments provided by RLlib (it often takes over 40 hours with Qbert), but with our custom environment they happen much earlier in the run, sometimes as early as 4 minutes in, and they happen very consistently. We never experienced anything like this with 0.8.5 or earlier. Memory and other resources should not be the bottleneck: although our custom environments use more memory, we also run their rollouts on nodes with much larger memory capacity, and we monitor everything closely via Grafana to make sure all usage stays well below what is available (overall memory usage is usually far below 50%). On every node we dedicate 30% of the node's memory to the object store, which should be far more than enough given the experience/model size.
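
For reference, this is roughly how the 30% object-store allocation is sized. A minimal sketch, assuming psutil is available (it ships with Ray); on the actual cluster the value is passed to ray start --object-store-memory on every node, and the ray.init call below is just the single-node equivalent:

import psutil
import ray

# Give the object store 30% of the node's physical memory. On a multi-node
# cluster this number goes to `ray start --object-store-memory=<bytes>` on
# each node; ray.init(object_store_memory=...) is the single-node equivalent.
object_store_bytes = int(psutil.virtual_memory().total * 0.30)
ray.init(object_store_memory=object_store_bytes)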

Here’s an example of the errors (produced by the script provided later):

2020-10-05 01:55:09,393 ERROR trial_runner.py:567 -- Trial PPO_QbertNoFrameskip-v4_b43b9_00027: Error processing event.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 515, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 488, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::PPO.train() (pid=4251, ip=172.30.96.106)
  File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 516, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 505, in train
    result = Trainable.train(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 134, in step
    res = next(self.train_exec_impl)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 791, in apply_foreach
    result = fn(item)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/execution/metric_ops.py", line 79, in __call__
    timeout_seconds=self.timeout_seconds)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/evaluation/metrics.py", line 75, in collect_episodes
    metric_lists = ray.get(collected)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Here’s another variant of the error when running our own custom environment:

Failure # 1 (occurred at 2020-10-03_02-10-38)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 515, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 488, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::PPO.train() (pid=524, ip=172.30.58.198)
  File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 516, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 505, in train
    result = Trainable.train(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 134, in step
    res = next(self.train_exec_impl)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 876, in apply_flatten
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 828, in add_wait_hooks
    item = next(it)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  [Previous line repeated 1 more time]
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 471, in base_iterator
    yield ray.get(futures, timeout=timeout)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Here’s the example script that produced the first error by training Qbert with PPO. Note that it might take over 40 hours for the error to occur. The setup is a p3.2xlarge instance for the trainer, with the rollout workers on a c5.18xlarge instance. 30% of the memory on each instance is dedicated to the object store.

import copy

import gym
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo


if __name__ == '__main__':
    ray.init(address="auto")

    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 8192,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 256,
        "num_gpus": 1,
        "num_sgd_iter": 1,
        "num_cpus_per_worker": 0.25,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })

    trainer_cls = ppo.PPOTrainer

    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=100,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                    time_attr='training_iteration',
                    metric='episode_reward_mean',
                    mode='max',
                    max_t=2000,
                    grace_period=100,
                    reduction_factor=3,
                    brackets=3),
                 )

One of the things we tried while debugging the problem is keeping references to all execution ops in memory, and somehow it helps. We discovered this mitigation almost accidentally while debugging our own execution plan. For instance, if we modify the PPO execution plan to also return all execution ops in a list that is held in memory, the time it takes for the job to crash increases significantly and we no longer get the same error. Instead, the error becomes ray.exceptions.ObjectLostError: Object XXXXX is lost due to node failure, which appears to be caused by a node failing its heartbeat check. It’s unclear whether this mitigation is just a fluke, whether it points in the right direction for fixing the underlying problem, or whether the two errors share the same root cause. Here’s the modified script. Note that the new error is no longer guaranteed to be reproducible even after running for a long time, but with our environment it is quite consistent:

import copy

import gym
import numpy as np
import ray
import ray.rllib.agents.ppo as ppo
from ray.rllib.agents.ppo.ppo import UpdateKL, warn_about_bad_reward_scales
from ray.rllib.execution.common import STEPS_SAMPLED_COUNTER, _get_shared_metrics
from ray.rllib.execution.rollout_ops import ParallelRollouts, ConcatBatches, \
    StandardizeFields, SelectExperiences
from ray.rllib.execution.train_ops import TrainOneStep
from ray.rllib.execution.metric_ops import StandardMetricsReporting
from ray.rllib.policy.policy import Policy
from ray.rllib.policy.sample_batch import SampleBatch
from ray.util.iter import from_actors


def custom_ppo_execution_plan(workers, config):
    """Copy of PPO's execution plan, except we store all ops in a list and return them."""
    # Modified from ParallelRollouts' bulk_sync mode.
    workers.sync_weights()
    def report_timesteps(batch):
        metrics = _get_shared_metrics()
        metrics.counters[STEPS_SAMPLED_COUNTER] += batch.count
        return batch
    ops = [from_actors(workers.remote_workers())]
    ops.append(ops[-1].batch_across_shards())
    ops.append(ops[-1].for_each(lambda batches: SampleBatch.concat_samples(batches)))
    ops.append(ops[-1].for_each(report_timesteps))

    # Collect batches for the trainable policies.
    ops.append(ops[-1].for_each(
        SelectExperiences(workers.trainable_policies())))
    # Concatenate the SampleBatches into one.
    ops.append(ops[-1].combine(
        ConcatBatches(min_batch_size=config["train_batch_size"])))
    # Standardize advantages.
    ops.append(ops[-1].for_each(StandardizeFields(["advantages"])))

    # Perform one training step on the combined + standardized batch.
    ops.append(ops[-1].for_each(
        TrainOneStep(
            workers,
            num_sgd_iter=config["num_sgd_iter"],
            sgd_minibatch_size=config["sgd_minibatch_size"])))

    # Update KL after each round of training.
    ops.append(ops[-1].for_each(lambda t: t[1]).for_each(UpdateKL(workers)))

    # Warn about bad reward scales and return training metrics.
    metrics_op = StandardMetricsReporting(ops[-1], workers, config).for_each(
        lambda result: warn_about_bad_reward_scales(config, result))
    return metrics_op, ops

class ExecutionPlanWrapper:
    """A wrapper for custom_ppo_execution_plan that stores all ops in the object."""

    def __init__(self, workers, config):
        self.execution_plan, self.ops = custom_ppo_execution_plan(workers, config)

    def __next__(self):
        return next(self.execution_plan)


if __name__ == '__main__':
    ray.init(address="auto")

    config = copy.deepcopy(ppo.DEFAULT_CONFIG)
    config.update({
        "rollout_fragment_length": 32,
        "train_batch_size": 8192,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 1,
        "num_workers": 256,
        "num_gpus": 1,
        "num_sgd_iter": 1,
        "num_cpus_per_worker": 0.25,
        "num_cpus_for_driver": 1,
        "model": {"fcnet_hiddens": [1024, 1024]},
        "framework": "torch",
        "lr": ray.tune.sample_from(lambda s: np.random.random()),
    })

    trainer_cls = ppo.PPOTrainer.with_updates(
        name="CustomPPO",
        execution_plan=ExecutionPlanWrapper)

    config["env"] = "QbertNoFrameskip-v4"
    ray.tune.run(trainer_cls,
                 config=config,
                 fail_fast=True,
                 reuse_actors=False,
                 queue_trials=True,
                 num_samples=100,
                 scheduler=ray.tune.schedulers.ASHAScheduler(
                    time_attr='training_iteration',
                    metric='episode_reward_mean',
                    mode='max',
                    max_t=2000,
                    grace_period=100,
                    reduction_factor=3,
                    brackets=3),
                 )

In the worker logs, we would find the following message around the time we get the object lost error:

2020-10-04 00:19:40,710 WARNING worker.py:1072 -- The node with node id f7c78d2999929f603ebdf4d2c4508f949f6dafb0 has been marked dead because the detector has missed too many heartbeats from it.

Further, sometimes (not always) the node that timed out shows a sharp 2-3x increase in memory usage in Grafana within a few seconds of the failure, far more memory than it should be using. We attempted to mitigate this second error by increasing the num_heartbeats_timeout setting via --system_config, but it doesn’t seem to make much difference. None of these issues exist with the old optimizer scheme in 0.8.5 or earlier, where we can train with our custom environment for days without any problem.
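
For reference, this is roughly how we raise the heartbeat timeout when starting the head node. A minimal sketch; the --system-config flag spelling and the num_heartbeats_timeout key are taken from the Ray 1.0-era internals mentioned above and may differ in other versions, so check ray start --help before copying this:

import json
import subprocess

# Assumed flag/key names (see the caveat above). We raised the number of
# missed heartbeats tolerated before a node is marked dead well above the
# default; 300 here is an illustrative value, not a recommendation.
system_config = json.dumps({"num_heartbeats_timeout": 300})
subprocess.run(
    ["ray", "start", "--head", "--system-config=" + system_config],
    check=True,
)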

We also encounter a problem where, after a trial terminates, a new trial sometimes doesn’t get started (this can only be reproduced with our environments). It’s unclear whether that is related to the issue above at all, and it has been hard to debug alongside these other instability issues. We’ll likely file a separate, more detailed bug report for it once this is addressed.


Most upvoted comments

I upped my file descriptors to ~16k… still crashing, but getting a more specific error now… not something I’ve seen before…

Ray worker pid: 24628
WARNING:tensorflow:From /home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-10-21 00:25:36,762 ERROR worker.py:372 -- SystemExit was raised from the worker
Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2328, in get_attr
    pywrap_tf_session.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/home/svc-tai-dev/virt/algo_37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2332, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'default_policy/Sum_4' has no attr named '_XlaCompile'.

That’s the gist of it… It’s late here, and I’ll post a more complete set of logs and make sure I am not making any mistakes… but figured I’d pass that along…

On Oct 20, 2020, at 9:12 PM, Eric Liang notifications@github.com wrote:

@soundway @waldroje after experimenting with 1.0 vs 0.8.5 more, I think the main difference is we use 2-3x more file descriptors due to the change in the way actors are managed with the GCS service; it’s not really a leak. I’ll put more details in the linked bug.

I believe that increasing the file descriptor limit (the ulimit -n value) will resolve the problem; can you try increasing the limit to 10000 or more? The number of fds opened seems to stabilize at just a few thousand.

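For reference, a minimal sketch of checking and raising the per-process file-descriptor soft limit from the driver before calling ray.init (standard library only; going beyond the hard limit still requires raising ulimit -n / limits.conf at the OS level):

import resource

# Query the current soft/hard limits for open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 16384  # illustrative target, well above the "few thousand" fds observed
if soft < target:
    # From user space the soft limit can only be raised up to the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(target, hard), hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE))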

With 0.8.5 all memory usages are stable for long running jobs with many actors/nodes.

Yep, 0.8.5 doesn’t centralize actor management, so the probability of collision is much lower (you would have to have collisions within individual trials, which is extremely unlikely since each trial only had ~200 actors). However, the central GCS tracks all actors over all time, so collisions there become inevitable once you cycle through enough actors in your app.

That said it does seem likely your app has other issues. Let’s see if this PR fixes them. If not, we should create a new issue thread since this one is quite overloaded by now.

@soundway good news, I managed to repro. Filtering the logs with

cat raylet.out | grep -v "soft limit" | grep -v "Unpinning object" | grep -v "Failed to send local" | grep -v "Connected to" | grep -v "Sending local GC" | grep -v "Last heartbeat was sent" | grep -v "took " | grep -v "Failed to kill"

[2020-12-11 08:38:20,767 W 119255 119255] node_manager.cc:3209: Broadcasting global GC request to all raylets. This is usually because clusters have memory pressure, and ray needs to GC unused memory.
[2020-12-11 08:49:44,024 W 119255 119255] node_manager.cc:3209: Broadcasting global GC request to all raylets. This is usually because clusters have memory pressure, and ray needs to GC unused memory.
[2020-12-11 08:49:44,176 I 119255 119255] node_manager.cc:819: Owner process fa751d03dfb76daaa64c29074758bc6e972fc5b7 died, killing leased worker 4a2ee6805a94a4537df3266bab953bbf2c2c1fe1
[2020-12-11 08:49:44,176 I 119255 119255] node_manager.cc:819: Owner process fa751d03dfb76daaa64c29074758bc6e972fc5b7 died, killing leased worker d4ebdcaed62ef83a85ed7182ef3240c8eea3a208
[2020-12-11 08:49:44,176 I 119255 119255] node_manager.cc:819: Owner process fa751d03dfb76daaa64c29074758bc6e972fc5b7 died, killing leased worker 193750a83443f8da74038e6564d8b979c6356988

dmesg:

[Thu Dec 10 02:46:50 2020] ray::PPO[120939]: segfault at 8 ip 00007f4830ce5bce sp 00007ffee9355c60 error 4 in libtorch.so[7f482ca53000+2f7bf000]
[Thu Dec 10 02:46:50 2020] Code: 31 d2 55 53 48 83 ec 08 48 63 2e 4c 8b 4f 08 48 89 e8 49 f7 f1 48 8b 07 4c 8b 14 d0 48 89 d3 4d 85 d2 74 50 49 8b 0a 49 89 eb <44> 8b 41 08 eb 23 0f 1f 40 00 48 8b 01 48 85 c0 74 38 44 8b 40 08
[Thu Dec 10 19:49:06 2020] ray::PPO[95647]: segfault at 68d4b2ab ip 00007f82fc624bce sp 00007ffd457c8f90 error 4 in libtorch.so[7f82f8392000+2f7bf000]
[Thu Dec 10 19:49:06 2020] Code: 31 d2 55 53 48 83 ec 08 48 63 2e 4c 8b 4f 08 48 89 e8 49 f7 f1 48 8b 07 4c 8b 14 d0 48 89 d3 4d 85 d2 74 50 49 8b 0a 49 89 eb <44> 8b 41 08 eb 23 0f 1f 40 00 48 8b 01 48 85 c0 74 38 44 8b 40 08
[Thu Dec 10 23:53:53 2020] ray::PPO[99828]: segfault at 48 ip 00007f043a353bce sp 00007ffd63409bc0 error 4 in libtorch.so[7f04360c1000+2f7bf000]
[Thu Dec 10 23:53:53 2020] Code: 31 d2 55 53 48 83 ec 08 48 63 2e 4c 8b 4f 08 48 89 e8 49 f7 f1 48 8b 07 4c 8b 14 d0 48 89 d3 4d 85 d2 74 50 49 8b 0a 49 89 eb <44> 8b 41 08 eb 23 0f 1f 40 00 48 8b 01 48 85 c0 74 38 44 8b 40 08
[Fri Dec 11 03:40:55 2020] traps: ray::PPO[64854] general protection fault ip:7f4dc8ce5bce sp:7ffe8c2291d0 error:0 in libtorch.so[7f4dc4a53000+2f7bf000]
[Fri Dec 11 09:07:51 2020] traps: ray::PPO[113262] general protection fault ip:7ece9f666bce sp:7fff943fa950 error:0 in libtorch.so[7ece9b3d4000+2f7bf000]

It could be that triggering GC somehow caused a segfault in the worker. I’ll look into trying to reproduce this scenario.

I’ll spend more time trying to reproduce this. Seems like I might just have to bite the bullet of waiting a couple days for a crash.

As usual, if anyone has a fast reproducible crash that would be very helpful.

@ericl In my setup, as soon as I start the Tune cluster I immediately observe issues like trials not starting even though they are marked as RUNNING. After a while (can be ~1 day) they are marked as FAILED. Please see the following output, taken about a day after the Tune run started. For context, this is a CPU-only cluster with 32 cores/node and ~128 GB RAM per node, with enough nodes/cores to run all trials simultaneously.


+------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+
| Trial name                                                 | status   | loc               |   num_envs_per_worker |   num_workers |   rollout_fragment_length |   iter |   total time (s) |      ts |   reward |
|------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------|
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00000 | ERROR    |                   |                     3 |            31 |                        50 |    570 |          4869.03 | 2897185 | 0.686032 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00001 | RUNNING  | 10.120.253.10:69  |                     5 |            31 |                        50 |    480 |          6518.52 | 2470741 | 0.705774 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00002 | RUNNING  | 10.120.253.5:81   |                     3 |            93 |                        50 |    714 |          4301.41 | 3628635 | 0.748065 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00003 | RUNNING  |                   |                     5 |            93 |                        50 |        |                  |         |          |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00004 | RUNNING  | 10.120.252.231:75 |                     3 |           155 |                        50 |   1587 |          1951.24 | 8107295 | 0.671751 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00005 | RUNNING  | 10.120.253.66:74  |                     5 |           155 |                        50 |    453 |          6440.03 | 2283493 | 0.540052 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00006 | RUNNING  | 10.120.252.198:79 |                     3 |            31 |                       100 |    433 |          6805.06 | 2244494 | 0.679827 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00007 | RUNNING  | 10.120.252.243:89 |                     5 |            31 |                       100 |    310 |          9105.32 | 1588171 | 0.507826 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00008 | RUNNING  | 10.120.253.53:85  |                     3 |            93 |                       100 |    443 |          6774.8  | 2279045 | 0.50638  |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00009 | RUNNING  | 10.120.252.222:79 |                     5 |            93 |                       100 |    330 |          8890.16 | 1680894 | 0.513259 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00010 | RUNNING  | 10.120.252.229:76 |                     3 |           155 |                       100 |    473 |          6002.71 | 2437666 | 0.542599 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00011 | RUNNING  | 10.120.252.198:82 |                     5 |           155 |                       100 |    312 |          8455.52 | 1593173 | 0.417053 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00012 | RUNNING  | 10.120.252.216:77 |                     3 |            31 |                       200 |    270 |          8964.67 | 1484697 | 0.488652 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00013 | RUNNING  | 10.120.252.201:61 |                     5 |            31 |                       200 |    151 |         10504.3  |  762662 | 0.458328 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00014 | RUNNING  | 10.120.252.202:78 |                     3 |            93 |                       200 |    261 |          8668.87 | 1437558 | 0.478272 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00015 | ERROR    |                   |                     5 |            93 |                       200 |    127 |          9100.73 |  640695 | 0.432807 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00016 | RUNNING  | 10.120.252.209:76 |                     3 |           155 |                       200 |    249 |          9255.21 | 1368668 | 0.437783 |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00017 | RUNNING  | 10.120.253.25:87  |                     5 |           155 |                       200 |     46 |         21294.9  |  233320 | 0.596508 |
+------------------------------------------------------------+----------+-------------------+-----------------------+---------------+---------------------------+--------+------------------+---------+----------+
Number of errored trials: 2
+------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                                                 |   # failures | error file                                                                                                                                                                                          |
|------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00000 |            1 | /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_0_num_envs_per_worker=3,num_workers=31,rollout_fragment_length=50_2020-11-17_20-47-526qieqnu0/error.txt   |
| new_base_NegotiationMergeHandoffFullLateralEnv_122ab_00015 |            1 | /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_15_num_envs_per_worker=5,num_workers=93,rollout_fragment_length=200_2020-11-17_20-47-53s2c_xath/error.txt |
+------------------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Errors:

$kz430x@dcmixphhpc009:~> cat /home/kz430x/ray_results/stress_testing_54/new_base_NegotiationMergeHandoffFullLateralEnv_0_num_envs_per_worker=3,num_workers=31,rollout_fragment_length=50_2020-11-17_20-47-5qieqnu0/error.txt
Failure # 1 (occurred at 2020-11-18_03-14-29)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 468, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1474, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::new_base.train() (pid=68, ip=10.120.253.26)
  File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 400, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 497, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 486, in train
    result = Trainable.train(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 261, in train
    result = self._train()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 132, in _train
    return self._train_exec_impl()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 170, in _train_exec_impl
    res = next(self.train_exec_impl)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 731, in __next__
    return next(self.built_iterator)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 814, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 814, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 799, in add_wait_hooks
    item = next(it)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 744, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 525, in base_iterator
    batch = ray.get(obj_id)
ray.exceptions.UnreconstructableError: Object 0d4ced372d8e5c5ffc7dfb5a010000c001000000 is lost (either LRU evicted or deleted by user) and cannot be reconstructed. Try increasing the object store memory available with ray.init(object_store_memory=<bytes>) or setting object store limits with ray.remote(object_store_memory=<bytes>). See also: https://docs.ray.io/en/latest/memory-management.html

I tried various configurations today, and I increased my soft/hard ulimit -n to ~63k on head & workers…

Probably the only thing I can say is that for the most part, the same Trial, using fairly limited resources (32 workers), ran fine on a small cluster of roughly 32-150 CPUs… but as the number of nodes increased, the Trial would begin breaking, mostly right after starting and before completing a full iteration… Yet a couple of times I was able to get the Trial to run for a long time on a larger cluster…

At no time can I ever get large resource usage on a large cluster… everything breaks… yet I can’t seem to come up with an example to replicate it, as my MockEnv runs fine… so clearly something about the custom env/model relative to the basic MockEnv is creating problems…

On Oct 21, 2020, at 2:15 PM, Eric Liang notifications@github.com wrote:

I see, minutes would be ideal but even a couple hours isn’t too bad. 40 hours is a bit hard to deal with though 😕


For (1), I have tried something like that (varying the obs size, model size, and environment memory footprint), but unfortunately it still takes roughly the same amount of time to crash, and I could not get it to crash within minutes. I haven’t done this too rigorously, though, and can revisit it.
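
For context, a minimal sketch of the kind of synthetic stress environment meant in (1). The class name and parameters are hypothetical (this is not our actual environment); it only inflates the observation size and per-env memory footprint:

import gym
import numpy as np
from gym.spaces import Box, Discrete


class MemoryStressEnv(gym.Env):
    """Hypothetical toy env: tune obs_dim / held_mb to vary memory pressure."""

    def __init__(self, config=None):
        config = config or {}
        self.obs_dim = config.get("obs_dim", 4096)
        # Ballast held per env instance to mimic a memory-heavy custom env
        # (held_mb megabytes of float64 zeros).
        self._ballast = np.zeros(int(config.get("held_mb", 256)) * 2**20 // 8)
        self.observation_space = Box(-1.0, 1.0, shape=(self.obs_dim,), dtype=np.float32)
        self.action_space = Discrete(2)
        self._t = 0

    def reset(self):
        self._t = 0
        return np.random.uniform(-1, 1, self.obs_dim).astype(np.float32)

    def step(self, action):
        self._t += 1
        obs = np.random.uniform(-1, 1, self.obs_dim).astype(np.float32)
        return obs, float(action), self._t >= 200, {}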

I haven’t tried running long jobs in the cloud with just one instance yet, but I can try that.

I never use TF; everything I’ve done here is with Torch (1.6 specifically, but we also see the problem with 1.4). I could try TF as well.

I started (4) with the current setup with Qbert and it hasn’t crashed yet, but I will give you updates on this.