ray: Possible memory leak in Ape-X
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
- Ray installed from (source or binary): binary
- Ray version: 0.6.0
- Python version: 2.7
- Exact command to reproduce: rllib train -f crash.yaml
You can run this on any 64-core CPU machine:
crash.yaml:
apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
        num_gpus: 0

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
        sample_batch_size: 20
        train_batch_size: 1
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000
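To check whether memory is actually leaking while the above runs, it can help to sample system and object-store memory alongside the experiment. The snippet below is a minimal monitoring sketch I added for illustration (not part of the original report); it assumes psutil is installed and simply matches Ray's plasma_store/raylet processes by name.

    # Hypothetical helper to run next to `rllib train -f crash.yaml`.
    # Assumes `pip install psutil`; process names are matched by substring.
    import time
    import psutil

    def log_memory(interval_s=10):
        """Print system memory usage and total RSS of plasma/raylet processes."""
        while True:
            vm = psutil.virtual_memory()
            store_rss = 0
            for proc in psutil.process_iter(attrs=["name", "memory_info"]):
                name = (proc.info["name"] or "").lower()
                if "plasma" in name or "raylet" in name:
                    store_rss += proc.info["memory_info"].rss
            print("system used: %.1f GiB, plasma/raylet RSS: %.1f GiB"
                  % (vm.used / 2.0 ** 30, store_rss / 2.0 ** 30))
            time.sleep(interval_s)

    if __name__ == "__main__":
        log_memory()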
Describe the problem
Running the Ape-X config above eventually crashes worker tasks with an ArrowIOError ("Broken pipe") raised from the plasma object store, which suggests the object store is running out of memory (possibly a leak).
Source code / logs
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/workers/default_worker.py", line 99, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 1010, in main_loop
    self._wait_for_and_process_task(task)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 967, in _wait_for_and_process_task
    self._process_task(task, execution_info)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 865, in _process_task
    traceback_str)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 889, in _handle_process_task_failure
    self._store_outputs_in_object_store(return_object_ids, failure_objects)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 798, in _store_outputs_in_object_store
    self.put_object(object_ids[i], outputs[i])
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 411, in put_object
    self.store_and_register(object_id, value)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 346, in store_and_register
    self.task_driver_id))
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 534, in pyarrow._plasma.PlasmaClient.put
    buffer = self.create(target_id, serialized.total_bytes)
  File "pyarrow/_plasma.pyx", line 344, in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
    raise ArrowIOError(message)
ArrowIOError: Broken pipe
This error is unexpected and should not have happened. Somehow a worker
crashed in an unanticipated way causing the main_loop to throw an exception,
which is being caught in "python/ray/workers/default_worker.py".
The rest of the experiment keeps running, but the particular trial fails.
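If the plasma store is indeed being exhausted, one workaround that may be worth trying (my assumption, not something verified in this issue) is to cap the object store size explicitly so eviction kicks in before the store process dies. A rough sketch of launching the experiment programmatically with such a cap, assuming a Ray version whose ray.init accepts object_store_memory (older releases expose the same limit via `ray start --object-store-memory`):

    # Sketch only: cap the plasma object store at ~20 GiB, then launch one of
    # the Ape-X trials via tune instead of the `rllib train` CLI.
    import ray
    from ray import tune

    ray.init(object_store_memory=20 * 1024 ** 3)  # bytes

    tune.run_experiments({
        "apex": {
            "run": "APEX",
            "config": {
                "env": "BreakoutNoFrameskip-v4",
                "num_workers": 8,
                "num_envs_per_worker": 8,
                "buffer_size": 1000000,
            },
        },
    })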
@ericl and I determined that error messages like "The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang." are expected, but we should fix the backend so that the job doesn’t hang. I’m currently working on a PR to treat the task as failed if the object really has been evicted.
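Until that fix lands, one way a driver script can avoid blocking forever on an output that may have been evicted is to poll with a timeout instead of calling ray.get directly. This is only an illustrative workaround on my part, not the backend fix described above; also note that ray.wait's timeout is in seconds on recent Ray versions but was in milliseconds on older ones.

    # Illustrative workaround: hypothetical helper that fails fast instead of
    # hanging when an output never becomes available.
    import ray

    def get_with_timeout(object_id, timeout=60):
        ready, _ = ray.wait([object_id], num_returns=1, timeout=timeout)
        if not ready:
            raise RuntimeError(
                "Result %s not available after %ss; it may have been evicted."
                % (object_id, timeout))
        return ray.get(ready[0])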