ray: Possible memory leak in Ape-X

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.0
  • Python version: 2.7
  • Exact command to reproduce: rllib train -f crash.yaml

You can run this on any 64-core CPU machine:

crash.yaml:

apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
        num_gpus: 0

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
        sample_batch_size: 20
        train_batch_size: 1
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000
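
For a rough idea of what `rllib train -f crash.yaml` does with this file, the same sweep can also be launched directly from Python. The sketch below is an assumption based on the Ray 0.6-era ray.tune API (run_experiments and grid_search), not something taken from the issue; only a few config keys are repeated, the rest carry over from crash.yaml unchanged.

# Sketch: launching the same grid search programmatically
# (assumes the Ray 0.6-era tune API; illustrative, not from the issue).
import ray
from ray import tune

ray.init()

tune.run_experiments({
    "apex": {
        "run": "APEX",
        "config": {
            # "env" is a top-level key in the YAML; `rllib train` merges it
            # into the trainer config, so here it sits under "config".
            "env": tune.grid_search([
                "BreakoutNoFrameskip-v4",
                "BeamRiderNoFrameskip-v4",
                "QbertNoFrameskip-v4",
                "SpaceInvadersNoFrameskip-v4",
            ]),
            "num_workers": 8,
            "num_envs_per_worker": 8,
            "buffer_size": 1000000,
            # ... plus the remaining keys from crash.yaml above ...
        },
    },
})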

Describe the problem

Source code / logs

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/workers/default_worker.py", line 99, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 1010, in main_loop
    self._wait_for_and_process_task(task)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 967, in _wait_for_and_process_task
    self._process_task(task, execution_info)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 865, in _process_task
    traceback_str)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 889, in _handle_process_task_failure
    self._store_outputs_in_object_store(return_object_ids, failure_objects)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 798, in _store_outputs_in_object_store
    self.put_object(object_ids[i], outputs[i])
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 411, in put_object
    self.store_and_register(object_id, value)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 346, in store_and_register
    self.task_driver_id))
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 534, in pyarrow._plasma.PlasmaClient.put
    buffer = self.create(target_id, serialized.total_bytes)
  File "pyarrow/_plasma.pyx", line 344, in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
    raise ArrowIOError(message)
ArrowIOError: Broken pipe

  This error is unexpected and should not have happened. Somehow a worker
  crashed in an unanticipated way causing the main_loop to throw an exception,
  which is being caught in "python/ray/workers/default_worker.py".
  

The rest of the experiment keeps running, but that particular trial fails.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

@ericl and I determined that error messages like "The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang." are expected, but we should fix the backend so that the job doesn’t hang. I’m currently working on a PR to treat the task as failed if the object really has been evicted.
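
For context on that warning: actor task outputs live in the plasma object store and can be evicted when the store fills up, and unlike ordinary task outputs they cannot simply be recomputed, so a later ray.get on an evicted result is what used to hang. Below is a minimal sketch of that failure mode; the store size, object sizes, and the object_store_memory argument are illustrative assumptions, not values from the issue.

import numpy as np
import ray

# Deliberately small object store so eviction kicks in quickly
# (object_store_memory is assumed to be accepted by this Ray version).
ray.init(object_store_memory=200 * 1024 * 1024)

@ray.remote
class Producer(object):
    def make_block(self, i):
        # ~40 MB of float64 per call.
        return np.zeros(5 * 1024 * 1024)

producer = Producer.remote()

# Produce far more output than the store can hold; older results get evicted.
ids = [producer.make_block.remote(i) for i in range(50)]

# If the first result has been evicted by now, fetching it is what triggers
# the warning quoted above and, before the fix, could hang the job.
print(ray.get(ids[0]).shape)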