ray: [rllib] Slowly running out of memory in eager + tracing

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): source
  • Ray version: 0.6.5
  • Python version: 3.6.8
  • Exact command to reproduce: rllib train --run=APEX --env=BreakoutNoFrameskip-v4 --ray-object-store-memory 10000000000
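
For reference, this is roughly the Python equivalent of the command above (a sketch only, assuming the Ray/Tune 0.6.x API; the experiment name "apex-breakout" is arbitrary):

import ray
from ray import tune

# Reserve ~10 GB for the object store, matching --ray-object-store-memory above.
ray.init(object_store_memory=10 * 10**9)

tune.run_experiments({
    "apex-breakout": {  # arbitrary experiment name
        "run": "APEX",
        "config": {"env": "BreakoutNoFrameskip-v4"},
    },
})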

Describe the problem

The Agent class slowly grows in memory until the machine runs out of RAM. The same happens with APPO. With the Atari command line above it takes ~10M steps, but with my own environment, which has a larger observation space and itself consumes a lot of RAM, it happens faster.

Memory usage starts at around 32GB (out of 64GB) and slowly grows to 64GB over 10M steps, at which point the run crashes.
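
One way to confirm that it is the trainer process that grows (rather than the object store) is to sample its resident and shared memory over time. A minimal sketch, assuming psutil is available on the node; the PID is the ray_ApexAgent:train() worker (44603 in the log below):

import time
import psutil

def watch_process(pid, interval_s=60):
    # Print resident (RES) and shared (SHR) memory of one Ray worker over time.
    proc = psutil.Process(pid)
    while True:
        mem = proc.memory_info()  # on Linux this includes the `shared` field
        print("rss=%.2f GB  shared=%.2f GB" % (mem.rss / 1e9, mem.shared / 1e9))
        time.sleep(interval_s)

watch_process(44603)  # PID of ray_ApexAgent:train() from the log below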

Source code / logs


2019-03-28 14:59:16,458 ERROR trial_runner.py:460 -- Error processing event.
Traceback (most recent call last):
  File "/home/opher/ray_0.6.5/python/ray/tune/trial_runner.py", line 409, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/opher/ray_0.6.5/python/ray/tune/ray_trial_executor.py", line 314, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/opher/ray_0.6.5/python/ray/worker.py", line 2316, in get
    raise value
ray.exceptions.RayTaskError: ray_ApexAgent:train() (pid=44603, host=osrv)
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 316, in train
    raise e
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 305, in train
    result = Trainable.train(self)
  File "/home/opher/ray_0.6.5/python/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/dqn/dqn.py", line 261, in _train
    self.optimizer.step()
  File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 118, in step
    sample_timesteps, train_timesteps = self._step()
  File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 188, in _step
    counts = ray.get([c[1][1] for c in completed])
ray.exceptions.RayTaskError: ray_PolicyEvaluator:sample_with_count() (pid=44621, host=osrv)
  File "/home/opher/ray_0.6.5/python/ray/memory_monitor.py", line 77, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node osrv is used (64.09 / 67.46 GB). The top 5 memory consumers are:

PID     MEM     COMMAND
44603   34.89GB ray_ApexAgent:train()
44591   12.94GB ray_ReplayActor:add_batch()
44612   12.91GB ray_ReplayActor:add_batch()
44617   12.83GB ray_ReplayActor:add_batch()
44632   12.83GB ray_ReplayActor:add_batch()

In addition, ~10.46 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.

The above numbers can’t be real, since I only have 64GB on this machine. They match what ‘top’ shows in the ‘RES’ column, but I think RES also includes the SHR (shared) memory, which was around 10GB for each of these processes, so the actual numbers are probably ~24GB for the agent and ~3GB for each replay actor.
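
For reference, the RES-minus-SHR estimate above can be computed directly with psutil on Linux (a rough sketch; the PIDs come from the memory monitor output, and `shared` corresponds roughly to top’s SHR column, i.e. mostly object-store pages):

import psutil

for pid in [44603, 44591, 44612, 44617, 44632]:  # PIDs from the memory monitor output
    mem = psutil.Process(pid).memory_info()
    private_gb = (mem.rss - mem.shared) / 1e9    # RES minus SHR, as estimated above
    print("pid=%d  rss=%.1f GB  shared=%.1f GB  private~%.1f GB"
          % (pid, mem.rss / 1e9, mem.shared / 1e9, private_gb))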

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (5 by maintainers)

Most upvoted comments

That script runs correctly. Sorry for my carelessness: I calculated the memory consumption for my case and found that the batch size is too large, so consuming this much memory is reasonable. Insufficient memory is to blame.
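
For anyone hitting the same thing, a back-of-the-envelope estimate of replay memory makes this easy to sanity-check up front (illustrative numbers only, assuming uncompressed 84x84x4 uint8 Atari observations and a 2M-transition buffer; RLlib can compress stored observations, which lowers this considerably):

# Rough replay-memory estimate (illustrative numbers, not my actual config).
obs_bytes = 84 * 84 * 4              # one stacked uint8 Atari observation, ~28 KB
per_transition = 2 * obs_bytes       # obs + next_obs dominate; actions/rewards are negligible
buffer_size = 2_000_000              # example replay capacity across all replay actors
total_gb = buffer_size * per_transition / 1e9
print("approx. uncompressed replay memory: %.0f GB" % total_gb)  # ~113 GB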