ray: [rllib] Slowly running out of memory in eager + tracing

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): source
  • Ray version: 0.6.5
  • Python version: 3.6.8
  • Exact command to reproduce: rllib train --run=APEX --env=BreakoutNoFrameskip-v4 --ray-object-store-memory 10000000000
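
For reference, this is roughly the Python equivalent of the command above (a sketch only, assuming the Ray/Tune 0.6.x API; the experiment name "apex-breakout" is arbitrary):

import ray
from ray import tune

# Reserve ~10 GB for the object store, matching --ray-object-store-memory above.
ray.init(object_store_memory=10 * 10**9)

tune.run_experiments({
    "apex-breakout": {  # arbitrary experiment name
        "run": "APEX",
        "config": {"env": "BreakoutNoFrameskip-v4"},
    },
})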

Describe the problem

The Agent class slowly grows in memory until the machine runs out of RAM. The same happens with APPO. With the Atari command line above it takes ~10M steps, but with my own environment, which has a larger observation space and itself consumes a lot of RAM, it happens faster.

Memory usage starts at around 32GB (out of 64GB) and slowly grows to 64GB over 10M steps, at which point the run crashes.
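
One way to confirm that it is the trainer process that grows (rather than the object store) is to sample its resident and shared memory over time. A minimal sketch, assuming psutil is available on the node; the PID is the ray_ApexAgent:train() worker (44603 in the log below):

import time
import psutil

def watch_process(pid, interval_s=60):
    # Print resident (RES) and shared (SHR) memory of one Ray worker over time.
    proc = psutil.Process(pid)
    while True:
        mem = proc.memory_info()  # on Linux this includes the `shared` field
        print("rss=%.2f GB  shared=%.2f GB" % (mem.rss / 1e9, mem.shared / 1e9))
        time.sleep(interval_s)

watch_process(44603)  # PID of ray_ApexAgent:train() from the log below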

Source code / logs


2019-03-28 14:59:16,458 ERROR trial_runner.py:460 -- Error processing event.
Traceback (most recent call last):
  File "/home/opher/ray_0.6.5/python/ray/tune/trial_runner.py", line 409, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/opher/ray_0.6.5/python/ray/tune/ray_trial_executor.py", line 314, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/opher/ray_0.6.5/python/ray/worker.py", line 2316, in get
    raise value
ray.exceptions.RayTaskError: ray_ApexAgent:train() (pid=44603, host=osrv)
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 316, in train
    raise e
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 305, in train
    result = Trainable.train(self)
  File "/home/opher/ray_0.6.5/python/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/dqn/dqn.py", line 261, in _train
    self.optimizer.step()
  File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 118, in step
    sample_timesteps, train_timesteps = self._step()
  File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 188, in _step
    counts = ray.get([c[1][1] for c in completed])
ray.exceptions.RayTaskError: ray_PolicyEvaluator:sample_with_count() (pid=44621, host=osrv)
  File "/home/opher/ray_0.6.5/python/ray/memory_monitor.py", line 77, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node osrv is used (64.09 / 67.46 GB). The top 5 memory consumers are:

PID     MEM     COMMAND
44603   34.89GB ray_ApexAgent:train()
44591   12.94GB ray_ReplayActor:add_batch()
44612   12.91GB ray_ReplayActor:add_batch()
44617   12.83GB ray_ReplayActor:add_batch()
44632   12.83GB ray_ReplayActor:add_batch()

In addition, ~10.46 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.

The above numbers can’t be real, since I only have 64GB on this machine. They match what ‘top’ shows in the ‘RES’ column, but I think RES also includes the SHR (shared) memory, which was around 10GB for each of these processes, so the actual numbers are probably ~24GB for the agent and ~3GB for each replay actor.
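
For reference, the RES-minus-SHR estimate above can be computed directly with psutil on Linux (a rough sketch; the PIDs come from the memory monitor output, and `shared` corresponds roughly to top’s SHR column, i.e. mostly object-store pages):

import psutil

for pid in [44603, 44591, 44612, 44617, 44632]:  # PIDs from the memory monitor output
    mem = psutil.Process(pid).memory_info()
    private_gb = (mem.rss - mem.shared) / 1e9    # RES minus SHR, as estimated above
    print("pid=%d  rss=%.1f GB  shared=%.1f GB  private~%.1f GB"
          % (pid, mem.rss / 1e9, mem.shared / 1e9, private_gb))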

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (5 by maintainers)

Most upvoted comments

That script runs correctly. Sorry for my carelessness: I calculated the memory consumption for my case and found that the batch size is too large, so consuming this much memory is reasonable. Insufficient memory is to blame.
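
For anyone hitting the same thing, a back-of-the-envelope estimate of replay memory makes this easy to sanity-check up front (illustrative numbers only, assuming uncompressed 84x84x4 uint8 Atari observations and a 2M-transition buffer; RLlib can compress stored observations, which lowers this considerably):

# Rough replay-memory estimate (illustrative numbers, not my actual config).
obs_bytes = 84 * 84 * 4              # one stacked uint8 Atari observation, ~28 KB
per_transition = 2 * obs_bytes       # obs + next_obs dominate; actions/rewards are negligible
buffer_size = 2_000_000              # example replay capacity across all replay actors
total_gb = buffer_size * per_transition / 1e9
print("approx. uncompressed replay memory: %.0f GB" % total_gb)  # ~113 GB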