ray: [rllib] Slowly running out of memory in eager + tracing
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): source
- Ray version: 0.6.5
- Python version: 3.6.8
- Exact command to reproduce: rllib train --run=APEX --env=BreakoutNoFrameskip-v4 --ray-object-store-memory 10000000000
Describe the problem
The Agent process slowly grows in memory until it runs out. This also happens with APPO (it takes ~10M steps with the above Atari command line, but with my own env, which has a larger observation space and itself consumes a lot of RAM, it happens faster).
Memory usage starts at around 32GB (out of 64GB) and then slowly grows to 64GB over ~10M steps until it crashes.
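For reference, a minimal sketch of roughly the same run through the Python API, logging the trainer's RSS each iteration so the growth is visible. The ApexAgent import path and constructor usage are assumptions for this Ray version, and psutil is assumed to be installed; this is a sketch, not the exact repro.

```python
# Minimal sketch: run APEX on Breakout via the Python API and print the
# driver/trainer RSS each iteration. ApexAgent's import path is an assumption
# for Ray 0.6.x; psutil is assumed to be installed.
import psutil
import ray
from ray.rllib.agents.dqn import ApexAgent

ray.init(object_store_memory=10_000_000_000)  # mirrors --ray-object-store-memory

agent = ApexAgent(env="BreakoutNoFrameskip-v4")
proc = psutil.Process()  # the process that calls agent.train()

for i in range(10000):
    result = agent.train()
    rss_gb = proc.memory_info().rss / 1e9
    print("iter %d  timesteps=%d  rss=%.1f GB"
          % (i, result["timesteps_total"], rss_gb))
```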
Source code / logs
2019-03-28 14:59:16,458 ERROR trial_runner.py:460 -- Error processing event.
Traceback (most recent call last):
File "/home/opher/ray_0.6.5/python/ray/tune/trial_runner.py", line 409, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/opher/ray_0.6.5/python/ray/tune/ray_trial_executor.py", line 314, in fetch_result
result = ray.get(trial_future[0])
File "/home/opher/ray_0.6.5/python/ray/worker.py", line 2316, in get
raise value
ray.exceptions.RayTaskError: ray_ApexAgent:train() (pid=44603, host=osrv)
File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 316, in train
raise e
File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 305, in train
result = Trainable.train(self)
File "/home/opher/ray_0.6.5/python/ray/tune/trainable.py", line 151, in train
result = self._train()
File "/home/opher/ray_0.6.5/python/ray/rllib/agents/dqn/dqn.py", line 261, in _train
self.optimizer.step()
File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 118, in step
sample_timesteps, train_timesteps = self._step()
File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 188, in _step
counts = ray.get([c[1][1] for c in completed])
ray.exceptions.RayTaskError: ray_PolicyEvaluator:sample_with_count() (pid=44621, host=osrv)
File "/home/opher/ray_0.6.5/python/ray/memory_monitor.py", line 77, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node osrv is used (64.09 / 67.46 GB). The top 5 memory consumers are:
PID MEM COMMAND
44603 34.89GB ray_ApexAgent:train()
44591 12.94GB ray_ReplayActor:add_batch()
44612 12.91GB ray_ReplayActor:add_batch()
44617 12.83GB ray_ReplayActor:add_batch()
44632 12.83GB ray_ReplayActor:add_batch()
In addition, ~10.46 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.
The above numbers can't be real, as I have only 64GB on my machine. They match what 'top' shows in the 'RES' column, but I think RES also includes the SHR memory (which was around 10GB for each of the above processes), so the actual numbers are probably ~24GB for the agent and ~3GB for each replay actor.
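To sanity-check the RES-vs-SHR interpretation, psutil can report both fields directly (on Linux, memory_info() exposes a `shared` field alongside `rss`). The PID below is simply the one from the error message and is purely illustrative:

```python
# Rough private-memory estimate for one of the processes in the log above.
# On Linux, psutil's memory_info() exposes both `rss` and `shared`; `top`'s
# RES corresponds to rss, which also counts shared object-store pages.
import psutil

pid = 44603  # PID taken from the error message above, purely illustrative
mem = psutil.Process(pid).memory_info()
rss_gb = mem.rss / 1e9
shared_gb = mem.shared / 1e9
print("rss=%.1f GB  shared=%.1f GB  approx. private=%.1f GB"
      % (rss_gb, shared_gb, rss_gb - shared_gb))
```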
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 22 (5 by maintainers)
That script runs correctly. Sorry for my carelessness: I calculated the memory consumption for my case and found that the batch size is too large, so consuming that much memory is reasonable. Insufficient memory is to blame.
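A back-of-envelope calculation like the one mentioned above can be scripted in a few lines; the numbers below are illustrative placeholders, not RLlib defaults:

```python
# Back-of-envelope memory estimate for a set of stored transitions (a train
# batch or a replay-buffer shard). All numbers are illustrative placeholders,
# not RLlib defaults.
import numpy as np

def transitions_gb(num_transitions, obs_shape, obs_dtype=np.uint8,
                   obs_copies=2):
    """Lower bound: each transition keeps `obs_copies` observations
    (obs and new_obs); actions, rewards and dones are ignored."""
    obs_bytes = int(np.prod(obs_shape)) * np.dtype(obs_dtype).itemsize
    return num_transitions * obs_copies * obs_bytes / 1e9

# e.g. a 50,000-transition train batch of stacked 84x84x4 uint8 frames:
print("%.1f GB" % transitions_gb(50_000, (84, 84, 4)))   # ~2.8 GB
# e.g. a 500,000-transition replay shard of the same observations:
print("%.1f GB" % transitions_gb(500_000, (84, 84, 4)))  # ~28.2 GB
```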