ray: [RLlib] Memory leaks during RLlib training.
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): OS: Docker on CentOS; Ray: 0.8.4; Python: 3.6
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Recently, we found that an RL model trained with RLlib depletes memory and eventually throws an OOM error. I then ran the following RLlib DQN job, and its memory usage grows over time:
rllib train --run=DQN --env=Breakout-v0 --config='{"output": "dqn_breakout_1M/", "output_max_file_size": 50000000,"num_workers":3}' --stop='{"timesteps_total": 1000000}'
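For a reproduction without the Atari dependency, a minimal sketch using the Python API might look like this (assuming the Ray 0.8.x `agents` API; `MockEnv` is a made-up placeholder environment):

```python
import gym
import numpy as np
import ray
from ray.rllib.agents.dqn import DQNTrainer  # Ray 0.8.x import path


class MockEnv(gym.Env):
    """Stand-in environment so the repro needs no Atari dependency."""

    def __init__(self, env_config=None):
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self.observation_space.sample()

    def step(self, action):
        self._steps += 1
        done = self._steps >= 200
        return self.observation_space.sample(), 1.0, done, {}


ray.init()
trainer = DQNTrainer(env=MockEnv, config={"num_workers": 3})
while True:
    result = trainer.train()
    print(result["timesteps_total"])  # watch process memory grow alongside this
    if result["timesteps_total"] >= 1_000_000:
        break
```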
Memory grows as time goes on:

I hope someone can help.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 6
- Comments: 29 (5 by maintainers)
[closing as stale]
I experience the same problem with APEX-DQN running in local mode with multiple workers. Memory usage rises linearly, and the experiments eventually fail with RayOutOfMemoryError.
I have tried setting buffer_size to a smaller value, but it did not stop the memory error. Even after some investigation in the docs, I could not figure out exactly what the number is supposed to mean (is it a count of samples or bytes?).
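For reference, this is roughly the shape of what I tried (values are illustrative; as far as I can tell, buffer_size is the replay capacity in timesteps rather than bytes, but I may be wrong about that):

```python
from ray import tune

# Shrink the APEX-DQN replay buffer. As far as I can tell, buffer_size is
# the replay capacity in timesteps (transitions), not bytes, so this caps
# the buffer at 50k stored transitions. Values here are illustrative.
tune.run(
    "APEX",
    config={
        "env": "Breakout-v0",   # placeholder env
        "num_workers": 4,
        "buffer_size": 50_000,
    },
    stop={"timesteps_total": 1_000_000},
)
```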
The traceback shows RolloutWorker occupying 56 of 64 GB. Feels like a memory leak to me.
Running on 0.8.5
@Mark2000 I restore a stopped tune run with the resume argument of tune.run (see the sketch below). You can also use resume="ERRORED_ONLY" instead if you need to restart failed trials only. In my experience the restored trial works well in some cases, but in others it behaves very differently from the original one; see for instance the following plot for a trial that was restored at ~600k steps, after which the reward curve displays a very different profile.
Using Ray 2.7.1 + torch 2.1.0+cu118 on an Ubuntu 22.04 system.
Supposedly this has been fixed via #15815. However, I still see memory leaks when running multi-worker training.
Oh, that might be because setting num_workers > 0 enables distributed mode. We’ve fixed some memory leaks in 0.8.5, so it’s worth upgrading to see whether that helps.
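As a sanity check, a single-process comparison run keeps sampling in the trainer process and shows whether the growth only appears in distributed mode. A sketch assuming the Ray 0.8.x API, with psutil used only to log the trainer's memory:

```python
import os

import psutil
import ray
from ray.rllib.agents.dqn import DQNTrainer  # Ray 0.8.x import path

ray.init()
# num_workers=0 keeps rollouts in the trainer process (no remote rollout
# workers), which helps isolate whether the growth comes from distributed mode.
trainer = DQNTrainer(env="Breakout-v0", config={"num_workers": 0})
for i in range(100):
    trainer.train()
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"iteration {i}: trainer RSS = {rss_gb:.2f} GB")
```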