ray: Memory leak when simple worker returns a numpy object
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 LTS x86_64
- Ray installed from (source or binary): binary (pip install ray)
- Ray version: 0.5.0
- Python version: Python 3.6.0 :: Continuum Analytics, Inc.
- Exact command to reproduce: python main.py
Describe the problem
The code below results in a memory leak. The ring buffer in the main process, which holds the numpy images, should take about 10GB. However, after the buffer cycles, memory usage keeps increasing. With the parameters in the script below, it grows to 16GB in about 500,000 steps.
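(For reference, the expected buffer size works out as follows: each state is 128 × 128 × 3 = 49,152 bytes, so a full buffer of 200,000 entries holds roughly 200,000 × 49,152 ≈ 9.8 GB, matching the ~10GB figure above.)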
I also reproduced this behavior on a server with 126GB RAM and 40 threads; there the system runs out of memory after a few tens of millions of steps (a common scale for RL tasks).
I first noticed this with workers returning Atari screens and tried to provide a minimal working example that reproduces the issue.
Edit: the script's own output will look fine; however, top or a similar tool will show the growing memory usage.
Source code / logs
import os
import psutil
import gc
import random

import numpy as np
import ray
import pyarrow
import lz4.frame
import base64


def pack(data):
    data = pyarrow.serialize(data).to_buffer().to_pybytes()
    data = lz4.frame.compress(data)
    return base64.b64encode(data)


def unpack(data):
    data = base64.b64decode(data)
    data = lz4.frame.decompress(data)
    return pyarrow.deserialize(data)


@ray.remote(num_cpus=1, num_gpus=0)
def sample(step_cnt):
    # some pseudo-random "images" that can still be compressed.
    state = np.full((128, 128, 3), step_cnt % 255, dtype=np.uint8)
    state[24:48, 64:96, :] = np.random.randint(0, 255, (24, 32, 3), np.uint8)

    # display process memory
    if step_cnt % 50000 == 0:
        process = psutil.Process(os.getpid())
        proc_mem = process.memory_info().rss / (1024 ** 2)
        print(f'actor_pid={process.pid} \t mem={proc_mem:6.1f} MB.')

    return pack(state)


def main():
    ray.init()

    actor_no = 10
    step_no = 5000000
    buff_sz = 200000

    # init buffer
    buffer = [None for i in range(buff_sz)]

    step_cnt = 0
    while step_cnt < step_no:
        states = ray.get([sample.remote(step_cnt) for i in range(actor_no)])
        for state in states:
            # add to buffer
            buffer[step_cnt % buff_sz] = unpack(state)
            step_cnt += 1

            # display process memory
            if step_cnt % 5000 == 0:
                print('-' * 30)
                print(f'steps={step_cnt} / buffer={len(buffer)}.')
                process = psutil.Process(os.getpid())
                proc_mem = process.memory_info().rss / (1024 ** 2)
                print(f'main_pid={process.pid} \t mem={proc_mem:6.1f} MB.')
                print('-' * 30)


if __name__ == '__main__':
    main()
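(Aside, in case it helps others narrow this down: below is a hypothetical, stripped-down variant of the loop that returns the numpy array directly and skips the pyarrow/lz4/base64 round-trip; it is not from the original report. The .copy() matters because ray.get on a numpy array returns a view backed by the object store's shared memory, and holding such views in the buffer would pin plasma objects.)

import numpy as np
import ray

ray.init()

@ray.remote
def sample_raw(step_cnt):
    # same pseudo-random "image", returned as a plain numpy array
    state = np.full((128, 128, 3), step_cnt % 255, dtype=np.uint8)
    state[24:48, 64:96, :] = np.random.randint(0, 255, (24, 32, 3), np.uint8)
    return state

buff_sz = 200000
buffer = [None] * buff_sz
for step_cnt in range(500000):
    # .copy() detaches the array from plasma's shared memory so the
    # buffer does not keep object-store entries alive
    buffer[step_cnt % buff_sz] = ray.get(sample_raw.remote(step_cnt)).copy()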
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 32 (10 by maintainers)
Great to hear that! The redis_max_memory and object_store_memory settings don't really interact. When you call ray.init(), several processes are started. The object store will fill up with objects (until some limit, and then it will evict the least recently used objects); this limit is what you're setting with object_store_memory. The Redis servers will also fill up with metadata (until some limit, and then they will evict the least recently used values); that's what you're setting with redis_max_memory.
In the current master, object_store_memory is capped at 20GB by default and redis_max_memory is capped at 10GB by default, though these defaults can be overridden. Note that the 10GB cap on redis_max_memory was merged yesterday and isn't in 0.6.1.
I'm closing this issue because I think the underlying issue has been addressed by #3615. However, if you run into more problems feel free to file a new issue (or reopen this one if it's the same issue).
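For reference, both limits can be set explicitly when starting Ray; a minimal sketch (the keyword names match what is described above, and the byte values are only illustrative):

import ray

ray.init(
    object_store_memory=5 * 1024 ** 3,  # cap the plasma object store at ~5GB
    redis_max_memory=1 * 1024 ** 3,     # cap Redis metadata at ~1GB
)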
Just to clarify, “Redis” and the “plasma store” are two separate things.
We use the plasma store to store data (e.g., objects returned by tasks).
We use Redis to store metadata, e.g., specifications of tasks (which are used to rerun tasks if necessary) and data about which objects live on which machines (we actually start multiple Redis servers).
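As a concrete illustration of that split (a minimal sketch, not from the original thread):

import numpy as np
import ray

ray.init()

# The array's bytes go into the plasma store; Redis only records
# metadata such as the object ID and which machine holds the object.
obj_id = ray.put(np.zeros((128, 128, 3), dtype=np.uint8))
arr = ray.get(obj_id)  # fetched back from the plasma store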
Did you happen to see which process was using up all of the memory? Also note that you can install the latest wheels (to avoid compilation) by following the instructions at http://ray.readthedocs.io/en/latest/installation.html#trying-the-latest-version-of-ray.
I see; for me, the %MEM in htop for the main python process starts at ~30% in steady state, creeps up to ~40% at most, then quickly drops back to ~30% and slowly rises toward 40% again, and this pattern repeats.
Are you seeing a similar pattern, or is it just going up for you? I'm on a 32GB machine.
The memory of the Redis server is increasing, though, which is expected. We have a way to deal with that, described at http://ray.readthedocs.io/en/latest/redis-memory-management.html, which is experimental at the moment.
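If memory serves, the experimental API on that page exposed flushing helpers along these lines; treat the function names as an assumption from the 0.5/0.6-era docs and check the linked page for your version:

import ray

ray.init()

# Periodically drop accumulated task and object metadata from Redis.
# "Unsafe" because flushed tasks can no longer be re-executed by Ray.
ray.experimental.flush_redis_unsafe()
ray.experimental.flush_task_and_object_metadata_unsafe()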