ray: [core] Possible memory leak in async actor implementation
@shrekris-anyscale and I have been trying to track down a memory leak that shows up across multiple Serve components. All of them are async actors, and the growth only happens while actor calls are being made, which is consistent with a leak related to async actor calls.
I ran a minimal script that sends batches of many async actor calls, and memory appears to grow consistently. This is running on nightly wheels, commit sha: 0b8fd1d6dd7e83656713b21bb408fa7f04244bfb.
Repro script (pinning to head node):
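import ray

@ray.remote
class A:
    async def hi(self):
        return "hi"

# Pin the actor to the head node via its node resource.
a = A.options(resources={"node:10.0.62.86": 0.1}).remote()

# Send batches of 10,000 actor calls and wait for each batch to finish.
while True:
    ray.get([a.hi.remote() for _ in range(10000)])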
Memory growth over 3 hours:
Link to the workspace where this is currently running (I plan to leave it going overnight, please don’t touch):
For reference, here is a run of the same script with async def replaced by a plain def:
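import ray

@ray.remote
class A:
    # Synchronous method: the only change from the async repro.
    def hi(self):
        return "hi"

# Pin the actor to the head node via its node resource.
a = A.options(resources={"node:10.0.62.86": 0.1}).remote()

# Send batches of 10,000 actor calls and wait for each batch to finish.
while True:
    ray.get([a.hi.remote() for _ in range(10000)])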
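If no dashboard is available, one rough way to put numbers on per-batch growth is to sample memory after each batch. The following is only a sketch under assumptions, not something from the original report: it uses the standard-library tracemalloc module on a plain asyncio stand-in for the actor method, and the names run_batch and measure_growth are made up for illustration. It tracks Python-level allocations in the current process only, so it would not see native allocations inside a Ray worker.

```python
import asyncio
import tracemalloc


async def hi():
    # Stand-in for the actor method in the repro; this is not a Ray call.
    return "hi"


async def run_batch(n):
    # Issue n concurrent calls and wait for all of them,
    # loosely mirroring one ray.get([...]) batch.
    return await asyncio.gather(*(hi() for _ in range(n)))


def measure_growth(batches=5, calls_per_batch=1000):
    """Return traced Python memory (in bytes) sampled after each batch."""
    samples = []
    tracemalloc.start()
    try:
        for _ in range(batches):
            asyncio.run(run_batch(calls_per_batch))
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


if __name__ == "__main__":
    for i, size in enumerate(measure_growth()):
        print(f"batch {i}: {size} bytes traced")
```

A sequence of samples that keeps rising batch after batch would point at a Python-level leak; a flat sequence alongside growing process RSS would suggest the growth is in native code instead.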
About this issue
- State: closed
- Created 10 months ago
- Reactions: 3
- Comments: 15 (15 by maintainers)
Reproduced: the async actor’s memory keeps going up.
Left is with jemalloc, right is without jemalloc
Ok, I believe we’ve narrowed this down to an issue related to max_concurrency / concurrency groups. Here’s the updated repro:
With this, there’s a clear leak in the Executor actors: