ray: [core] Idle workers from a completed job are not killed due to borrowed references
What happened + What you expected to happen
Run the following script. Once the job has finished, the worker processes are leaked because the task argument obj is still being referenced.
repro.py
import ray
import numpy as np
import tensorflow

def leak_repro(obj):
    # If import is moved here then the leak does not occur.
    # import tensorflow
    tensorflow
    return []

ds = ray.data.from_numpy(np.ones((100_000)))
ds.map(leak_repro)
IDLE processes will stay active.
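One way to confirm the leak is to look for processes whose title is exactly `ray::IDLE`, the name Ray gives to worker processes that are not running a task. The following is a minimal sketch and not part of the original report: `find_idle_ray_workers` and `list_processes` are hypothetical helper names, and `list_processes` assumes a POSIX `ps` is available on the node.

```python
import subprocess


def find_idle_ray_workers(processes):
    """Return the PIDs whose command line is exactly the idle-worker title.

    `processes` is an iterable of (pid, command_line) pairs.
    """
    return [pid for pid, cmd in processes if cmd.strip() == "ray::IDLE"]


def list_processes():
    """List (pid, command_line) pairs using POSIX `ps` (assumed to exist)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    pairs = []
    for line in out.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            pairs.append((int(parts[0]), parts[1]))
    return pairs
```

Calling `find_idle_ray_workers(list_processes())` after `python repro.py` has exited should return a non-empty list if the workers were leaked.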

Objects will stay pinned in memory.
--- Summary for node address: 172.31.20.214 ---
Mem Used by Objects  Local References  Pinned           Used by task  Captured in Objects  Actor Handles
241198.0 B           0, (0.0 B)        2, (241198.0 B)  0, (0.0 B)    0, (0.0 B)           0, (0.0 B)

--- Object references for node address: 172.31.20.214 ---
IP Address     PID   Type    Call Site                                                             Status  Size        Reference Type    Object Ref
172.31.20.214  2216  Worker  (deserialize task arg) ray.data._internal.compute._map_block_nosplit  -       120599.0 B  PINNED_IN_MEMORY  c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
172.31.20.214  2215  Worker  (deserialize task arg) ray.data._internal.compute._map_block_nosplit  -       120599.0 B  PINNED_IN_MEMORY  16310a0f0a45af5cffffffffffffffffffffffff0100000001000000
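To tally how much memory each leaked worker keeps pinned, the rows of the output above can be summed per PID. This is a rough sketch, not a Ray API: `pinned_bytes_by_pid` is a hypothetical helper, and it assumes each row has the IP in the first column, the PID in the second, and a size in bytes written as `<number> B` immediately before `PINNED_IN_MEMORY`.

```python
from collections import defaultdict


def pinned_bytes_by_pid(ray_memory_output):
    """Sum the sizes of PINNED_IN_MEMORY references per worker PID."""
    totals = defaultdict(float)
    for line in ray_memory_output.splitlines():
        tokens = line.split()
        if "PINNED_IN_MEMORY" not in tokens:
            continue
        i = tokens.index("PINNED_IN_MEMORY")
        # Assumed layout: <ip> <pid> ... <size> B PINNED_IN_MEMORY <object ref>
        if i < 4 or tokens[i - 1] != "B" or not tokens[1].isdigit():
            continue
        try:
            size = float(tokens[i - 2])
        except ValueError:
            continue  # Skip rows whose size column is not numeric.
        totals[int(tokens[1])] += size
    return dict(totals)


sample = """\
172.31.20.214 2216 Worker (deserialize task arg) - 120599.0 B PINNED_IN_MEMORY c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
172.31.20.214 2215 Worker (deserialize task arg) - 120599.0 B PINNED_IN_MEMORY 16310a0f0a45af5cffffffffffffffffffffffff0100000001000000
"""
per_pid = pinned_bytes_by_pid(sample)
```

For the two rows in this report, the per-PID values add up to the 241198.0 B shown in the Pinned column of the summary.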
Versions / Dependencies
latest ray
Reproduction script
Create repro.py and run python repro.py
Issue Severity
No response
About this issue
- State: closed
- Created 2 years ago
- Reactions: 6
- Comments: 30 (22 by maintainers)
The objects in question appear to be the task args, which may explain why they are owned by the worker and hence keep the worker from being freed.
IP Address     PID     Type    Call Site                                                                             Status  Size        Reference Type    Object Ref
172.31.198.64  147136  Worker  (deserialize task arg) ray.data._internal.execution.operators.map_operator._map_task  -       3485.0 B    PINNED_IN_MEMORY  00ffffffffffffffffffffffffffffffffffffff0100000002000000
172.31.198.64  147136  Worker  (deserialize task arg) ray.data._internal.execution.operators.map_operator._map_task  -       800468.0 B  PINNED_IN_MEMORY  c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
One edge case with killing the worker when the job dies is that an object it owns may still be in use by a detached actor, which outlives the job.
Furthermore, we should probably kill workers unconditionally if the job has finished, even if they contain owned objects.
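To make the trade-off in the two comments above concrete, here is a toy model of the proposed policy. It is only an illustration and not Ray's actual worker-pool logic: `Worker`, `borrowed_by_detached_actors`, and `should_kill` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    pid: int
    job_finished: bool
    owned_objects: int = 0
    # Hypothetical counter: owned objects still borrowed by detached actors,
    # which keep running after the job that created them has finished.
    borrowed_by_detached_actors: int = 0


def should_kill(worker, unconditional=True):
    """Decide whether an idle worker may be reclaimed.

    unconditional=True models "kill once the job has finished, even if the
    worker still owns objects". unconditional=False spares workers whose
    objects are still borrowed by a detached actor.
    """
    if not worker.job_finished:
        return False
    if unconditional:
        return True
    return worker.borrowed_by_detached_actors == 0
```

Under the unconditional policy, a worker whose job has finished is always reclaimed, at the cost of a detached actor losing access to objects that worker owned. The conservative variant avoids that loss but keeps such workers alive.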
Maybe upgrade your arrow version? Or, you can try this tweak:
Just a quick note, I was able to reproduce this with Ray 1.12.0, so this isn’t a (recent) regression.