ray: [core] Idle workers from a completed job are not killed due to borrowed references

What happened + What you expected to happen

Run the following script. Once the job finishes, the worker is leaked because the task argument obj is still being referenced.

repro.py

import ray
import numpy as np
import tensorflow

def leak_repro(obj):
    # If import is moved here then the leak does not occur.
    # import tensorflow
    tensorflow
    return []

ds = ray.data.from_numpy(np.ones((100_000)))
ds.map(leak_repro)
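The comment in the script says the leak goes away when the import moves inside the function. Here is a minimal sketch of what differs between the two forms, using the standard-library `json` module as a stand-in for `tensorflow` (an assumption, so the snippet runs without TensorFlow installed). With a module-level import the function body looks the module up in its globals, and a by-value function serializer such as cloudpickle has to record that reference along with the function. With a function-level import the name is a plain local. The helper `global_loads` is hypothetical and only inspects bytecode; it does not by itself explain the leak.

```python
import dis
import json  # stand-in for a heavy module such as tensorflow (assumption)


def module_level_import(obj):
    json  # looked up in the defining module's globals
    return []


def function_level_import(obj):
    import json  # bound as a local variable when the function runs
    json
    return []


def global_loads(fn):
    """Return the names that fn reads from module globals (hypothetical helper)."""
    return sorted(
        {ins.argval for ins in dis.get_instructions(fn) if ins.opname == "LOAD_GLOBAL"}
    )


print(global_loads(module_level_import))    # ['json']
print(global_loads(function_level_import))  # []
```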

IDLE processes will stay active.
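One way to confirm the leftover processes after `repro.py` exits is to scan the process list. The sketch below assumes idle workers appear with `ray::IDLE` in their command line, which is how they usually show up in `ps` output, and that a POSIX-style `ps` is available. Both helper names are hypothetical.

```python
import subprocess


def idle_ray_workers(ps_lines):
    """Return (pid, command) pairs for rows whose command contains 'ray::IDLE'."""
    found = []
    for line in ps_lines:
        parts = line.strip().split(None, 1)
        # Skip the header row and anything that is not "PID COMMAND".
        if len(parts) == 2 and parts[0].isdigit() and "ray::IDLE" in parts[1]:
            found.append((int(parts[0]), parts[1]))
    return found


def list_processes():
    """Return one 'PID COMMAND' row per process (assumes a POSIX-style ps)."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `idle_ray_workers(list_processes())` after the job has finished should return an empty list if no workers were leaked.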

Objects will stay pinned in memory.

--- Summary for node address: 172.31.20.214 ---
Mem Used by Objects  Local References  Pinned        Used by task   Captured in Objects  Actor Handles
241198.0 B           0, (0.0 B)        2, (241198.0 B)  0, (0.0 B)     0, (0.0 B)           0, (0.0 B)   

--- Object references for node address: 172.31.20.214 ---
IP Address       PID    Type    Call Site               Status          Size    Reference Type      Object Ref                                              
172.31.20.214    2216   Worker  (deserialize task arg)  -               120599.0 B  PINNED_IN_MEMORY    c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
                                 ray.data._internal.co                                                                                                      
                                mpute._map_block_nospl                                                                                                      
                                it                                                                                                                          

172.31.20.214    2215   Worker  (deserialize task arg)  -               120599.0 B  PINNED_IN_MEMORY    16310a0f0a45af5cffffffffffffffffffffffff0100000001000000
                                 ray.data._internal.co                                                                                                      
                                mpute._map_block_nospl                                                                                                      
                                it                                                                   
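As a cross-check of the report above, the per-object rows can be tallied by worker PID. This is a rough sketch that assumes the row layout shown here (IP, PID, type, call site, status, size in bytes, reference type, object ref). The regular expression and the function name `pinned_bytes_by_pid` are illustrative and are not part of Ray.

```python
import re
from collections import defaultdict

# Matches the first line of each object row; wrapped call-site lines are ignored.
ROW = re.compile(
    r"^(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+.*?\s"
    r"(?P<size>\d+(?:\.\d+)?) B\s+(?P<ref_type>[A-Z_]+)\s+(?P<ref>[0-9a-f]+)\s*$"
)


def pinned_bytes_by_pid(report):
    """Sum the sizes of PINNED_IN_MEMORY objects per worker PID."""
    totals = defaultdict(float)
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match and match["ref_type"] == "PINNED_IN_MEMORY":
            totals[int(match["pid"])] += float(match["size"])
    return dict(totals)


SAMPLE = """\
172.31.20.214    2216   Worker  (deserialize task arg)  -  120599.0 B  PINNED_IN_MEMORY    c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
                                 ray.data._internal.co
172.31.20.214    2215   Worker  (deserialize task arg)  -  120599.0 B  PINNED_IN_MEMORY    16310a0f0a45af5cffffffffffffffffffffffff0100000001000000
"""

totals = pinned_bytes_by_pid(SAMPLE)
print(totals)                # {2216: 120599.0, 2215: 120599.0}
print(sum(totals.values()))  # 241198.0
```

The total matches the 241198.0 B reported in the node summary, so both pinned objects are accounted for by the two idle workers.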

Versions / Dependencies

latest ray

Reproduction script

Create repro.py and run python repro.py

Issue Severity

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 6
  • Comments: 30 (22 by maintainers)

Most upvoted comments

The object in question appears to be the task arg, which may explain why it is owned by the worker and is therefore keeping the worker from being freed.

IP Address       PID     Type    Call Site               Status  Size        Reference Type      Object Ref
172.31.198.64    147136  Worker  (deserialize task arg)  -       3485.0 B    PINNED_IN_MEMORY    00ffffffffffffffffffffffffffffffffffffff0100000002000000
                                 ray.data._internal.execution.operators.map_operator._map_task

172.31.198.64    147136  Worker  (deserialize task arg)  -       800468.0 B  PINNED_IN_MEMORY    c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000
                                 ray.data._internal.execution.operators.map_operator._map_task

One edge case with killing the worker when the job dies is that the object may still be used by a detached actor.

Furthermore, we should probably kill workers unconditionally if the job has finished, even if they contain owned objects.
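The difference between the behavior described in this issue and the suggested change can be written down as a toy model. This is only an illustration under stated assumptions: `IdleWorker` and both predicates are hypothetical names and do not correspond to Ray's internal code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleWorker:
    """Toy stand-in for an idle worker process (hypothetical, not a Ray class)."""
    job_finished: bool
    pinned_objects: int  # objects the worker still holds in memory


def kill_if_no_references(worker):
    """Behavior as described in this issue: pinned objects keep the worker alive."""
    return worker.pinned_objects == 0


def kill_when_job_finished(worker):
    """Suggested change: a finished job is enough, regardless of pinned objects."""
    return worker.job_finished or worker.pinned_objects == 0


leaked = IdleWorker(job_finished=True, pinned_objects=2)
print(kill_if_no_references(leaked))   # False
print(kill_when_job_finished(leaked))  # True
```

The sketch does not model the detached-actor edge case raised above, where an object owned by the worker may still be needed after the job has finished.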

Maybe upgrade your Arrow version? Or you can try this tweak:


import ray
import numpy as np
import tensorflow

def leak_repro(obj):
    # If import is moved here then the leak does not occur.
    # import tensorflow
    tensorflow
    return []

ds = ray.data.range_tensor(100000, parallelism=1)
ds.map(leak_repro)

Just a quick note, I was able to reproduce this with Ray 1.12.0, so this isn’t a (recent) regression.