ray: [core] Multi-process(?) / GPU processes do not seem to be freed after ctrl+c on cluster
What happened + What you expected to happen
This has been discussed at the end of #30414, but it seems more appropriate to start a new issue because it is probably unrelated.
I run a cluster with a head node and worker nodes. The head node is started with a CLI ray start ... command, and the worker nodes also run a CLI ray start ... command and connect to the head. All the worker nodes have GPUs.
On the head node I then run a Tune script that requests a GPU. After ctrl+c, the GPU does not seem to be freed on the worker nodes, and IDLE or TRAIN processes remain that block the GPU memory. Only a full ray stop kills all the processes.
I also tested this in a much simpler setup (just a head node with a GPU): run the Tune script, press ctrl+c, and the memory on the GPU remains blocked and ray::TRAIN processes remain (see #30414).
EDIT: The issue was found to be related to num_workers>0 in the PyTorch DataLoader, which leaves extra Ray processes open after ctrl+c. Related: https://github.com/ray-project/ray_lightning/issues/87, https://github.com/pytorch/pytorch/issues/66482
EDIT 2: I could work around the issue by using 515.xxx NVIDIA drivers (but only on the main node); with 470.xxx drivers and/or on the worker nodes the issue seems to remain.
EDIT 3: The issue persists, regardless of driver version.
Versions / Dependencies
Ray 2.2.0, PyTorch 1.12.1, NVIDIA driver 470.xxx
Reproduction script
Something like

import os

from ray import tune
from ray.air.config import FailureConfig, RunConfig
from ray.tune import TuneConfig, Tuner

# args, results_dir, cfg, and the trainable function are defined elsewhere.
if args.resume:
    tuner = Tuner.restore(
        path=os.path.join(results_dir, "test")
    )
    tuner.fit()
else:
    new_trainable = tune.with_resources(trainable, resources={"cpu": 4, "gpu": 1})
    failure_config = FailureConfig(max_failures=-1)
    run_config = RunConfig(
        name="test",
        local_dir=results_dir,
        failure_config=failure_config,
        log_to_file=True,
    )
    tune_config = TuneConfig(
        num_samples=1,
        reuse_actors=False,
    )
    tuner = Tuner(new_trainable, run_config=run_config, tune_config=tune_config, param_space=cfg)
    tuner.fit()
Any Tune job that runs on a node previously started with ray start --num-cpus=4 --num-gpus=1.
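For context, a hedged sketch of what such a trainable could look like is below. This is an assumed example (the reporter's actual training code is not shown), so the dataset, model, and loop are placeholders; the key detail, per the EDIT above, is the DataLoader with num_workers>0, which spawns worker processes that reportedly survive ctrl+c.

import torch
from ray.air import session
from torch.utils.data import DataLoader, TensorDataset


def trainable(cfg):
    # Dummy data; the reporter's real dataset is unknown.
    dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))
    # num_workers > 0 forks DataLoader worker processes -- these are the extra
    # processes that reportedly stay alive after ctrl+c.
    loader = DataLoader(dataset, batch_size=32, num_workers=4)
    model = torch.nn.Linear(8, 1).cuda()
    optim = torch.optim.SGD(model.parameters(), lr=1e-3)
    for epoch in range(1000):
        for x, y in loader:
            loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
            optim.zero_grad()
            loss.backward()
            optim.step()
        session.report({"loss": float(loss)})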
Issue Severity
Medium: It is a significant difficulty but I can work around it.
About this issue
- State: closed
- Created a year ago
- Comments: 26 (11 by maintainers)
Commits related to this issue
- [CoreWorker] Partially address Ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (#33976) We kill all child processes when a Ray worker process exits. This a... — committed to ray-project/ray by cadedaniel a year ago
- [CoreWorker] Partially address Ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (#33976) (#34181) We kill all child processes when a Ray worker process exit... — committed to ray-project/ray by cadedaniel a year ago
Hi all, I have an update for this issue:
We merged a partial fix into master and expect it to make it out in Ray 2.4. On Linux, when the driver script is cancelled or exits normally, each Ray worker process will now kill its immediate child processes. Although we could not reproduce the Torch dataloader process leak described here, we believe this will fix the Torch issue and free the previously reserved GPU memory.
We have plans for a more holistic approach to handle cases where worker processes crash and leak processes, and where child processes cause leaks by spawning child processes of their own. Please reach out if you are experiencing these issues.
Follow the issues below for updates. Thanks!
atexit handlers are guaranteed to run if there's no segfault. https://github.com/ray-project/ray/issues/34124
I stopped a previous Ray Tune run by pressing Ctrl-C multiple times, and the next Tune run had an OOM; maybe that was the issue. Now I always run "ray stop" after cancelling experiments with SIGINT, and everything is fine.
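As an illustration of the atexit angle (this is an assumed user-side workaround sketch, not the fix that was merged into Ray): one could register an atexit handler inside the worker process that terminates any leftover child processes, such as DataLoader workers, using psutil. Note that atexit handlers do not run on SIGKILL or a segfault, which matches the caveat above.

import atexit
import os

import psutil  # assumed dependency for this sketch


def _kill_child_processes() -> None:
    # Terminate all descendants of this process (e.g. DataLoader workers)
    # so they do not outlive the process after an interrupt.
    me = psutil.Process(os.getpid())
    for child in me.children(recursive=True):
        try:
            child.terminate()
        except psutil.NoSuchProcess:
            pass


atexit.register(_kill_child_processes)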
This can be reproduced without GPUs, though in practice I think this is more noticeable when using GPUs because the GPU memory is held.
Minimal repro (see the sketch after this comment):
When executing the script, 4240 is the original Actor process and 4287 is the spawned process. After terminating the script with ctrl+C, the spawned process is still there.
Let's start by fixing the simple multiprocessing repro case?
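The minimal repro script itself is not shown above, so the following is only an assumed sketch of such a repro: a Ray actor spawns a plain multiprocessing child, and the question is whether that child survives after the driver is interrupted with ctrl+C.

import multiprocessing
import time

import ray


def _sleep_forever():
    while True:
        time.sleep(1)


@ray.remote
class Spawner:
    def spawn(self) -> int:
        # The actor process spawns a child; this mirrors what a DataLoader
        # with num_workers > 0 does inside a Tune trainable.
        proc = multiprocessing.Process(target=_sleep_forever)
        proc.start()
        return proc.pid


if __name__ == "__main__":
    ray.init()
    spawner = Spawner.remote()
    print("spawned child pid:", ray.get(spawner.spawn.remote()))
    # Press ctrl+C here, then check with ps whether the child pid remains.
    time.sleep(3600)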
I’ve spent some time with Ray + Torch dataloader and can’t reproduce the reported behavior. I think it’s possible for Ray Lightning + Lightning + Torch Dataloader to have the issue when Ray + Torch dataloader doesn’t, as the Ray Lightning integration overrides some cleanup logic in the default Lightning.
Things I’ve tried:
- … init; the aforementioned health check inside the dataloader worker processes kills them after a few seconds.
- … (None in their input queues), but I haven't seen any obvious case.
I will try an end-to-end example of DataLoader + Ray Tune + Ray Lightning + Lightning tomorrow. I have also been trying exclusively in Ray Jobs and should also try in Ray Client.
it’s interesting that pytorch’s dataloader subprocess does health check with its parent https://github.com/pytorch/pytorch/pull/6606 , so it suppose to terminate itself it the parent dies.
https://discuss.pytorch.org/t/when-i-shut-down-the-pytorch-program-by-kill-i-encountered-the-problem-with-the-gpu/6315/2
@cadedaniel the processes are in SNl state (from the prior thread).
Hi @thoglu, could you run a ps aux and report the state of the Ray processes when you experience the leak? Specifically, I am looking to see if any of the processes with GPU resources are stuck in D state. This indicates an NVIDIA kernel driver bug and makes the processes unkillable without force-unloading the driver first (IIRC; the last time I worked with this was 2021). @ericl @matthewdeng
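For convenience, an equivalent check can be scripted; this is a sketch assuming psutil is installed, and the plain ps aux described above works just as well.

import psutil

# List ray-related processes and flag any stuck in uninterruptible sleep
# ("D" state), which would point at the kernel/driver problem described above.
for proc in psutil.process_iter(["pid", "name", "status", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "ray" in (proc.info["name"] or "") or "ray::" in cmdline:
        status = proc.info["status"]
        marker = "  <-- D state" if status == psutil.STATUS_DISK_SLEEP else ""
        print(proc.info["pid"], proc.info["name"], status, marker)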
Ok, I could solve this issue by updating the NVIDIA driver to a newer version (515.85.01); the old driver was from the 470.xxx series. It seems that for some reason the driver affects how Tune, PyTorch, and potentially Lightning interact for num_workers>0. However, there might be a fix that works for older drivers as well? EDIT: It actually did not solve the issue; I just ran the job too briefly. After starting the dataloader with num_workers>0, the same issue appears with the new driver as well.
Yeah, I think there is an underlying process management bug here when those workers are forked. I'll keep the P1 tag. @scv119, is this something we can slot for 2.3-2.4?
@matthewdeng @ericl Indeed, it is related to num_workers>0. I changed num_workers to 0 and did not see the problem… I should have seen that earlier, many thanks for your quick help guys! So it is the same issue as https://github.com/ray-project/ray_lightning/issues/87. Is there any hope that this will get solved at all? The issue has been open for over a year already. It is not even a Lightning issue, but a "DataLoader in connection with Ray" issue, right? There must be other people seeing this already… I presume num_workers>0 is prevalent in many use cases; for myself it speeds up training significantly.
@thoglu could you share what your trainable definition looks like?
Hmm, no, this one does not have the leak. I will try a few things out tomorrow and work my way toward my own situation (it's too late here right now).