ray: [Train] TorchTrainer does not free all GPUs on shutdown

What happened + What you expected to happen

I have set up an experiment where I use a TorchTrainer (to enable DDP with eight GPUs) together with the ASHAv2 scheduler. Each trial is allocated all eight GPUs available on the node. The grace_period is 1, so each trial runs for just one epoch before it is preempted by another PENDING trial.
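Roughly, the setup looks like this (an illustrative sketch rather than my actual script; the built-in ASHAScheduler, the placeholder metric, and the search space just stand in for what I use):

from ray import tune
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import ASHAScheduler


def train_loop_per_worker(config):
    # ... build model/data, wrap with ray.train.torch.prepare_model(), train ...
    for _ in range(config.get("epochs", 10)):
        session.report({"val_loss": 0.0})  # placeholder metric


trainer = TorchTrainer(
    train_loop_per_worker,
    # One worker per GPU, so each trial grabs all eight GPUs on the node.
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)

tuner = tune.Tuner(
    trainer,
    # Placeholder search space.
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-4, 1e-1)}},
    tune_config=tune.TuneConfig(
        metric="val_loss",  # placeholder metric name
        mode="min",
        num_samples=20,     # placeholder
        scheduler=ASHAScheduler(time_attr="training_iteration", grace_period=1),
    ),
)
results = tuner.fit()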

After a few trials have run to the end of the first milestone, the trainer fails to clear the memory of one of the GPUs, which causes a CUDA out-of-memory error for the next trial. The error shows up at different points when I rerun the experiment: in one run the memory is cleared correctly for the first five trials but not for the sixth, and in another run the issue first occurs on the tenth trial.

To mitigate the issue, I added a wait_for_gpu() call at the beginning of my worker function. However, for the GPU whose memory is not freed, the worker prints the following lines until the program is terminated:

(RayTrainWorker pid=142525) 2023-02-21 17:55:54,271     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:55:59,372     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:04,477     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:09,581     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:14,683     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:19,782     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:24,882     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:29,985     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:35,090     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:40,193     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:45,294     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:50,398     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:56:55,501     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:57:00,607     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:57:05,711     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:57:10,815     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:57:15,918     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:57:21,020     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781
(RayTrainWorker pid=142525) 2023-02-21 17:57:26,121     INFO util.py:549 -- Waiting for GPU util to reach 0.01. Util: 0.781

The other seven GPUs don’t suffer from this issue.

I reran the experiment a few times, both with and without the wait_for_gpu() call, and experienced the same behavior every time.
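For reference, the wait_for_gpu() call is just the Ray Tune utility invoked at the top of the training function (a minimal sketch; train_loop_per_worker stands in for my actual worker function):

from ray.tune.utils import wait_for_gpu


def train_loop_per_worker(config):
    # Waits (retrying with a delay) until the GPU assigned to this worker
    # drops below the default target utilization of 0.01; raises if it
    # never frees up.
    wait_for_gpu()
    # ... rest of the training code ...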

Versions / Dependencies

Ray 2.2.0, Python 3.10.8, PyTorch 1.13.1, Ubuntu 22.04

Reproduction script

Will provide the script ASAP.

Issue Severity

High: It blocks me from completing my task.


Most upvoted comments

@justinvyu, sure!

from ray.air import session


def run_worker_helper(args, config):
    if not isinstance(config, dict):
        raise ValueError(
            f"Input 'config' is not a dict, received {type(config)}"
        )
    args.tune_config = config

    hyperopt_to_ray(config)
    checkpoint_to_args(config, args)

    # Local rank and local world size of this Ray Train worker on the node.
    rank = session.get_local_rank()
    world_size = session.get_local_world_size()
    run_worker(rank, world_size, args)

where run_worker() is a function I normally use for DDP in PyTorch. run_worker creates an object of a class that deals with the training loop, validation, testing, logging, checkpointing, etc. My code uses a fork of this repository; you can find the class I refer to here.

Here is the gist of run_worker:

import os

import torch.distributed as dist


def run_worker(rank, world_size, args):
    process_group_params = dict(rank=rank, world_size=world_size)
    app = ClassifierCompressorSampleApp(
        args,
        script_dir=os.path.dirname(__file__),
        process_group_params=process_group_params,
    )
    app.run_training_loop()
    if args.tune == "":
        # Only run the final test and tear down the process group when not
        # running under Tune.
        app.test()
        dist.destroy_process_group()
The team is looking into properly terminating subprocesses, but more investigation is needed to understand how to do so.

Though based on your original observations and the discussion in the other thread, I am wondering if there is a particular codepath in the trial pausing flow that is (sometimes) causing non-graceful termination. @Yard1 do you know? Something like what’s controlled by TUNE_FORCE_TRIAL_CLEANUP_S.
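For anyone who wants to experiment with that knob, here is a sketch of how the variable would be set; as far as I know it needs to be in the environment before the Tune run starts, and the timeout value here is arbitrary:

import os

# Ask Tune to forcefully clean up trials that have not shut down within
# 10 seconds of being stopped (arbitrary value for illustration).
os.environ["TUNE_FORCE_TRIAL_CLEANUP_S"] = "10"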

Ah okay I think that’s likely it - please try with num_workers=0 as I believe num_workers=1 will still end up launching 1 subprocess, which could run into the same issue.
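In case it helps, here is what that change would look like, assuming num_workers here refers to the PyTorch DataLoader workers (my assumption; the dataset and batch size are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(128, 3), torch.randint(0, 2, (128,)))

# num_workers=0 keeps data loading in the main process, so no extra
# subprocesses are spawned that would need to be terminated on shutdown.
loader = DataLoader(dataset, batch_size=32, num_workers=0)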