pytorch-lightning: DDP with Hydra multirun doesn't work when dirpath in checkpoint callback is specified

πŸ› Bug

Running DDP with Hydra multirun ends with a "Killed" error message when launching the second task:

Epoch 0    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/939 0:00:00 β€’ -:--:-- 0.00it/s [W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0    ━━━━━━━━━━━━━━━━ 939/939 0:00:13 β€’ 0:00:00 70.53it/s loss: 0.142 v_num:
[2022-01-03 15:21:38,513][src.train][INFO] - Starting testing!
[2022-01-03 15:21:38,514][pytorch_lightning.utilities.distributed][INFO] - Restoring states from the checkpoint path at /home/user/lightning-hydra-template/logs/multiruns/2022-01-03/15-21-17/0/checkpoints/epoch_000.ckpt
[2022-01-03 15:21:38,535][pytorch_lightning.accelerators.gpu][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[2022-01-03 15:21:41,523][HYDRA]        #1 : trainer.max_epochs=1 datamodule.batch_size=64 trainer.gpus=2 +trainer.strategy=ddp
Killed

I experience this ONLY when passing the dirpath parameter to the checkpoint callback:

ModelCheckpoint(dirpath="checkpoints/")

Tested with Lightning v1.5.7. I believe this issue wasn't present in one of the previous releases.

This probably has something to do with the way Hydra changes the working directory for each new run: the directory for storing checkpoints also gets changed. If I remember correctly, Lightning implemented a workaround at some point which made DDP possible despite that.
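A minimal sketch of one common workaround (not the Lightning-internal fix mentioned above): resolve the checkpoint dirpath against the launch directory instead of Hydra's per-run working directory, so every DDP process and the later trainer.test() agree on the same location. The helper name resolve_checkpoint_dir is hypothetical; with Hydra, the base directory would come from hydra.utils.get_original_cwd(), but it is passed explicitly here to keep the sketch dependency-free.

```python
from pathlib import Path

def resolve_checkpoint_dir(dirpath: str, original_cwd: str) -> str:
    """Resolve a (possibly relative) checkpoint dirpath against the
    launch directory rather than Hydra's per-run working directory,
    so all DDP processes and trainer.test() see the same path."""
    p = Path(dirpath)
    if p.is_absolute():
        return str(p)
    return str(Path(original_cwd) / p)

# With Hydra, the second argument would be hydra.utils.get_original_cwd().
print(resolve_checkpoint_dir("checkpoints/", "/home/user/project"))
# -> /home/user/project/checkpoints
```

The absolute path can then be passed as ModelCheckpoint(dirpath=...) so the value no longer depends on which directory Hydra switches into for a given job.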

cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

@tchaton Hey, here's a minimal example: https://github.com/ashleve/lit-hydra-ddp-multirun-bug Run multirun with python main.py -m x=0,1

I was not able to find an easy fix, but here's what I found:

  1. The process is killed only when using trainer.test(); I suspect the cause might be an incorrect ckpt path.
  2. The Hydra logging folder gets multiplied for each process in DDP. Here you can see 2 folders with names generated based on time: 16-27-40 and 16-27-42. Both of those were generated by a single multirun, but there should be only one main folder with multiple subfolders named by job number: 0, 1, 2… It seems like each DDP process causes Hydra to spawn an extra multirun.
  3. Not using the dirpath in the checkpoint callback makes trainer.test() execute without issues, but the multiple folders still remain.
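The duplicated folders in point 2 happen because each DDP worker re-executes the script, so any side effect like directory creation runs once per process. A common mitigation (a hedged sketch, not a fix for the underlying Hydra duplication) is to guard such side effects behind a rank check; Lightning sets the LOCAL_RANK environment variable for the worker processes it spawns, so an unset variable indicates the main process. The helper names below are hypothetical.

```python
import os

def is_local_rank_zero() -> bool:
    """True for the main process. Lightning sets LOCAL_RANK for the
    DDP workers it spawns, so an unset variable means rank 0."""
    return os.environ.get("LOCAL_RANK", "0") == "0"

def maybe_make_run_dir(path: str) -> bool:
    """Create side-effect directories only on the main process, to
    avoid each DDP worker producing its own copy. Returns True if
    this process created (or already owned) the directory."""
    if not is_local_rank_zero():
        return False
    os.makedirs(path, exist_ok=True)
    return True
```

This guards user-level side effects only; Hydra's own run directory is still created by each spawned process, which is the duplication described above.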

@carmocca Yes, it should fix this issue.

hi @ashleve - thanks for creating the minimal repro! that was really helpful.

Sounds like there are two issues here:

  1. Hydra changes the working dir, and as a result the checkpoint cannot be found.
  2. hydra.sweep.dir somehow gets created twice in DDP mode.

As for 1, in Hydra 1.2 (the version we are currently working on), we added an option to not change the current working dir. If you run your application with hydra.job.chdir=False, it should work. We've recently put out a dev release of Hydra 1.2. You can install it with pip install hydra-core --pre --upgrade in case you want to give that a try.
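For reference, the same option can be pinned in the app's config instead of passing it on every invocation. A sketch, assuming a standard Hydra 1.2+ primary config file (file name and surrounding config are illustrative):

```yaml
# config.yaml -- Hydra >= 1.2: keep the launch directory as the
# working directory, so relative paths like dirpath="checkpoints/"
# resolve the same way across multirun jobs and DDP processes.
hydra:
  job:
    chdir: false
```

Equivalently, as a one-off CLI override on the repro from above: python main.py -m x=0,1 hydra.job.chdir=False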

Hey @jgbos,

Great question. I think the simplest approach is to create a small test where the config file is not in the expected place, and see if you can recover from it.

Best, T.C