pytorch-lightning: DDP with Hydra multirun doesn't work when dirpath in checkpoint callback is specified
🐛 Bug
Running DDP with Hydra multirun ends up with a "Killed" error message when launching the second task:
Epoch 0 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/939 0:00:00 • -:--:-- 0.00it/s [W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0 ━━━━━━━━━━━━━━━━ 939/939 0:00:13 • 0:00:00 70.53it/s loss: 0.142 v_num:
[2022-01-03 15:21:38,513][src.train][INFO] - Starting testing!
[2022-01-03 15:21:38,514][pytorch_lightning.utilities.distributed][INFO] - Restoring states from the checkpoint path at /home/user/lightning-hydra-template/logs/multiruns/2022-01-03/15-21-17/0/checkpoints/epoch_000.ckpt
[2022-01-03 15:21:38,535][pytorch_lightning.accelerators.gpu][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[2022-01-03 15:21:41,523][HYDRA] #1 : trainer.max_epochs=1 datamodule.batch_size=64 trainer.gpus=2 +trainer.strategy=ddp
Killed
I experience this ONLY when passing the `dirpath` parameter to the checkpoint callback:
ModelCheckpoint(dirpath="checkpoints/")
Tested with lightning v1.5.7. I believe this issue wasn't present in one of the previous releases.
This probably has something to do with the way hydra changes the working directory for each new run, which also changes the directory the checkpoints are stored in. If I remember correctly, there was some workaround implemented in lightning which made DDP possible despite that.
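To illustrate the working-directory dependence, here is a minimal sketch (not the template's actual code; the resolution logic shown is an assumption) of why a relative `dirpath` is sensitive to the per-job directory that Hydra switches into, and one way to pin the path down before DDP spawns additional processes:

```python
# Minimal sketch: a relative dirpath resolves against os.getcwd(), which under
# Hydra multirun is a per-job directory such as
# logs/multiruns/2022-01-03/15-21-17/0 (job 0), .../1 (job 1), etc.
import os
from pytorch_lightning.callbacks import ModelCheckpoint

relative_dirpath = "checkpoints/"
print("checkpoints would be written under:", os.path.abspath(relative_dirpath))

# Resolving the path once, in the process that launches training, keeps every
# DDP worker pointed at the same directory. (This only illustrates the
# working-directory dependence; it is not a fix for the duplicated multirun
# folders described further down.)
checkpoint_callback = ModelCheckpoint(dirpath=os.path.abspath(relative_dirpath))
```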
cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta
About this issue
- State: open
- Created 2 years ago
- Comments: 16 (14 by maintainers)
@tchaton Hey, here's a minimal example: https://github.com/ashleve/lit-hydra-ddp-multirun-bug

Run multirun with `python main.py -m x=0,1`. I was not able to find an easy fix, but here's what I found:
- The "Killed" error happens during `trainer.test()`; I suspect the cause might be an incorrect ckpt path.
- A single multirun generates two timestamped folders, e.g. 16-27-40 and 16-27-42, but there should be only one main folder with multiple subfolders named by job number: 0, 1, 2… Seems like each DDP process causes hydra to spawn an extra multirun.
- Removing `dirpath` in the checkpoint callback makes `trainer.test()` execute without issues, but the multiple folders still remain.

@carmocca Yes, it should fix this issue.
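For readers who don't want to clone the linked repo, the repro entrypoint presumably looks roughly like the sketch below. This is a hedged reconstruction: the model, data, and config layout are assumptions, not the repository's actual contents.

```python
import hydra
import torch
import pytorch_lightning as pl
from omegaconf import DictConfig
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_loss", torch.nn.functional.cross_entropy(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


@hydra.main(config_path=None)
def main(cfg: DictConfig) -> None:
    # cfg only carries the swept dummy parameter; with this bare sketch a
    # comparable multirun would be launched as: python main.py -m +x=0,1
    loader = DataLoader(
        TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,))),
        batch_size=64,
    )
    trainer = pl.Trainer(
        max_epochs=1,
        gpus=2,
        strategy="ddp",
        # the relative dirpath that reportedly triggers the hang
        callbacks=[ModelCheckpoint(dirpath="checkpoints/")],
    )
    trainer.fit(ToyModel(), loader)
    trainer.test(ckpt_path="best", dataloaders=loader)


if __name__ == "__main__":
    main()
```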
Hi @ashleve, thanks for creating the minimal repro! That was really helpful.
Sounds like there are two issues here:
1. `hydra.sweep.dir` got created twice somehow in `ddp` mode.

As for 1, in Hydra 1.2 (the one we are currently working on), we added an option to not change the current working dir. If you run your application with `hydra.job.chdir=False`, it should work. We've recently put out a dev release of Hydra 1.2. You can install it with `pip install hydra-core --pre --upgrade` in case you want to give that a try.
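A side note (an assumption on my part, based on Hydra 1.2 exposing the per-job output directory at `hydra.runtime.output_dir`): with `hydra.job.chdir=False` a relative `dirpath` would resolve against the launch directory, so the checkpoint dir likely needs to be pointed at the job's output directory explicitly, e.g.:

```python
# Sketch assuming Hydra 1.2 with hydra.job.chdir=False: the process keeps the
# launch directory as cwd, and the per-job output dir is read from HydraConfig.
import os
from hydra.core.hydra_config import HydraConfig
from pytorch_lightning.callbacks import ModelCheckpoint

def build_checkpoint_callback() -> ModelCheckpoint:
    # Must be called inside a @hydra.main-decorated function; otherwise
    # HydraConfig.get() raises because no Hydra context is set.
    output_dir = HydraConfig.get().runtime.output_dir
    return ModelCheckpoint(dirpath=os.path.join(output_dir, "checkpoints"))
```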
Hey @jgbos,

Great question. I think the simplest is to create a simple test with the config file not in the right place and see if you can recover from it.
Best, T.C