pytorch-lightning: FileNotFoundError for best checkpoint when using DDP with Hydra
🐛 Bug
I am getting a FileNotFoundError when loading the best checkpoint in trainer.test() after trainer.fit() in DDP mode with Hydra.
My configuration file specifies that hydra.run.dir="/path/to/data/${now:%Y-%m-%d_%H-%M-%S}".
As a result, the first process (rank 0) spawns in “/path/to/data/datetime1” and creates the “ckpts” and “logs” folders there, while the second process (rank 1) spawns in “/path/to/data/datetime2”, where no “ckpts” or “logs” folders exist.
It appears that when calling trainer.test(), the program looks for “/path/to/data/datetime2/ckpts/best.ckpt”, which is indeed not there.
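To make the failure concrete, here is a minimal sketch of what each process ends up doing; the helper below only mimics Hydra's ${now:...} interpolation (it is not Hydra code) and the paths are hypothetical:

```python
from datetime import datetime

def hydra_run_dir() -> str:
    # Stand-in for hydra.run.dir="/path/to/data/${now:%Y-%m-%d_%H-%M-%S}":
    # each process resolves the timestamp when it starts.
    return "/path/to/data/" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# rank 0 starts first                 -> e.g. /path/to/data/2021-01-14_08-03-33
# rank 1 is spawned a few seconds later -> e.g. /path/to/data/2021-01-14_08-03-35
# Only rank 0 creates ckpts/ and logs/, so when rank 1 resolves the "best"
# checkpoint path inside its own directory, the file does not exist.
```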
Here is the error stack:
Epoch 4: val_acc reached 30.00000 (best 30.00000), saving model to /home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-33/ckpts/epoch=004-val_acc=30.000.ckpt as top 1
[lightning][INFO] - Epoch 4: val_acc reached 30.00000 (best 30.00000), saving model to /home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-33/ckpts/epoch=004-val_acc=30.000.ckpt as top 1
Epoch 4: 100%|███████████| 4/4 [00:00<00:00, 11.70it/s, loss=2.729, v_num=0, val_acc=30, best_val_acc=30Saving latest checkpoint...
[lightning][INFO] - Saving latest checkpoint...
Epoch 4: 100%|███████████| 4/4 [00:00<00:00, 11.62it/s, loss=2.729, v_num=0, val_acc=30, best_val_acc=30]
[__main__][CRITICAL] - [Errno 2] No such file or directory: '/home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-35/ckpts/epoch=004-val_acc=30.000.ckpt'
Traceback (most recent call last):
File "/home/azouaoui/github/PL-Hydra-template/train.py", line 41, in main
train(cfg)
File "/home/azouaoui/github/PL-Hydra-template/train.py", line 34, in train
logger.info(trainer.test())
File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 720, in test
results = self.__test_using_best_weights(ckpt_path, test_dataloaders)
File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 750, in __test_using_best_weights
ckpt = pl_load(ckpt_path, map_location=lambda storage, loc: storage)
File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/utilities/cloud_io.py", line 31, in load
with fs.open(path_or_url, "rb") as f:
File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/spec.py", line 936, in open
**kwargs
File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 117, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 199, in __init__
self._open()
File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 204, in _open
self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-35/ckpts/epoch=004-val_acc=30.000.ckpt'
Please reproduce using the BoringModel
The error is only triggered when using DDP with at least 2 GPUs, so I cannot reproduce it on Colab.
To Reproduce
Use this repository
Have at least 2 GPUs available.
$ git clone https://github.com/inzouzouwetrust/PL-Hydra-DDP-bug
$ cd PL-Hydra-DDP-bug && pip install -r requirements.txt
$ python bug_report_model.py
Expected behavior
I would expect the program to load the best checkpoint from the subfolder created by the first process (rank 0).
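One possible workaround sketch (not taken from this thread): have rank 0 broadcast its Hydra run directory so every rank resolves the same checkpoint folder. The shared_run_dir helper is hypothetical, and torch.distributed.broadcast_object_list requires a newer torch than the 1.7.0 listed below:

```python
import os
import torch.distributed as dist

def shared_run_dir() -> str:
    # Rank 0's working directory is the Hydra run dir that actually contains
    # ckpts/ and logs/; broadcast it so every rank agrees on the same path.
    obj = [os.getcwd()]
    dist.broadcast_object_list(obj, src=0)  # needs torch >= 1.8
    return obj[0]

# Every rank can then point at rank 0's checkpoint folder, for example:
# ckpt_dir = os.path.join(shared_run_dir(), "ckpts")
```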
Environment
* CUDA:
- GPU:
- GeForce GTX TITAN X
- GeForce GTX TITAN X
- available: True
- version: 10.2
* Packages:
- numpy: 1.19.4
- pyTorch_debug: True
- pyTorch_version: 1.7.0
- pytorch-lightning: 1.0.5
- tqdm: 4.54.1
* System:
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.9
- version: #219-Ubuntu SMP Tue Aug 11 12:26:50 UTC 2020
Additional context
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (14 by maintainers)
I have the same error when using DDP on a two-GPU server.
It seems that the file does not exist when I try to load it; in fact, it is created just after the program crashes.
I don't know whether this is due to Hydra or not. However, I have been debugging the code, and the problematic part is the _load function inside cloud_io.py: when one subprocess reaches it, the file doesn't exist yet because another process is still inside the function writing it. How can I wait for the checkpoint to be written before trying to load the weights?

I am proposing to do it when spawning the DDP processes. Basically, at that point os.getcwd() should point to the actual output directory generated by Hydra. I will probably provide an API to access the generated output directory in the future, but it's not there yet and os.getcwd() is pretty close.
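A minimal sketch of what that proposal could look like, assuming the child processes are relaunched with sys.argv plus an extra Hydra override; spawn_ddp_child is illustrative, not Lightning's actual DDP launcher:

```python
import os
import subprocess
import sys

def spawn_ddp_child(local_rank: int) -> subprocess.Popen:
    # Illustrative only, not Lightning's actual launcher.
    env = os.environ.copy()
    env["LOCAL_RANK"] = str(local_rank)
    command = [sys.executable] + sys.argv
    # At spawn time, os.getcwd() in the parent is the Hydra-generated output
    # directory (e.g. .../2021-01-14_08-03-33). Forwarding it as an override
    # stops the child from re-resolving ${now:...} into a fresh folder.
    command.append(f"hydra.run.dir={os.getcwd()}")
    return subprocess.Popen(command, env=env)
```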
Side issues: You can also disable the .hydra directory and the logging configuration for the child processes. See https://github.com/facebookresearch/hydra/issues/910 for workarounds.
I have also run into the issue @awaelchli mentioned, independently of Hydra launching. Thanks for the fix! I'll try pulling it down to see if it makes a difference in this context 🙂.
Hey @inzouzouwetrust, could you reproduce via the bug_report_model I shared with you and paste it here? It will help me debug.
EDIT: it's in the main issue, I missed it… I assume it needs the configs to be specified in the repo as well!