pytorch-lightning: FileNotFoundError for best checkpoint when using DDP with Hydra

🐛 Bug

I am getting a FileNotFoundError when loading the best checkpoint in trainer.test() after trainer.fit() in DDP mode with Hydra.

My configuration file specifies hydra.run.dir="/path/to/data/${now:%Y-%m-%d_%H-%M-%S}". Because each DDP process resolves the timestamp independently, the first process (rank 0) starts in “/path/to/data/datetime1” and creates the “ckpts” and “logs” folders there, while the second process (rank 1) starts in “/path/to/data/datetime2” and cannot see those folders. When trainer.test() is called, the program looks for “/path/to/data/datetime2/ckpts/best.ckpt”, which indeed does not exist.
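To illustrate (a hypothetical sketch, not Lightning or Hydra code): each DDP process resolves the ${now:...} interpolation at its own start time, so two ranks launched a couple of seconds apart end up in different run directories.

from datetime import datetime

# Each process evaluates the timestamp pattern when it starts, so ranks
# launched a few seconds apart resolve different run directories.
run_dir = f"/path/to/data/{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
# rank 0 starts at 08:03:33 -> /path/to/data/2021-01-14_08-03-33
# rank 1 starts at 08:03:35 -> /path/to/data/2021-01-14_08-03-35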

Here is the error stack:

Epoch 4: val_acc reached 30.00000 (best 30.00000), saving model to /home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-33/ckpts/epoch=004-val_acc=30.000.ckpt as top 1
[lightning][INFO] - Epoch 4: val_acc reached 30.00000 (best 30.00000), saving model to /home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-33/ckpts/epoch=004-val_acc=30.000.ckpt as top 1
Epoch 4: 100%|███████████| 4/4 [00:00<00:00, 11.70it/s, loss=2.729, v_num=0, val_acc=30, best_val_acc=30Saving latest checkpoint...                                                                               
[lightning][INFO] - Saving latest checkpoint...
Epoch 4: 100%|███████████| 4/4 [00:00<00:00, 11.62it/s, loss=2.729, v_num=0, val_acc=30, best_val_acc=30]
[__main__][CRITICAL] - [Errno 2] No such file or directory: '/home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-35/ckpts/epoch=004-val_acc=30.000.ckpt'
Traceback (most recent call last):
  File "/home/azouaoui/github/PL-Hydra-template/train.py", line 41, in main
    train(cfg)
  File "/home/azouaoui/github/PL-Hydra-template/train.py", line 34, in train
    logger.info(trainer.test())
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 720, in test
    results = self.__test_using_best_weights(ckpt_path, test_dataloaders)
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 750, in __test_using_best_weights
    ckpt = pl_load(ckpt_path, map_location=lambda storage, loc: storage)
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/utilities/cloud_io.py", line 31, in load
    with fs.open(path_or_url, "rb") as f:
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/spec.py", line 936, in open
    **kwargs
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 117, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 199, in __init__
    self._open()
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 204, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-35/ckpts/epoch=004-val_acc=30.000.ckpt'

Please reproduce using the BoringModel

The error is only triggered when using DDP with at least 2 GPUs, so I cannot reproduce it on Colab.

To Reproduce

Use this repository

Have at least 2 GPUs available.

$ git clone https://github.com/inzouzouwetrust/PL-Hydra-DDP-bug
$ cd PL-Hydra-DDP-bug && pip install -r requirements.txt
$ python bug_report_model.py

Expected behavior

I would expect the program to load the best checkpoint from the run directory created by the first process (rank 0).
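In the meantime, a possible workaround (a standard Hydra command-line override; the path below is just an example) is to pin the run directory when launching, so the timestamp is never resolved per process:

$ python bug_report_model.py hydra.run.dir=/path/to/data/my_fixed_run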

Environment

* CUDA:
        - GPU:
                - GeForce GTX TITAN X
                - GeForce GTX TITAN X
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.4
        - pyTorch_debug:     True
        - pyTorch_version:   1.7.0
        - pytorch-lightning: 1.0.5
        - tqdm:              4.54.1
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         x86_64
        - python:            3.7.9
        - version:           #219-Ubuntu SMP Tue Aug 11 12:26:50 UTC 2020

Additional context

  • For further details, please take a look at my recent chat with Hydra main author on Zulip.
  • Take a look at this PL forums topic.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

I have the same error. I’m using DDP on a two-GPU server and I get the following error:

Traceback (most recent call last):
  File "/equilibrium/evivoli/asmara/src/models/train_model.py", line 125, in <module>
    train()
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/equilibrium/evivoli/asmara/src/models/train_model.py", line 119, in train
    model = model.load_from_checkpoint(best_checkpoint_path)
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 139, in load_from_checkpoint
    return _load_from_checkpoint(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 160, in _load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=map_location)
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/lightning_fabric/utilities/cloud_io.py", line 47, in _load
    with fs.open(path_or_url, "rb") as f:
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/fsspec/spec.py", line 1135, in open
    f = self._open(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/fsspec/implementations/local.py", line 183, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/fsspec/implementations/local.py", line 285, in __init__
    self._open()
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/fsspec/implementations/local.py", line 290, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/equilibrium/evivoli/asmara/.checkpoints/unet/multi/0-holograms/epoch=00-val_loss=2.66-val_acc=0.08.ckpt'

It seems that the file does not exist when I try to load it; in fact, it is created just after the program crashes.

I don’t know if it is due to Hydra or not. While debugging, I narrowed the problem down to the following code:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.strategies import DDPStrategy

checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    dirpath=f'{BASEPATH}/.checkpoints/{cfg.model.name}/{cfg.data.task}/{cfg.seed}-{cfg.data.dataset}/',
    filename='{epoch:02d}-{val_loss:.2f}-{val_acc:.2f}',
    save_top_k=3,
    mode='min',
)

# populate the Trainer with the config values
trainer = pl.Trainer(
    **cfg.trainer,
    # use DDPStrategy(find_unused_parameters=False) only when the config
    # asks for strategy 'ddp' with find_unused_parameters disabled
    strategy=DDPStrategy(find_unused_parameters=False)
        if cfg.custom_trainer.strategy == 'ddp'
            and cfg.custom_trainer.find_unused_parameters == False
        else 'ddp',
    logger=wandb_logger,
    callbacks=[early_stop_callback, checkpoint_callback],
)

trainer.fit(model, train_loader, val_loader)

# Load the best checkpoint
best_checkpoint_path = trainer.checkpoint_callback.best_model_path
print(f"Loading best checkpoint from {best_checkpoint_path}")
model = model.load_from_checkpoint(best_checkpoint_path)

While debugging, I can see that when one subprocess reaches the _load function inside cloud_io.py, the file does not exist yet because another process is still writing it. How can I wait for the checkpoint to be fully written before trying to load the weights?
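One way to do that (a sketch, not an official recipe; trainer.strategy.barrier() exists in recent Lightning versions and delegates to torch.distributed.barrier() under DDP) is to synchronize all ranks between fit() and the load:

trainer.fit(model, train_loader, val_loader)

# Block every rank here until all of them have left fit(), so the rank
# that writes the best checkpoint has finished before anyone opens it.
trainer.strategy.barrier()

best_checkpoint_path = trainer.checkpoint_callback.best_model_path
model = model.load_from_checkpoint(best_checkpoint_path)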

I don’t like having to force users to remove %s from their run dir, because we should strive to have Hydra work with little friction. @omry I don’t think manually appending hydra.run.dir=os.getcwd() would fix the issue, as this would override the user’s specified run directory, right?

I am proposing to do it when spawning the DDP processes. Basically, at that point os.getcwd() should point to the actual output directory generated by Hydra. I will probably provide an API to access the generated output directory in the future, but it’s not there yet, and os.getcwd() is pretty close.

  • User starts the app.
  • Hydra generates the output dir and chdirs into it.
  • The user function runs with the output directory as the cwd.
  • The user calls PL to spawn DDP; the cwd is still the Hydra output dir.
  • PL appends hydra.run.dir=os.getcwd() to the child processes’ command line, ensuring they share the same working directory as the parent process (a sketch follows below).
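A minimal sketch of that proposal (hypothetical code, not Lightning’s actual launcher; spawn_ddp_child and the command layout are assumptions for illustration):

import os
import subprocess
import sys

def spawn_ddp_child(env):
    # Re-launch the training script for a non-zero rank, pinning the Hydra
    # run dir to the parent's cwd (which Hydra has already chdir'd into),
    # so every rank reads and writes the same output directory.
    command = [sys.executable, sys.argv[0], *sys.argv[1:]]
    command.append(f"hydra.run.dir={os.getcwd()}")
    return subprocess.Popen(command, env=env)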

Side issue: you can also disable the .hydra directory and the logging configuration for the child processes. See https://github.com/facebookresearch/hydra/issues/910 for workarounds.
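For example, something like the following overrides (availability of the disabled logging configs depends on your Hydra version):

$ python train.py hydra.output_subdir=null hydra/job_logging=disabled hydra/hydra_logging=disabled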

I have also run into the issue @awaelchli mentioned, independently of Hydra launching. Thanks for the fix! I’ll try pulling it down to see if it makes a difference in this context 🙂.

Hey @inzouzouwetrust, could you reproduce via the bug_report_model I shared with you and paste it here? It will help me debug.

EDIT: it’s in the main issue, I missed it… I assume it needs the configs to be specified in the repo as well!