pytorch-lightning: FileNotFoundError for best checkpoint when using DDP with Hydra

🐛 Bug

I am getting a FileNotFoundError when loading the best checkpoint in trainer.test() after trainer.fit() in DDP mode with Hydra.

My configuration file specifies hydra.run.dir="/path/to/data/${now:%Y-%m-%d_%H-%M-%S}". Because each DDP process resolves the timestamp independently, the first process (rank 0) starts in “/path/to/data/datetime1” and creates the “ckpts” and “logs” folders there, while the second process (rank 1) starts in “/path/to/data/datetime2” and cannot see those folders. When trainer.test() is called, the program looks for “/path/to/data/datetime2/ckpts/best.ckpt”, which indeed does not exist.
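To illustrate (a hypothetical sketch, not Lightning or Hydra code): each DDP process resolves the ${now:...} interpolation at its own start time, so two ranks launched a couple of seconds apart end up in different run directories.

from datetime import datetime

# Each process evaluates the timestamp pattern when it starts, so ranks
# launched a few seconds apart resolve different run directories.
run_dir = f"/path/to/data/{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
# rank 0 starts at 08:03:33 -> /path/to/data/2021-01-14_08-03-33
# rank 1 starts at 08:03:35 -> /path/to/data/2021-01-14_08-03-35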

Here is the error stack:

Epoch 4: val_acc reached 30.00000 (best 30.00000), saving model to /home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-33/ckpts/epoch=004-val_acc=30.000.ckpt as top 1
[lightning][INFO] - Epoch 4: val_acc reached 30.00000 (best 30.00000), saving model to /home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-33/ckpts/epoch=004-val_acc=30.000.ckpt as top 1
Epoch 4: 100%|███████████| 4/4 [00:00<00:00, 11.70it/s, loss=2.729, v_num=0, val_acc=30, best_val_acc=30Saving latest checkpoint...                                                                               
[lightning][INFO] - Saving latest checkpoint...
Epoch 4: 100%|███████████| 4/4 [00:00<00:00, 11.62it/s, loss=2.729, v_num=0, val_acc=30, best_val_acc=30]
[__main__][CRITICAL] - [Errno 2] No such file or directory: '/home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-35/ckpts/epoch=004-val_acc=30.000.ckpt'
Traceback (most recent call last):
  File "/home/azouaoui/github/PL-Hydra-template/train.py", line 41, in main
    train(cfg)
  File "/home/azouaoui/github/PL-Hydra-template/train.py", line 34, in train
    logger.info(trainer.test())
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 720, in test
    results = self.__test_using_best_weights(ckpt_path, test_dataloaders)
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 750, in __test_using_best_weights
    ckpt = pl_load(ckpt_path, map_location=lambda storage, loc: storage)
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/pytorch_lightning/utilities/cloud_io.py", line 31, in load
    with fs.open(path_or_url, "rb") as f:
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/spec.py", line 936, in open
    **kwargs
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 117, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 199, in __init__
    self._open()
  File "/scratch/artemis/azouaoui/miniconda3/envs/jz/lib/python3.7/site-packages/fsspec/implementations/local.py", line 204, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/azouaoui/github/PL-Hydra-template/data/runs/2021-01-14_08-03-35/ckpts/epoch=004-val_acc=30.000.ckpt'

Please reproduce using the BoringModel

The error is only triggered when using DDP with at least 2 GPUs, so I cannot reproduce it on Colab.

To Reproduce

Use this repository

Have at least 2 GPUs available.

$ git clone https://github.com/inzouzouwetrust/PL-Hydra-DDP-bug
$ cd PL-Hydra-DDP-bug && pip install -r requirements.txt
$ python bug_report_model.py

Expected behavior

I would expect the program to load the best checkpoint from the run directory created by the first process (rank 0).
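In the meantime, a possible workaround (a standard Hydra command-line override; the path below is just an example) is to pin the run directory when launching, so the timestamp is never resolved per process:

$ python bug_report_model.py hydra.run.dir=/path/to/data/my_fixed_run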

Environment

* CUDA:
        - GPU:
                - GeForce GTX TITAN X
                - GeForce GTX TITAN X
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.4
        - pyTorch_debug:     True
        - pyTorch_version:   1.7.0
        - pytorch-lightning: 1.0.5
        - tqdm:              4.54.1
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         x86_64
        - python:            3.7.9
        - version:           #219-Ubuntu SMP Tue Aug 11 12:26:50 UTC 2020

Additional context

  • For further details, please take a look at my recent chat with Hydra main author on Zulip.
  • Take a look at this PL forums topic.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

I have the same error. I’m using DDP on a two-GPU server and I get the following error:

Traceback (most recent call last):
  File "/equilibrium/evivoli/asmara/src/models/train_model.py", line 125, in <module>
    train()
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/equilibrium/evivoli/asmara/src/models/train_model.py", line 119, in train
    model = model.load_from_checkpoint(best_checkpoint_path)
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 139, in load_from_checkpoint
    return _load_from_checkpoint(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 160, in _load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=map_location)
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/lightning_fabric/utilities/cloud_io.py", line 47, in _load
    with fs.open(path_or_url, "rb") as f:
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/fsspec/spec.py", line 1135, in open
    f = self._open(
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/fsspec/implementations/local.py", line 183, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/fsspec/implementations/local.py", line 285, in __init__
    self._open()
  File "/home/evivoli/miniconda3/envs/new-env/lib/python3.9/site-packages/fsspec/implementations/local.py", line 290, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/equilibrium/evivoli/asmara/.checkpoints/unet/multi/0-holograms/epoch=00-val_loss=2.66-val_acc=0.08.ckpt'

It seems that the file does not exist when I try to load it; in fact, it is created just after the program crashes.

I don’t know if it is due to Hydra or not. While debugging, I narrowed the problem down to the following code:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.strategies import DDPStrategy

checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    dirpath=f'{BASEPATH}/.checkpoints/{cfg.model.name}/{cfg.data.task}/{cfg.seed}-{cfg.data.dataset}/',
    filename='{epoch:02d}-{val_loss:.2f}-{val_acc:.2f}',
    save_top_k=3,
    mode='min',
)

# populate the Trainer with the config values
trainer = pl.Trainer(
    **cfg.trainer,
    # use DDPStrategy(find_unused_parameters=False) only when the config
    # asks for strategy 'ddp' with find_unused_parameters disabled
    strategy=DDPStrategy(find_unused_parameters=False)
        if cfg.custom_trainer.strategy == 'ddp'
            and cfg.custom_trainer.find_unused_parameters == False
        else 'ddp',
    logger=wandb_logger,
    callbacks=[early_stop_callback, checkpoint_callback],
)

trainer.fit(model, train_loader, val_loader)

# Load the best checkpoint
best_checkpoint_path = trainer.checkpoint_callback.best_model_path
print(f"Loading best checkpoint from {best_checkpoint_path}")
model = model.load_from_checkpoint(best_checkpoint_path)

While debugging, I can see that when one subprocess reaches the _load function inside cloud_io.py, the file does not exist yet because another process is still writing it. How can I wait for the checkpoint to be fully written before trying to load the weights?
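One way to do that (a sketch, not an official recipe; trainer.strategy.barrier() exists in recent Lightning versions and delegates to torch.distributed.barrier() under DDP) is to synchronize all ranks between fit() and the load:

trainer.fit(model, train_loader, val_loader)

# Block every rank here until all of them have left fit(), so the rank
# that writes the best checkpoint has finished before anyone opens it.
trainer.strategy.barrier()

best_checkpoint_path = trainer.checkpoint_callback.best_model_path
model = model.load_from_checkpoint(best_checkpoint_path)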

I don’t like having to force users to remove %s from their run dir, because we should strive to have Hydra work with little friction. @omry I don’t think manually appending hydra.run.dir=os.getcwd() would fix the issue, as this would override the user’s specified run directory, right?

I am proposing to do it when spawning the DDP processes. Basically, at that point os.getcwd() should point to the actual output directory generated by Hydra. I will probably provide an API to access the generated output directory in the future, but it’s not there yet, and os.getcwd() is pretty close.

  • User starts the app.
  • Hydra generates the output dir and chdirs into it.
  • The user function runs with the output directory as the cwd.
  • The user calls PL to spawn DDP; the cwd is still the Hydra output dir.
  • PL appends hydra.run.dir=os.getcwd() to the child processes’ command line, ensuring they share the same working directory as the parent process (a sketch follows below).
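A minimal sketch of that proposal (hypothetical code, not Lightning’s actual launcher; spawn_ddp_child and the command layout are assumptions for illustration):

import os
import subprocess
import sys

def spawn_ddp_child(env):
    # Re-launch the training script for a non-zero rank, pinning the Hydra
    # run dir to the parent's cwd (which Hydra has already chdir'd into),
    # so every rank reads and writes the same output directory.
    command = [sys.executable, sys.argv[0], *sys.argv[1:]]
    command.append(f"hydra.run.dir={os.getcwd()}")
    return subprocess.Popen(command, env=env)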

Side issue: you can also disable the .hydra directory and the logging configuration for the child processes. See https://github.com/facebookresearch/hydra/issues/910 for workarounds.
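For example, something like the following overrides (availability of the disabled logging configs depends on your Hydra version):

$ python train.py hydra.output_subdir=null hydra/job_logging=disabled hydra/hydra_logging=disabled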

I have also run into the issue @awaelchli mentioned, independently of Hydra launching. Thanks for the fix! I’ll try pulling it down to see if it makes a difference in this context 🙂.

Hey @inzouzouwetrust, could you reproduce via the bug_report_model I shared with you and paste it here? It will help me debug.

EDIT: it’s in the main issue, I missed it… I assume it needs the configs to be specified in the repo as well!