wandb: Logging error on Hydra multiruns using PyTorch Lightning

When I do a multirun with Hydra (with my model using PyTorch Lightning), the program always crashes at the end of the second run. The problem seems to be wandb closing a logger prematurely: if I remove the WandbLogger, the problem disappears.

I don’t have a simple reproducible example (though I can provide more info if necessary), but my main file is very simple:

from omegaconf import DictConfig, OmegaConf, open_dict
import hydra

from classification.data import Data
from classification.model import Classifier
from classification.utils import get_trainer

@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    
    data = Data(batch_size=hparams.batch_size)

    data.prepare_data()
    data.setup("fit")
    
    # the Hydra config is in struct mode, so open_dict is needed to add a new key
    with open_dict(hparams):
        hparams.len_train = len(data.train_dataloader())

    model = Classifier(hparams)
    
    trainer = get_trainer(hparams)    
    
    trainer.fit(model, data)


if __name__ == "__main__":
    run_train()
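
(For context: get_trainer, which is not shown here, is presumably where the pl.Trainer and the WandbLogger are constructed. A hypothetical sketch of such a helper, with assumed names and arguments rather than the actual code, might look like this:)

# Hypothetical sketch of a get_trainer helper; the real implementation is not
# part of this report, and the project name and config keys are assumptions.
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

def get_trainer(hparams):
    # WandbLogger wraps wandb.init(), so this is where wandb enters each run
    logger = WandbLogger(project="classification")
    return pl.Trainer(
        logger=logger,
        max_epochs=hparams.max_epochs,  # assumed config key
        gpus=1,
    )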

Here’s the error log:

--- Logging error ---
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
  stream.write(msg + self.terminator)
File "/opt/conda/lib/python3.8/site-packages/wandb/lib/redirect.py", line 23, in write
  self.stream.write(data)
ValueError: I/O operation on closed file.
Call stack:
File "hydra_run.py", line 31, in <module>
  run_train()
File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
  _run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 354, in _run_hydra
  run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
  return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
  lambda: hydra.multirun(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
  return sweeper.sweep(arguments=task_overrides)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
  results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 76, in launch
  ret = run_job(
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 125, in run_job
  ret.return_value = task_function(task_cfg)
File "hydra_run.py", line 24, in run_train
  trainer.fit(model, data)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
  result = fn(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
  results = self.accelerator_backend.train(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
  results = self.trainer.run_pretrain_routine(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
  self.train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 416, in train
  self.run_training_teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1136, in run_training_teardown
  log.info('Saving latest checkpoint..')
Message: 'Saving latest checkpoint..'
Arguments: ()
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 247, in _flush_loggers
  h_weak_ref().flush()
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1065, in flush
  self.stream.flush()
ValueError: I/O operation on closed file.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 23 (5 by maintainers)

Most upvoted comments

Hello @vanpelt,

Good news: with WANDB_START_METHOD=thread the minimal working example works. (If I hit further problems in the production code that uses Lightning, I’ll post here again.)

You can find the MWE here.

Are there any contraindications to always using WANDB_START_METHOD=thread?
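
For reference, a minimal sketch of applying the workaround from Python (setting export WANDB_START_METHOD=thread in the shell is equivalent); the variable has to be set before wandb.init() runs, i.e. before the Hydra task function creates the WandbLogger:

# Sketch: apply the WANDB_START_METHOD=thread workaround from Python.
# The variable must be set before wandb.init() is called.
import os
os.environ["WANDB_START_METHOD"] = "thread"

import hydra
from omegaconf import DictConfig

@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    ...  # unchanged training code

if __name__ == "__main__":
    run_train()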

Hi folks, I’m still seeing this issue when doing more than 2 runs with multirun. See this gist, where I adapted @borisdayma’s example to make it a bit more verbose. System: Python 3.7.9, wandb 0.10.10, hydra 1.0.3.

You can see the output is printed in the first 2 runs but disappears for the 3rd (and all subsequent) runs. Output disappears both for plain print statements and for the Hydra-configured Python logger, and a similar error is raised: OSError: [Errno 9] Bad file descriptor. In real runs, it seems that the jobs still run and sync to wandb even after all stdout is gone.

Hey @lucmos, we just added this start method, so we’re still evaluating performance around it. The default start method is spawn, which has better semantics / logic separation but can interact poorly, especially if your code is leveraging multiprocessing as well. We’re planning to at least detect this case and automatically use thread when appropriate.
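
Later wandb releases also expose the start method as a per-run setting; a sketch, assuming a version whose Settings object accepts start_method (the release discussed in this thread may only honour the environment variable):

# Sketch, assuming a wandb release where Settings exposes start_method;
# older releases only honour the WANDB_START_METHOD environment variable.
import wandb

run = wandb.init(
    project="classification",  # illustrative project name
    settings=wandb.Settings(start_method="thread"),
)
run.log({"loss": 0.0})
run.finish()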

@lucmos Thank you, WANDB_START_METHOD=thread solved my issue.

Thanks for taking a look at it. I’ve already answered under the Hydra issue, but I’ll do it here too since the bug is probably on wandb’s side:

I’ve isolated the error, and it seems to be caused by the reinit=True option in wandb.init() (this is what PL does by default). Setting it to False removes the error.
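
A minimal sketch of the isolated case described above (the project name and loop are illustrative, not a guaranteed reproduction): repeatedly calling wandb.init() in one process, the way successive multirun jobs do, with reinit controlling whether a new run is started or the existing one reused:

# Sketch of the isolated case: repeated wandb.init() calls in one process,
# similar to successive Hydra multirun jobs. The comment above reports that
# dropping reinit=True (the PL default) avoids the closed-stream error.
import wandb

for job_idx in range(3):
    run = wandb.init(project="reinit-test", reinit=True)
    print(f"job {job_idx}")  # routed through wandb's stdout/stderr redirect
    run.finish()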

Hey @lucmos, with the most recent release of wandb, can you try setting the WANDB_START_METHOD environment variable to “thread” and see if you’re still having the same issue?

We’re also doing some work on multiprocessing and on calling wandb.init in both a parent and a child process, which could be relevant here. If you could provide us with a colab or code snippet to reproduce the issue, it would really help us get to the bottom of it.

Nope, I have no progress on this issue. However, the issue doesn’t occur when running with Hydra’s --multirun and the Slurm launcher (note that the default launcher runs jobs locally, sequentially, in the same process). Quickly looking at hydra/launcher=joblib, I think this would probably avoid the issue too, since joblib.Parallel spins up several Python subprocesses.

Thanks @omry for the clarifications. It makes sense. We’re looking into a workaround that would not require closing the run. cc @raubitsj

The intent is to flush all loggers, not stdout.