wandb: Logging error on Hydra multiruns using PyTorch Lightning
When I do a Hydra multirun (with my model using PyTorch Lightning), the program always crashes at the end of the second run. The problem seems to be wandb closing a logger prematurely: if I remove the WandbLogger, the problem disappears.
I don't have a simple reproducible example (though I can provide more info if necessary), but my main file is very simple:
from omegaconf import DictConfig, OmegaConf, open_dict
import hydra

from classification.data import Data
from classification.model import Classifier
from classification.utils import get_trainer


@hydra.main(config_name="config")
def run_train(hparams: DictConfig):
    data = Data(batch_size=hparams.batch_size)
    data.prepare_data()
    data.setup("fit")

    with open_dict(hparams):
        hparams.len_train = len(data.train_dataloader())

    model = Classifier(hparams)
    trainer = get_trainer(hparams)
    trainer.fit(model, data)


if __name__ == "__main__":
    run_train()
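wandb only enters through get_trainer, whose implementation isn't shown here; roughly, it attaches a WandbLogger to the Trainer, something along these lines (the project name, Trainer arguments, and the sweep command in the comment are illustrative placeholders, not my actual code):

# Illustrative sketch of get_trainer; the only relevant part is the WandbLogger.
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


def get_trainer(hparams):
    logger = WandbLogger(project="classification")  # placeholder project name
    return pl.Trainer(
        logger=logger,
        max_epochs=hparams.get("max_epochs", 10),
        gpus=1,  # the traceback shows the GPU backend in use
    )


# The crash appears when sweeping, e.g. (illustrative command):
#   python hydra_run.py --multirun batch_size=32,64,128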
Here’s the error log:
--- Logging error ---
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
stream.write(msg + self.terminator)
File "/opt/conda/lib/python3.8/site-packages/wandb/lib/redirect.py", line 23, in write
self.stream.write(data)
ValueError: I/O operation on closed file.
Call stack:
File "hydra_run.py", line 31, in <module>
run_train()
File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
_run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 354, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 355, in <lambda>
lambda: hydra.multirun(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 136, in multirun
return sweeper.sweep(arguments=task_overrides)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 154, in sweep
results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/core_plugins/basic_launcher.py", line 76, in launch
ret = run_job(
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 125, in run_job
ret.return_value = task_function(task_cfg)
File "hydra_run.py", line 24, in run_train
trainer.fit(model, data)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
results = self.accelerator_backend.train(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
results = self.trainer.run_pretrain_routine(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
self.train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 416, in train
self.run_training_teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1136, in run_training_teardown
log.info('Saving latest checkpoint..')
Message: 'Saving latest checkpoint..'
Arguments: ()
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 247, in _flush_loggers
h_weak_ref().flush()
File "/opt/conda/lib/python3.8/logging/__init__.py", line 1065, in flush
self.stream.flush()
ValueError: I/O operation on closed file.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 23 (5 by maintainers)
Hello @vanpelt,

Good news: with WANDB_START_METHOD=thread the minimum working example works. (In case of further problems with production code that uses Lightning, I'll post again here.) You can find the MWE here.

Are there any contraindications to always using WANDB_START_METHOD=thread?
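A minimal sketch of setting the start method, assuming a wandb version that exposes it both as an environment variable and through wandb.Settings:

import os

# Option 1: set the environment variable before wandb initializes
# (same as running: WANDB_START_METHOD=thread python hydra_run.py --multirun ...)
os.environ["WANDB_START_METHOD"] = "thread"

# Option 2: pass it through the settings object; with Lightning, such kwargs
# would need to reach whatever ends up calling wandb.init().
import wandb

run = wandb.init(settings=wandb.Settings(start_method="thread"))
run.finish()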
Hi folks, I'm still seeing this issue if you do more than 2 runs with multirun. See this gist, where I adapted @borisdayma's example to make it a bit more verbose. System: Python 3.7.9, wandb 0.10.10, hydra 1.0.3.
You can see the output is printed in the first 2 runs but disappears for the 3rd (and all subsequent) runs. Output disappears both for plain print statements and for the Hydra-configured Python logger, and a similar error is raised: OSError: [Errno 9] Bad file descriptor. In real runs, it seems that the jobs still run and sync to wandb even when all stdout is gone.

Hey @lucmos, we just added this start method, so we're still evaluating performance around it. The default start method is spawn, which has better semantics / logic separation but can interact poorly, especially if your code is leveraging multiprocessing as well. We're planning to at least detect this case and automatically use thread when appropriate.

@lucmos Thank you, WANDB_START_METHOD=thread solved my issue.
Thanks for taking a look at it. I've already answered under the hydra issue, but I'll do it here too since the bug is probably on wandb's side:

I've isolated the error, and it seems to be caused by the reinit=True option in wandb.init() (this is what PL does by default). Setting it to False removes the error.
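Until that's fixed, one workaround sketch (the placement is an assumption, not something confirmed in this thread) is to close the run explicitly at the end of each Hydra job, so the next job of the sweep starts from a clean state rather than a re-initialized, already-closed stream:

import hydra
import wandb


@hydra.main(config_name="config")
def run_train(hparams):
    ...  # build the data, model, and trainer, then call trainer.fit as above

    # Explicitly finish the current wandb run before Hydra starts the next job
    # of the sweep, instead of relying on reinit to tear it down.
    wandb.finish()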
Hey @lucmos, with the most recent release of wandb, can you try setting the WANDB_START_METHOD environment variable to "thread" and see if you're still having the same issue?

We're also doing some work on multiprocessing and on calling wandb.init in both a parent and a child process that could be relevant here. If you could provide us with a Colab or code snippet to reproduce the issue, it would really help us get to the bottom of it.
Nope, I have no progress on this issue. However, the issue doesn't occur when running with hydra's --multirun and the Slurm launcher (note: the default launcher runs jobs locally, sequentially, in the same process). Quickly looking at hydra/launcher=joblib, I think this would probably avoid the issue too, since joblib.Parallel spins up several Python subprocesses.
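To illustrate why process-based launchers sidestep the error (a conceptual sketch only, not Hydra's actual launcher code):

import multiprocessing as mp


def train_job(cfg):
    """Stand-in for the per-job task function (not the @hydra.main entry point)."""
    ...


def launch(configs):
    # Each job runs in a fresh interpreter, so wandb's stdout/stderr redirection
    # is torn down with that process instead of leaking into the next job.
    for cfg in configs:
        p = mp.Process(target=train_job, args=(cfg,))
        p.start()
        p.join()

Thanks @omry for the clarifications. It makes sense. We're looking into a workaround that would not require closing the run. cc @raubitsj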
The intent is to flush all loggers, not stdout.