pytorch-lightning: WandbLogger cannot be used with 'ddp'
🐛 Bug
wandb modifies init such that a child process calling init gets None back if the master process has already called init. This seems to cause a bug with ddp: rank zero ends up with experiment = None, which crashes the program.
To Reproduce
Can be reproduced with the basic MNIST GPU template: simply add a WandbLogger and pass 'ddp' as the distributed backend.
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/rmrao/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 331, in ddp_train
self.run_pretrain_routine(model)
File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 757, in run_pretrain_routine
self.logger.log_hyperparams(ref_model.hparams)
File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/logging/base.py", line 14, in wrapped_fn
fn(self, *args, **kwargs)
File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/logging/wandb.py", line 79, in log_hyperparams
self.experiment.config.update(params)
AttributeError: 'NoneType' object has no attribute 'config'
This occurs with the latest wandb version and with pytorch-lightning 0.6.
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (14 by maintainers)
This particular problem, I think, stems from this part of the wandb.init(...) code:
Child processes end up getting None for the wandb run object, which causes logging to fail. There are probably two reasonable and complementary solutions:
1. Right now, this is the only part of the logging code that the parent process calls (I assume it's called when pickling):
If this is changed to:
then, unless the user explicitly logs something or creates the wandb experiment first, the main process will not try to create an experiment. Since subsequent logging / saving code is wrapped by the @rank_zero_only decorator, this will generally solve the issue in the base case. It's also possible that these properties are called by master as well. Ideally they would be wrapped so that they do not create the experiment unless it has already been created (i.e. the experiment should only be created by a function that is wrapped with the @rank_zero_only decorator).
2. wandb does allow you to reinitialize the experiment. I tried to play around with this a little and got some errors, but in theory adding this:
should force a re-initialization when wandb is already initialized for rank zero.
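The first suggestion (create the experiment lazily, and only from rank-zero code paths) can be sketched without wandb or Lightning installed. This is not the code from the comment, which did not survive in this copy, and everything in it is a stand-in: LazyLogger, make_fake_run and this rank_zero_only are hypothetical names written for illustration, and the "run" is a plain dict rather than a wandb run object.

```python
import functools


def rank_zero_only(fn):
    """Stand-in decorator: call fn only when the logger's rank is 0."""
    @functools.wraps(fn)
    def wrapped_fn(self, *args, **kwargs):
        if self.rank == 0:
            return fn(self, *args, **kwargs)
        return None
    return wrapped_fn


def make_fake_run():
    # Stand-in for wandb.init(); returns a plain dict, not a real run.
    return {"config": {}}


class LazyLogger:
    """Hypothetical logger that creates its experiment only on first use."""

    def __init__(self, init_fn, rank=0):
        self._init_fn = init_fn
        self._experiment = None
        self.rank = rank

    @property
    def experiment(self):
        # Created on first access, so constructing the logger is cheap.
        if self._experiment is None:
            self._experiment = self._init_fn()
        return self._experiment

    def __getstate__(self):
        # Read the private attribute, never the property: pickling in the
        # parent process must not create the experiment as a side effect.
        state = self.__dict__.copy()
        state["_experiment"] = None
        return state

    @rank_zero_only
    def log_hyperparams(self, params):
        # Only rank zero reaches this line, so only rank zero creates a run.
        self.experiment["config"].update(params)
```

With this shape, building the pickled state in the parent leaves the experiment uncreated, and the first rank-zero logging call is what triggers creation. Whether that is enough in a real setup depends on which other properties the trainer touches before spawning, as the comment notes.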
It is solved in #13166.
Just to clarify @parasj's solution: presumably you also have to pass in some name as a keyword arg while constructing the logger. Unfortunately this means you're now responsible for generating unique names for subsequent runs.
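One way to take on that responsibility is to build the name from a timestamp plus a short random suffix. The helper below is a hypothetical sketch (make_run_name is not part of wandb or Lightning), and it assumes the name is generated once, in the process that constructs the logger.

```python
import time
import uuid


def make_run_name(prefix="run"):
    """Hypothetical helper: prefix, local timestamp, and 8 random hex chars."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{stamp}-{suffix}"


# Assumed usage, following the comment above:
#   logger = WandbLogger(name=make_run_name("mnist"))
```

The random suffix keeps two runs started within the same second from colliding. If every ddp process re-executes the script and calls the helper itself, each would get a different name, so the name would need to be generated once and shared in that case.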