pytorch-lightning: Tensorboard logging in multi-gpu setting not working properly?

Hi there 😃

I have a question (it may be an issue with the code or just my ignorance). By the way, I am using the latest version, pytorch-lightning==0.4.9.

If I set the trainer

trainer = Trainer(experiment=exp, gpus=[0])

I can see the corresponding logging (scalars and hyperparameters) in TensorBoard. If I change it to distributed training (keeping the rest of the code unchanged):

trainer = Trainer(experiment=exp, gpus=[0,1], distributed_backend='ddp')

then TensorBoard logging stops working, at least for scalars and hyperparameters: I see nothing except the experiment name.

In both cases, 'exp' is an Experiment instantiated like this:

exp = Experiment(save_dir=/SOME/PATH, name=NAME, version=VERSION, description=DESCRIPTION)
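For reference, here is a consolidated sketch of the snippets above (assuming the 0.4.x-era imports, where Experiment comes from test_tube and is passed straight to the Trainer; the path and names are placeholders carried over from the question, and the version number is hypothetical):

from test_tube import Experiment
from pytorch_lightning import Trainer

exp = Experiment(
    save_dir='/SOME/PATH',    # placeholder path from the question
    name='NAME',              # placeholder experiment name
    version=0,                # hypothetical version number
    description='DESCRIPTION',
)

# Single-GPU run: scalars and hyperparameters show up in TensorBoard.
trainer = Trainer(experiment=exp, gpus=[0])

# Multi-GPU DDP run: only the experiment name appears in TensorBoard.
# trainer = Trainer(experiment=exp, gpus=[0, 1], distributed_backend='ddp')

# trainer.fit(model)  # `model` is whatever LightningModule is being trained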

This picture illustrates the problem.

[Screenshot: TensorBoard run list and scalar chart]

In the picture, the red arrows point to the “distributed” experiment, which draws nothing in the chart. The other two runs (the ones that do appear in the chart) are exactly the same, except that they ran on a single GPU.

Am I missing something, or do I need extra configuration to make logging work in multi-GPU with the ddp setting? Or is it a bug?

Thank you! 😃

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Ok, submitted a PR. Can you install this version and verify it works now?

pip install git+https://github.com/williamFalcon/pytorch-lightning.git@fix_tb_logger_rank --upgrade
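The branch name (fix_tb_logger_rank) hints that the problem is about which DDP process gets to write the logs. As general background rather than the actual PR, the common pattern is to let only the rank-0 process create and use the TensorBoard writer; a minimal sketch in plain PyTorch (assuming torch.distributed for the process group and torch.utils.tensorboard for the writer, with a hypothetical log directory):

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

def is_rank_zero():
    # Treat a non-distributed (single-process) run as rank 0 so it still logs.
    return not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0

# Only rank 0 creates a writer; the other DDP processes skip logging instead
# of racing on the event files or scattering events across separate runs.
writer = SummaryWriter(log_dir='runs/example') if is_rank_zero() else None  # hypothetical log dir

def log_scalar(tag, value, step):
    if writer is not None:
        writer.add_scalar(tag, value, step)

With ddp each GPU runs its own process, so without a guard like this every process behaves as if it were the only logger, which matches the symptom of runs that show up by name but log nothing useful.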