pytorch-lightning: `transfer_batch_to_device` doesn't work under DP

🐛 Bug

This is discussed under #1756 and I’m opening a separate issue here for visibility.

In the training loop, for DP/DDP/DDP2, we do not move the data to devices ourselves but instead rely on the default scatter to transfer it. As a result, `transfer_batch_to_device` is never called.

https://github.com/PyTorchLightning/pytorch-lightning/blob/16a7326e5259a3cdd20a508c34a0f84806d88f8e/pytorch_lightning/trainer/training_loop.py#L736-L737
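For context, this is roughly how the hook is meant to be used (a minimal sketch; `CustomBatch` and `LitModel` are illustrative names, and the exact hook signature may differ between releases):

```python
import pytorch_lightning as pl


class CustomBatch:
    """A custom batch type that the default scatter does not know how to move."""

    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets


class LitModel(pl.LightningModule):
    # Hook signature as in the releases discussed here; newer versions may also
    # receive a dataloader_idx argument.
    def transfer_batch_to_device(self, batch, device):
        if isinstance(batch, CustomBatch):
            # Move the custom batch manually.
            batch.inputs = batch.inputs.to(device)
            batch.targets = batch.targets.to(device)
            return batch
        # Fall back to the default behavior for everything else.
        return super().transfer_batch_to_device(batch, device)
```

With DP/DDP/DDP2, the training loop bypasses this hook in favor of the default scatter, so a custom batch like the one above is never moved to the device correctly.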

Expected behavior

Ideally, `transfer_batch_to_device` should work in all settings. If it is not possible to override this behavior at all, at least a run-time warning and/or a note in the docs should be given.
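For illustration, such a run-time warning could look something like this (a hypothetical check, not actual Lightning code; `warn_if_hook_ignored` and its arguments are made-up names, and it assumes `rank_zero_warn` is importable from `pytorch_lightning.utilities` as in the releases discussed here):

```python
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_warn


def warn_if_hook_ignored(model: pl.LightningModule, distributed_backend: str) -> None:
    """Warn when transfer_batch_to_device is overridden but will be bypassed."""
    overridden = (
        type(model).transfer_batch_to_device
        is not pl.LightningModule.transfer_batch_to_device
    )
    if overridden and distributed_backend in ("dp", "ddp", "ddp2"):
        rank_zero_warn(
            "transfer_batch_to_device is overridden, but it is not called when "
            f"distributed_backend={distributed_backend!r}; the default scatter "
            "is used to move data to devices instead."
        )
```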

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 19 (15 by maintainers)

Most upvoted comments

Not sure whether the earlier label removal counts as new “activity” for the stale bot, so I’m commenting here to indicate that this issue is not stale and still needs to be addressed.

However, with DDP, the `_step` method in `ddp_accelerator.py` (line 173) calls `self.ddp_plugin.on_before_forward(self.trainer.get_model(), *args)`, where `args` is again the batch, batch index, and optimizer index. The `ddp_plugin` then passes all of these arguments on to `transfer_batch_to_device` without doing anything with them. It seems this could be fixed by changing line 173 to `self.ddp_plugin.on_before_forward(self.trainer.get_model(), args[0])`? (See the sketch below.)
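In code, the suggested change amounts to something like the following (a before/after fragment of the call site, not self-contained code; the surrounding code in `ddp_accelerator.py` may differ between versions):

```python
# Current call (line 173): args is (batch, batch_idx, optimizer_idx) and is
# passed on wholesale, so the plugin forwards all three to
# transfer_batch_to_device unchanged.
self.ddp_plugin.on_before_forward(self.trainer.get_model(), *args)

# Suggested call: pass only the batch, so the hook receives what it expects.
self.ddp_plugin.on_before_forward(self.trainer.get_model(), args[0])
```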

This has been updated in the recent refactors, so it won’t be a problem now. You can try master; the new version will be officially released next week, I guess.

@rubencart This seems to be unrelated and should not be a problem for DDP. This part of the code has undergone a lot of changes lately. If you don’t mind, would you send us a repro example in a new issue? If you ping me there and it is fixable, I will fix it.

@edenlightning yes. Still not supported for DP AFAIK.