pytorch-lightning: torch.save with ddp accelerator throwing RuntimeError: Tensors must be CUDA and dense

🐛 Bug

Saving the model with torch.save fails when using the ddp accelerator.

To Reproduce

https://github.com/mlflow/mlflow-torchserve/blob/master/examples/IrisClassification/iris_classification.py

The above example trains an Iris classification model.

Dependent packages:

torch==1.9.0
torchvision==0.10.0
scikit-learn
pytorch-lightning==1.3.7

Run the example with the following command:

python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp

This produces the following error while saving the model:

--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/mlflow-torchserve/examples/IrisClassification/iris_classification.py", line 127, in <module>
    torch.save(model.state_dict(), "iris.pt")
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1259, in state_dict
    module.state_dict(destination, prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 421, in state_dict
    with self.sync_context(dist_sync_fn=self.dist_sync_fn):
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 299, in sync_context
    cache = self.sync(
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 272, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 213, in _sync_dist
    output_dict = apply_to_collection(
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 195, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 195, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 191, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/distributed.py", line 124, in gather_all_tensors
    return _simple_gather_all_tensors(result, group, world_size)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/distributed.py", line 94, in _simple_gather_all_tensors
    torch.distributed.all_gather(gathered_result, result, group)
  File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense
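To illustrate the failure mechanism in the traceback above: torchmetrics' Metric.state_dict enters a sync context that runs all_gather over the metric states, and with the NCCL backend all_gather only accepts CUDA tensors, so serializing a CPU-resident state raises this RuntimeError. The following is a pure-Python sketch (no torch, all class names are stand-ins, not the real APIs) of that call chain:

```python
from contextlib import contextmanager

class FakeBackend:
    """Stand-in for an NCCL process group: rejects non-CUDA tensors."""
    def all_gather(self, tensor_device):
        if not tensor_device.startswith("cuda"):
            raise RuntimeError("Tensors must be CUDA and dense")
        return [tensor_device]

class FakeMetric:
    """Stand-in for torchmetrics.Metric: syncing happens inside state_dict."""
    def __init__(self, backend, device):
        self.backend = backend
        self.device = device

    @contextmanager
    def sync_context(self):
        # state_dict() syncs states across ranks before serializing them
        self.backend.all_gather(self.device)
        yield

    def state_dict(self):
        with self.sync_context():
            return {"total": 0.0}

backend = FakeBackend()
try:
    # mimics torch.save(model.state_dict(), ...) with a CPU-resident metric
    FakeMetric(backend, "cpu").state_dict()
except RuntimeError as e:
    print(e)  # -> Tensors must be CUDA and dense
# with the state on the rank's CUDA device, the sync succeeds
print(FakeMetric(backend, "cuda:0").state_dict())  # -> {'total': 0.0}
```

This also matches why 1.2.7 behaved differently: if the sync is never attempted during state_dict, the device of the metric states does not matter at save time.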

The same script worked for us up to pytorch-lightning 1.2.7. To reproduce the working behavior:

Install pytorch-lightning 1.2.7 (pip install pytorch-lightning==1.2.7) and run the same command again

python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp

Now the model trains and the .pt file is saved successfully.

Attaching both logs with NCCL_DEBUG set to INFO for reference: ptl_model_save_success_1.2.7.txt, ptl_model_save_failure_1.3.7.txt

Expected behavior

The Iris classification model trains successfully and the .pt file is generated.

Environment

  • CUDA:
    • GPU:
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
      • Tesla K80
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.21.0
    • pyTorch_debug: False
    • pyTorch_version: 1.9.0+cu102
    • pytorch-lightning: 1.3.7
    • tqdm: 4.61.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.9.5
    • version: #30~18.04.1-Ubuntu SMP Tue Oct 20 11:09:25 UTC 2020
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

Also tried torch.save(trainer.get_model(), "iris.pt"); in PyTorch Lightning 1.3.7 the same error is shown.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

Hey @chauhang @shrinath-suresh,

There is a fix on this branch for TorchMetrics: PyTorchLightning/metrics#339.

Mind giving it a try?

Best, T.C
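The actual change lives in PyTorchLightning/metrics#339. Purely as a hypothetical illustration (not the real diff, and should_sync is an invented name), the kind of guard that resolves this class of bug is: only attempt the collective sync when there is actually a distributed setup to sync over.

```python
def should_sync(distributed_available, dist_sync_fn=None):
    """Hypothetical guard: skip the all_gather when no initialized
    process group (or no sync function) is available, so state_dict()
    can serialize the metric states as-is."""
    return bool(distributed_available and dist_sync_fn is not None)

# Plain torch.save after training (no live process group): no sync attempted.
print(should_sync(False, dist_sync_fn=lambda x: x))  # -> False
# Inside a ddp run with a sync function: sync proceeds as before.
print(should_sync(True, dist_sync_fn=lambda x: x))   # -> True
```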

Tested with the fix branch.

Both single gpu + ddp

python iris_classification.py --max_epochs 10 --gpus 1 --accelerator ddp

and multi gpu + ddp

python iris_classification.py --max_epochs 10 --gpus 2 --accelerator ddp

are working as expected. Logs attached: iris_classification_multi_gpu_ddp.txt, iris_classification_single_gpu_ddp.txt

@tchaton any insights on this warning?

[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

@shrinath-suresh I added device_ids in #8165 and this warning will disappear. It only shows for torch > 1.8. Let me know if that helps.

@tchaton we need to find a solution for all_gather in the sync function, as it will only work if the module is on the correct device.
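To sketch the point above in plain Python (no torch; gather_state and nccl_all_gather are invented stand-ins, not real APIs): an NCCL all_gather requires every tensor to live on the rank's own CUDA device, so a robust sync helper would have to relocate the state there before issuing the collective.

```python
def nccl_all_gather(device):
    """Stand-in for torch.distributed.all_gather over an NCCL group."""
    if not device.startswith("cuda"):
        raise RuntimeError("Tensors must be CUDA and dense")
    return [device, device]  # pretend world_size == 2

def gather_state(state_device, rank_device, all_gather):
    """Hypothetical helper: move the state to this rank's device
    (stand-in for tensor.to(rank_device)) before the collective."""
    if state_device != rank_device:
        state_device = rank_device
    return all_gather(state_device)

# A CPU-resident state no longer breaks the sync:
print(gather_state("cpu", "cuda:0", nccl_all_gather))  # -> ['cuda:0', 'cuda:0']
```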