pytorch-lightning: torch.save with ddp accelerator throwing RuntimeError: Tensors must be CUDA and dense
🐛 Bug
Model saving with torch.save does not work with the ddp accelerator.
To Reproduce
The example script iris_classification.py trains the Iris classification model.
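For context, here is a minimal sketch of the failing pattern. The script's full contents are not reproduced here; the module layout, the metric attribute, and the final torch.save call are assumptions about its structure, written against the torchmetrics API contemporaneous with Lightning 1.3.x:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
import torchmetrics

class IrisClassification(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 3)
        # torchmetrics state registered on the module is what later triggers
        # a distributed sync (all_gather) inside state_dict()
        self.train_acc = torchmetrics.Accuracy()

    def forward(self, x):
        return self.fc(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        self.train_acc(logits.softmax(dim=-1), y)
        return F.cross_entropy(logits, y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# After trainer.fit(model, ...) with --accelerator ddp, the script saves the
# raw state_dict, which is where the traceback below originates:
#     torch.save(model.state_dict(), "iris.pt")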
Dependent packages:
torch==1.9.0
torchvision==0.10.0
sklearn
pytorch-lightning==1.3.7
Run the example using the following command
python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp
Produces the following error while saving the model
--------------------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ubuntu/mlflow-torchserve/examples/IrisClassification/iris_classification.py", line 127, in <module>
torch.save(model.state_dict(), "iris.pt")
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1259, in state_dict
module.state_dict(destination, prefix + name + '.', keep_vars=keep_vars)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 421, in state_dict
with self.sync_context(dist_sync_fn=self.dist_sync_fn):
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/contextlib.py", line 117, in __enter__
return next(self.gen)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 299, in sync_context
cache = self.sync(
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 272, in sync
self._sync_dist(dist_sync_fn, process_group=process_group)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/metric.py", line 213, in _sync_dist
output_dict = apply_to_collection(
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 195, in apply_to_collection
return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 195, in <dictcomp>
return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/data.py", line 191, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/distributed.py", line 124, in gather_all_tensors
return _simple_gather_all_tensors(result, group, world_size)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torchmetrics/utilities/distributed.py", line 94, in _simple_gather_all_tensors
torch.distributed.all_gather(gathered_result, result, group)
File "/home/ubuntu/anaconda3/envs/pytorch_1.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense
The same script worked for us up to pytorch-lightning 1.2.7. To reproduce:
Install pytorch-lightning 1.2.7 (pip install pytorch-lightning==1.2.7) and run the same command again
python iris_classification.py --max_epochs 30 --gpus 1 --accelerator ddp
Now the model trains and the .pt file is saved successfully.
Attaching both logs with NCCL_DEBUG set to INFO for reference: ptl_model_save_success_1.2.7.txt, ptl_model_save_failure_1.3.7.txt
Expected behavior
The Iris classification model trains successfully and the .pt file is generated.
Environment
- CUDA:
- GPU:
- Tesla K80
- Tesla K80
- Tesla K80
- Tesla K80
- Tesla K80
- Tesla K80
- Tesla K80
- Tesla K80
- available: True
- version: 10.2
- Packages:
- numpy: 1.21.0
- pyTorch_debug: False
- pyTorch_version: 1.9.0+cu102
- pytorch-lightning: 1.3.7
- tqdm: 4.61.1
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.5
- version: #30~18.04.1-Ubuntu SMP Tue Oct 20 11:09:25 UTC 2020
- How you installed PyTorch (conda, pip, source): pip
- If compiling from source, the output of torch.__config__.show():
- Any other relevant information:
Additional context
Also tried torch.save(trainer.get_model(), "iris.pt"); in PyTorch Lightning 1.3.7 the same error is shown.
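Two possible workarounds (not from the original report, just sketches based on Lightning's public API) would be to let the Trainer write the checkpoint, or to move the module back to its GPU so the metric sync's NCCL all_gather operates on CUDA tensors:

import torch
# trainer and model are assumed to be the objects from the training script.

# Option 1: let Lightning handle rank coordination and serialization.
trainer.save_checkpoint("iris.ckpt")

# Option 2: move the module (and therefore the torchmetrics states) back to
# the GPU before calling state_dict(), so the all_gather in the metric sync
# sees CUDA tensors.
model = model.cuda()
torch.save(model.state_dict(), "iris.pt")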
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 18 (11 by maintainers)
Tested with the fix branch.
Both single GPU + DDP
python iris_classification.py --max_epochs 10 --gpus 1 --accelerator ddp
and multi GPU + DDP
python iris_classification.py --max_epochs 10 --gpus 2 --accelerator ddp
are working as expected. Logs attached: iris_classification_multi_gpu_ddp.txt, iris_classification_single_gpu_ddp.txt
@tchaton any insights on this warning?
@shrinath-suresh I added device_ids in #8165 and this warning will disappear. It only shows for torch > 1.8. Let me know if that helps.
@tchaton we need to find a solution for all_gather in the sync function, as it will only work if the module is on the correct device.
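To illustrate the constraint being discussed (a standalone sketch, not code from the issue, assuming a machine with at least two GPUs): with the NCCL backend, torch.distributed.all_gather only accepts CUDA tensors, so syncing metric states that still live on the CPU fails with exactly this error.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def demo(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Passing a CPU tensor reproduces the failure:
    # dist.all_gather([torch.zeros(1) for _ in range(world_size)], torch.zeros(1))
    # -> RuntimeError: Tensors must be CUDA and dense

    # Moving the tensor (i.e. the module holding the metric states) to the
    # correct CUDA device first lets the collective succeed.
    result = torch.full((1,), float(rank), device="cuda")
    gathered = [torch.zeros(1, device="cuda") for _ in range(world_size)]
    dist.all_gather(gathered, result)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(demo, args=(2,), nprocs=2)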