torchmetrics: `RuntimeError` when attempting to use the `MAP` metric
🐛 Bug
When using the `MAP` metric during `test_step` to update values and `test_epoch_end` to compute them, `MAP.compute` fails with `RuntimeError: Tensors must be CUDA and dense` in `all_gather`.
To Reproduce
Pick a region-based detection network, for example `FasterRCNN` from Lightning Bolts. Implement `test_step` so that `MAP.update()` is called with the model outputs and targets, and call `MAP.compute()` in `test_epoch_end` on the previously gathered results.
Code sample
```python
from pl_bolts.models.detection.faster_rcnn import FasterRCNN as _FasterRCNN
from torchmetrics.detection.map import MAP


class FasterRCNN(_FasterRCNN):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._voc_map = MAP(iou_thresholds=[0.3, 0.5, 0.7], class_metrics=True)

    # <I have here validation_step code, but this should not matter>

    def test_step(self, batch, batch_idx) -> None:
        """Test function run on each batch during test inference."""
        imgs, targets = batch
        outs = self.model(imgs)
        # If the move to CPU is not done, the error is `torch.sort does not support boolean tensor`.
        for target in targets:
            target["boxes"] = target["boxes"].cpu()
            target["labels"] = target["labels"].cpu()
        for target in outs:
            target["boxes"] = target["boxes"].cpu()
            target["labels"] = target["labels"].cpu()
            target["scores"] = target["scores"].cpu()
        # Update the MAP metric created in `__init__`
        self._voc_map.update(outs, targets)
        # No reason to return anything as the values have been collected into `MAP`
        return None

    def test_epoch_end(self, output_results):
        """Test function run after all test steps. Aggregates all test results to mAP."""
        voc = self._voc_map.compute()  # Fails with `RuntimeError: Tensors must be CUDA and dense`
        self.log("mAP", voc)
```
Expected behavior
`MAP.update` should update values correctly whether or not they have been transferred to the GPU device, and `compute` should not fail with `RuntimeError: Tensors must be CUDA and dense` during the call to `torch.distributed.all_gather`.
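For context, `torch.distributed.all_gather` with the NCCL backend only accepts CUDA tensors, which is where the error message originates. The standalone sketch below (a hypothetical two-process script, not taken from the report) shows the same failure mode and the working CUDA variant.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # CPU tensors with the NCCL backend raise:
    #   RuntimeError: Tensors must be CUDA and dense
    # gathered = [torch.zeros(1) for _ in range(world_size)]
    # dist.all_gather(gathered, torch.tensor([float(rank)]))

    # CUDA tensors work as expected.
    gathered = [torch.zeros(1, device="cuda") for _ in range(world_size)]
    dist.all_gather(gathered, torch.tensor([float(rank)], device="cuda"))
    print(rank, gathered)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```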
Environment
- PyTorch Version (e.g., 1.0): 1.9.1
- PL Bolts Version: 0.4.0
- TorchMetrics: 0.6.1
- OS (e.g., Linux): Ubuntu 18.04
- How you installed PyTorch (`conda`, `pip`, source): poetry installation
- Python version: 3.9.6
- CUDA: 11.3
- GPU: NVIDIA GeForce 2080 Ti x 2
Additional context
`distributed_backend=nccl`
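The run is launched across both GPUs with DDP. A hypothetical `Trainer` configuration matching the environment above would look like the following; the exact flags are an assumption, not copied from the original run.

```python
import pytorch_lightning as pl

# Hypothetical launch configuration: 2 GPUs with DDP, which uses NCCL by default on Linux.
trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=1)

model = FasterRCNN(pretrained=True)           # the subclass defined in the code sample above
trainer.test(model, dataloaders=test_loader)  # `test_loader` is assumed to be defined elsewhere
```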
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 15 (5 by maintainers)
Is this issue solved? I got the same error using version 0.7.0. Should I open a new issue, or is this a work in progress?
@timurlenk07 mind providing your sample code? Otherwise I think it is the same as: https://github.com/PyTorchLightning/metrics/issues/608#issuecomment-963153184