torchmetrics: `RuntimeError` when attempting to use the `MAP` metric

🐛 Bug

When using the `MAP` metric in `test_step` to accumulate values and in `test_epoch_end` to compute them, `MAP.compute()` fails with `RuntimeError: Tensors must be CUDA and dense` in `all_gather`.

To Reproduce

Pick a region-based detection network, for example `FasterRCNN` from Lightning Bolts. Implement a `test_step` that calls `MAP.update()` with the model outputs and targets, and call `MAP.compute()` in `test_epoch_end` on the accumulated results.

Code sample

from typing import Dict

from pl_bolts.models.detection.faster_rcnn import FasterRCNN as _FasterRCNN
from torchmetrics.detection.map import MAP

class FasterRCNN(_FasterRCNN):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._voc_map = MAP(iou_thresholds=[0.3, 0.5, 0.7], class_metrics=True)
    # <validation_step code omitted here; it should not matter>
    def test_step(self, batch, batch_idx) -> Dict[str, float]:
        """Test step run during test inference."""
        imgs, targets = batch
        outs = self.model(imgs)

        # If the tensors are not moved to CPU first, the error is instead
        # `torch.sort does not support boolean tensor`.
        for target in targets:
            target["boxes"] = target["boxes"].cpu()
            target["labels"] = target["labels"].cpu()
        for out in outs:
            out["boxes"] = out["boxes"].cpu()
            out["labels"] = out["labels"].cpu()
            out["scores"] = out["scores"].cpu()
        # Update the MAP metric created in `__init__`
        self._voc_map.update(outs, targets)
        # Nothing to return: the values have been accumulated inside `MAP`
        return None

    def test_epoch_end(self, output_results):
        """Run after all test steps; aggregates the collected results into mAP."""
        voc = self._voc_map.compute()  # Fails with `RuntimeError: Tensors must be CUDA and dense`
        self.log("mAP", voc)

Expected behavior

`MAP.update()` should accept values regardless of whether they have been moved to a GPU device, and `MAP.compute()` should not fail with `RuntimeError: Tensors must be CUDA and dense` during the call to `torch.distributed.all_gather`.

Environment

  • PyTorch Version (e.g., 1.0): 1.9.1
  • PL Bolts Version: 0.4.0
  • TorchMetrics: 0.6.1
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): poetry
  • Python version: 3.9.6
  • CUDA: 11.3
  • NVIDIA GeForce 2080 Ti x 2

Additional context

distributed_backend=nccl

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 15 (5 by maintainers)

Most upvoted comments

Is this issue solved? I got the same error using version 0.7.0. Should I open a new issue, or is this a work in progress?

@timurlenk07 mind providing your sample code? Otherwise I think it is the same as: https://github.com/PyTorchLightning/metrics/issues/608#issuecomment-963153184