bevfusion: RuntimeError: sigmoid_focal_loss_forward_impl: implementation for device cuda:0 not found.

Hi, I tried to train the LiDAR-only detector using this command:

torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/lidar/voxelnet_0p075.yaml

but got the following error. All the other training commands work fine except this one. Do I need to build any additional library? Any suggestions? Thanks.

Traceback (most recent call last):
  File "tools/train.py", line 87, in <module>
    main()
  File "tools/train.py", line 76, in main
    train_model(
  File "/home/trainer/bevnet/mmdet3d/apis/train.py", line 126, in train_model
    runner.run(data_loaders, [("train", 1)])
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/trainer/bevnet/mmdet3d/runner/epoch_based_runner.py", line 14, in train
    super().train(data_loader, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/usr/local/lib/python3.8/dist-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/trainer/bevnet/mmdet3d/models/fusion_models/base.py", line 78, in train_step
    losses = self(**data)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/trainer/bevnet/mmdet3d/models/fusion_models/bevfusion.py", line 187, in forward
    outputs = self.forward_single(
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/trainer/bevnet/mmdet3d/models/fusion_models/bevfusion.py", line 269, in forward_single
    losses = head.loss(gt_bboxes_3d, gt_labels_3d, pred_dict)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/trainer/bevnet/mmdet3d/models/heads/bbox/transfusion.py", line 645, in loss
    layer_loss_cls = self.loss_cls(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/focal_loss.py", line 233, in forward
    loss_cls = self.loss_weight * calculate_loss_func(
  File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/focal_loss.py", line 139, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred.contiguous(), target.contiguous(), gamma,
  File "/usr/local/lib/python3.8/dist-packages/mmcv/ops/focal_loss.py", line 55, in forward
    ext_module.sigmoid_focal_loss_forward(
RuntimeError: sigmoid_focal_loss_forward_impl: implementation for device cuda:0 not found.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (6 by maintainers)

Most upvoted comments

The base image in my Dockerfile is nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04, and the host has Driver Version: 520.61.05, CUDA Version: 11.8.

Could you share your Docker image? I upgraded the CUDA version on my V100 host to match yours, but the error still occurs.

I solved this by reinstalling mmcv and mmcv-full with MMCV_WITH_OPS=1 FORCE_CUDA=1. The specific script for the Docker environment is:

MMCV_WITH_OPS=1 FORCE_CUDA=1 pip install mmcv==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html
MMCV_WITH_OPS=1 FORCE_CUDA=1 pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html

I'm using an A100 with host CUDA version 11.4. After reinstalling, you can use mmcv.utils.collect_env() to check the result. The MMCV CUDA compiler should be 11.3.
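For anyone who wants to verify the reinstall before launching a full training run, here is a minimal sketch of the check described above. The function name check_mmcv_cuda_ops is made up for illustration, and it assumes an mmcv 1.x layout in which the compiled ops live in the extension module mmcv._ext and collect_env() reports an "MMCV CUDA Compiler" entry. It only reports what it finds and then tries the same focal-loss op that fails in the traceback.

```python
import importlib


def check_mmcv_cuda_ops():
    """Return (ok, detail) describing whether mmcv's compiled CUDA ops are usable.

    Hypothetical helper for illustration; not part of mmcv or bevfusion.
    """
    try:
        import torch
        mmcv = importlib.import_module("mmcv")
    except ImportError as exc:
        return False, f"torch or mmcv is not importable: {exc}"

    # Assumption: in mmcv 1.x the compiled ops are shipped as mmcv._ext.
    # A build without ops will fail at this import.
    try:
        importlib.import_module("mmcv._ext")
        from mmcv.ops import sigmoid_focal_loss
        from mmcv.utils import collect_env
    except ImportError as exc:
        return False, f"mmcv {mmcv.__version__} has no compiled ops: {exc}"

    # collect_env() returns a dict; .get() is used in case the key is absent.
    env = collect_env()
    summary = (
        f"mmcv {mmcv.__version__}, "
        f"MMCV CUDA Compiler: {env.get('MMCV CUDA Compiler', 'n/a')}, "
        f"torch CUDA: {torch.version.cuda}"
    )

    if not torch.cuda.is_available():
        return False, summary + " (no CUDA device visible to torch)"

    # Smoke test: call the op from the traceback on a tiny CUDA batch.
    pred = torch.randn(4, 3, device="cuda")             # (N, num_classes) logits
    target = torch.randint(0, 3, (4,), device="cuda")   # int64 class indices
    try:
        sigmoid_focal_loss(pred, target, 2.0, 0.25)
    except RuntimeError as exc:
        return False, f"{summary}; op failed: {exc}"
    return True, summary


if __name__ == "__main__":
    ok, detail = check_mmcv_cuda_ops()
    print("OK" if ok else "PROBLEM", "-", detail)
```

If this reports a problem inside the container, the ops were most likely built without CUDA support, and rebuilding with the two environment variables above is the first thing to try.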

I will try that.