bevfusion: RuntimeError: sigmoid_focal_loss_forward_impl: implementation for device cuda:0 not found.
Hi, I tried to train the LiDAR-only detector using this command:
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/lidar/voxelnet_0p075.yaml
but got the following error. All the other training commands work fine except this one. Do I need to build any additional library? Any suggestions? Thanks.
Traceback (most recent call last):
  File "tools/train.py", line 87, in <module>
    main()
  File "tools/train.py", line 76, in main
    train_model(
  File "/home/trainer/bevnet/mmdet3d/apis/train.py", line 126, in train_model
    runner.run(data_loaders, [("train", 1)])
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/trainer/bevnet/mmdet3d/runner/epoch_based_runner.py", line 14, in train
    super().train(data_loader, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/usr/local/lib/python3.8/dist-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/trainer/bevnet/mmdet3d/models/fusion_models/base.py", line 78, in train_step
    losses = self(**data)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/trainer/bevnet/mmdet3d/models/fusion_models/bevfusion.py", line 187, in forward
    outputs = self.forward_single(
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/trainer/bevnet/mmdet3d/models/fusion_models/bevfusion.py", line 269, in forward_single
    losses = head.loss(gt_bboxes_3d, gt_labels_3d, pred_dict)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/trainer/bevnet/mmdet3d/models/heads/bbox/transfusion.py", line 645, in loss
    layer_loss_cls = self.loss_cls(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/focal_loss.py", line 233, in forward
    loss_cls = self.loss_weight * calculate_loss_func(
  File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/focal_loss.py", line 139, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred.contiguous(), target.contiguous(), gamma,
  File "/usr/local/lib/python3.8/dist-packages/mmcv/ops/focal_loss.py", line 55, in forward
    ext_module.sigmoid_focal_loss_forward(
RuntimeError: sigmoid_focal_loss_forward_impl: implementation for device cuda:0 not found.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 21 (6 by maintainers)
I solved this by reinstalling mmcv and mmcv-full with MMCV_WITH_OPS=1 FORCE_CUDA=1. The specific commands for the Docker environment are:
MMCV_WITH_OPS=1 FORCE_CUDA=1 pip install mmcv==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html
MMCV_WITH_OPS=1 FORCE_CUDA=1 pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html
I'm using an A100 with host CUDA version 11.4. After reinstalling, you can use mmcv.utils.collect_env() to check. The "MMCV CUDA Compiler" entry should be 11.3.

This is my base image for the Dockerfile:
nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
and I am using Driver Version: 520.61.05, CUDA Version: 11.8 on the host.

I will try that.
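For anyone who wants to run the check suggested above as a script, here is a minimal sketch. The helper name check_mmcv_cuda_ops and the report keys are made up for illustration. It assumes mmcv 1.x, where mmcv.ops exposes get_compiling_cuda_version(); as far as I can tell, that call returns a version string such as "11.3" when the ops were built with CUDA and "not available" otherwise, which would match the "implementation for device cuda:0 not found" error. Treat it as a quick diagnostic, not an official tool.

```python
def check_mmcv_cuda_ops():
    """Collect a small report on whether mmcv's compiled ops were built with CUDA.

    Hypothetical helper for illustration; the report keys are not part of mmcv.
    """
    report = {
        "torch_cuda": None,          # CUDA version PyTorch was built against
        "mmcv_version": None,        # installed mmcv / mmcv-full version
        "mmcv_cuda_compiler": None,  # CUDA version the mmcv ops were compiled with
        "ops_importable": False,     # whether the compiled extension loads at all
    }
    try:
        import torch
        report["torch_cuda"] = torch.version.cuda
    except ImportError:
        return report
    try:
        import mmcv
        report["mmcv_version"] = mmcv.__version__
    except ImportError:
        return report
    try:
        # Fails if only the "lite" mmcv package (without compiled ops) is installed.
        from mmcv.ops import get_compiling_cuda_version
        report["mmcv_cuda_compiler"] = get_compiling_cuda_version()
        report["ops_importable"] = True
    except Exception:
        pass
    return report


if __name__ == "__main__":
    for key, value in check_mmcv_cuda_ops().items():
        print(f"{key}: {value}")
```

If ops_importable is False, or mmcv_cuda_compiler does not show a CUDA version matching torch_cuda (11.3 in the setup above), reinstalling with MMCV_WITH_OPS=1 FORCE_CUDA=1 as described is the likely fix.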