spconv: ConvTunerSimple_tune_and_cache.cc -> can't find suitable algorithm for 0

  • OS: Ubuntu 20.04
  • Python: 3.10
  • CUDA: 12.0
  • GPU: RTX 4090
  • Torch: 1.13 + CUDA 11.7
  • NVIDIA driver: 525.85.12
  • Precision: fp16 mixed precision (fp32 is fine)

I have tried various methods:

  • install spconv_cu117 and cumm_cu117
  • install spconv_cu120 and cumm_cu120
  • build spconv and cumm
  • build with JIT: spconv_cu117 and cumm_cu117 / spconv_cu120 and cumm_cu120
  • build wheel: spconv_cu117 and cumm_cu117 / spconv_cu120 and cumm_cu120

but they all end with this error:

Traceback (most recent call last):
  File "/home/derek/2DPASS/main.py", line 177, in <module>
    trainer.fit(my_model, train_dataset_loader, val_dataset_loader)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1103, in _run
    results = self._run_stage()
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1182, in _run_stage
    self._run_train()
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run_train
    self._run_sanity_check()
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1267, in _run_sanity_check
    val_loop.run()
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1485, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/derek/2DPASS/network/base_model.py", line 183, in validation_step
    data_dict = self.forward(data_dict)
  File "/home/derek/2DPASS/network/arch_2dpass.py", line 176, in forward
    data_dict = self.model_3d(data_dict)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/2DPASS/network/spvcnn.py", line 175, in forward
    enc_feats.append(self.spv_enc[i](data_dict)) # found spv_env[4] produce nan
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/2DPASS/network/spvcnn.py", line 89, in forward
    v_fea = self.v_enc(data_dict['sparse_tensor'])
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/modules.py", line 138, in forward
    input = module(input)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/derek/2DPASS/network/basic_block.py", line 35, in forward
    output = self.layers(x)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/modules.py", line 138, in forward
    input = module(input)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/conv.py", line 741, in forward
    return self._conv_forward(self.training, input, self.weight, self.bias, add_input,
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/conv.py", line 477, in _conv_forward
    out_features, _, _ = ops.implicit_gemm(
  File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/ops.py", line 1513, in implicit_gemm
    mask_width, tune_res_cpp = ConvGemmOps.implicit_gemm(
RuntimeError: /home/derek/tools/spconv/build/temp.linux-x86_64-cpython-310/spconv/build/core_cc/src/csrc/sparse/convops/convops/ConvTunerSimple/ConvTunerSimple_tune_and_cache.cc(103)
!all_profile_res.empty() assert faild. can't find suitable algorithm for 0

Few things to notice:

  1. I reinstalled Ubuntu. Everything worked on the previous system (Ubuntu 22.04), and I copied the old miniconda folder over to the new system. Could a residual or corrupted file have caused this?
  2. The error message above was captured with a wheel built on my system, yet the last frame of the traceback still points to my local build directory (/home/derek/tools/spconv/build/...) rather than the installed site-packages. That looks very suspicious.


Most upvoted comments

I have encountered the same issue. FP32 works fine during both training and inference. FP16 works fine during training but raises this error during inference.

@FindDefinition I have encountered the same issue: FP32 works fine during both training and inference, while FP16 works during training but raises this error during inference. When using torch.amp.autocast(), the SparseConvTensor's features are not automatically converted to torch.float16 after a sparse convolution during inference (they are converted during training). The error is raised when using x = x.replace_feature(x.features.half()) to force a conversion to fp16.
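A minimal sketch of the situation described there, assuming a tiny stand-alone spconv layer (the shapes and coordinates are illustrative, not taken from 2DPASS):

    import torch
    import spconv.pytorch as spconv

    # Tiny illustrative sparse tensor: 4 active voxels, 16 channels.
    features = torch.randn(4, 16, device="cuda")
    indices = torch.tensor([[0, 0, 0, 0], [0, 0, 0, 1],
                            [0, 0, 1, 0], [0, 1, 0, 0]],
                           dtype=torch.int32, device="cuda")
    x = spconv.SparseConvTensor(features, indices,
                                spatial_shape=[8, 8, 8], batch_size=1)
    conv = spconv.SubMConv3d(16, 32, 3, indice_key="subm0").cuda().eval()

    with torch.no_grad(), torch.cuda.amp.autocast():
        # Forcing fp16 features while conv.weight stays fp32 recreates the
        # (f16 input, f32 weight) combination the tuner cannot match.
        x = x.replace_feature(x.features.half())
        out = conv(x)  # "can't find suitable algorithm" on affected builds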

I got the same problem. Float64 input data can lead to this issue; when I cast the data to float32, it works fine.
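For example, a simple guard before building the sparse tensor (the features tensor below is a stand-in for real voxel features):

    import torch

    features = torch.randn(4, 16, dtype=torch.float64)  # e.g. straight from numpy
    # spconv kernels expect fp32 (or fp16) features; cast float64 down first.
    if features.dtype == torch.float64:
        features = features.float()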


Either use model.eval() and don't use fp16, or use fp16 and don't call model.eval().

I checked the script I provided. When training with 16-mixed precision, ConvTunerSimple.get_all_available returned an empty finally_algos. The static key is built as

    static_key_t static_key = std::make_tuple(layout_i, layout_w, layout_o,
                                              interleave_i, interleave_w, interleave_o,
                                              inp.dtype(), weight.dtype(),
                                              out.dtype(), op_type);

and the one I got was (1, 1, 1, 1, 1, 1, 7, 0, 0, 0), which is not in static_key_to_desps_: the input is f16 (dtype 7) while the weight and output are f32 (dtype 0).

Because implicit_gemm is only used when algo=spconv.ConvAlgo.MaskImplicitGemm, I avoided it by manually setting algo=spconv.ConvAlgo.Native on every sparse conv layer.
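For example (a sketch; the channel sizes are illustrative, the relevant part is the algo argument):

    import spconv.pytorch as spconv

    # Forcing the Native algorithm bypasses the implicit-gemm tuner entirely.
    # Expect it to be slower than the implicit-gemm kernels.
    conv = spconv.SubMConv3d(16, 32, kernel_size=3, indice_key="subm0",
                             algo=spconv.ConvAlgo.Native)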

Hope this helps somebody who runs into the same error.

Same issue here. For me it happens when part of the network is trained in fp32 while the rest is trained in fp16; the first fp32 op then hits this error.

I am getting the same issue with newer versions of the library. Downgrading to version 2.2.3 fixed it.

Details: CUDA 11.7 toolkit, NVIDIA driver 515.86.01 (CUDA 11.7), RTX 3060.