spconv: ConvTunerSimple_tune_and_cache.cc -> can't find suitable algorithm for 0
Environment:
- Ubuntu 20.04
- Python: 3.10
- CUDA: 12.0
- GPU: RTX 4090
- Torch: 1.13 + CUDA 11.7
- NVIDIA driver: 525.85.12
- Using: fp16 mixed precision (fp32 is fine)
I have tried various methods:
- install spconv_cu117 and cumm_cu117
- install spconv_cu120 and cumm_cu120
- build spconv and cumm
- build with JIT: spconv_cu117 and cumm_cu117 / spconv_cu120 and cumm_cu120
- build wheel: spconv_cu117 and cumm_cu117 / spconv_cu120 and cumm_cu120
but they all end with this:
Traceback (most recent call last):
File "/home/derek/2DPASS/main.py", line 177, in <module>
trainer.fit(my_model, train_dataset_loader, val_dataset_loader)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1103, in _run
results = self._run_stage()
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1182, in _run_stage
self._run_train()
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run_train
self._run_sanity_check()
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1267, in _run_sanity_check
val_loop.run()
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1485, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/derek/2DPASS/network/base_model.py", line 183, in validation_step
data_dict = self.forward(data_dict)
File "/home/derek/2DPASS/network/arch_2dpass.py", line 176, in forward
data_dict = self.model_3d(data_dict)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/derek/2DPASS/network/spvcnn.py", line 175, in forward
enc_feats.append(self.spv_enc[i](data_dict)) # found spv_env[4] produce nan
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/derek/2DPASS/network/spvcnn.py", line 89, in forward
v_fea = self.v_enc(data_dict['sparse_tensor'])
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/modules.py", line 138, in forward
input = module(input)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/derek/2DPASS/network/basic_block.py", line 35, in forward
output = self.layers(x)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/modules.py", line 138, in forward
input = module(input)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/conv.py", line 741, in forward
return self._conv_forward(self.training, input, self.weight, self.bias, add_input,
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/conv.py", line 477, in _conv_forward
out_features, _, _ = ops.implicit_gemm(
File "/home/stardust/miniconda3/lib/python3.10/site-packages/spconv/pytorch/ops.py", line 1513, in implicit_gemm
mask_width, tune_res_cpp = ConvGemmOps.implicit_gemm(
RuntimeError: /home/derek/tools/spconv/build/temp.linux-x86_64-cpython-310/spconv/build/core_cc/src/csrc/sparse/convops/convops/ConvTunerSimple/ConvTunerSimple_tune_and_cache.cc(103)
!all_profile_res.empty() assert faild. can't find suitable algorithm for 0
A few things to note:
- I reinstalled Ubuntu. Everything worked on the previous system (Ubuntu 22.04), and I copied my old miniconda folder over to the new system. Maybe a residual or corrupted file caused this?
- The error message above was captured while using a wheel built on my system, yet the last line still points to a local build path rather than a system folder. That looks very suspicious.
I have encountered the same issue. FP32 works fine during both training and inference. FP16 works fine during training but raises this error during inference.
@FindDefinition: when using torch.amp.autocast(), a SparseConvTensor's features are not automatically converted to torch.float16 after a sparse convolution during inference (they are converted to torch.float16 during training). This error is raised when using x = x.replace_feature(x.features.half()) to force a conversion to fp16.
I got the same problem. Float64 input data may lead to this issue; when I cast the data to float32, it works fine.
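Based on that observation, one way to sidestep the mismatch is to keep the SparseConvTensor's features in the same dtype as the conv layer's weights before each call. A minimal sketch, not from the thread (`layer` and `x` are placeholders for a spconv conv module and a SparseConvTensor):

```python
def forward_consistent(layer, x):
    # Cast the features to the conv weight's dtype so input, weight, and
    # output dtypes agree; the tuner then finds a registered algorithm key.
    x = x.replace_feature(x.features.to(layer.weight.dtype))
    return layer(x)
```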
Either use model.eval() and don't use fp16, or don't use model.eval() and use fp16.
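In code, those two setups look roughly like this (a sketch; `model` and `batch` are hypothetical):

```python
import torch

# Option 1: eval mode with plain fp32 inference.
model.eval()
with torch.no_grad():
    out = model(batch)

# Option 2: fp16 autocast inference while the model stays in train mode.
model.train()
with torch.no_grad(), torch.cuda.amp.autocast():
    out = model(batch)
```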
I checked the script I provided. When I used 16-mixed precision training, ConvTunerSimple.get_all_available returned an empty finally_algos. The reason is that the static key, static_key_t static_key = std::make_tuple(layout_i, layout_w, layout_o, interleave_i, interleave_w, interleave_o, inp.dtype(), weight.dtype(), out.dtype(), op_type), came out as (1, 1, 1, 1, 1, 1, 7, 0, 0, 0), which is not in static_key_to_desps_: the input is fp16, but the weight and output are fp32. Because implicit_gemm is only used when algo = spconv.ConvAlgo.MaskImplicitGemm, I avoided it by manually setting algo=spconv.ConvAlgo.Native everywhere. Hope this helps somebody who hits the same error.
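For reference, a sketch of that workaround (channel and kernel sizes are placeholders; spconv's conv modules accept an `algo` argument):

```python
import spconv.pytorch as spconv

# Build layers with the Native algorithm so MaskImplicitGemm (and the
# failing tuner) is never invoked.
conv = spconv.SubMConv3d(
    16, 32, kernel_size=3, indice_key="subm1",
    algo=spconv.ConvAlgo.Native,
)

# Or patch an existing model in place (assumes every sparse conv derives
# from spconv.conv.SparseConvolution, which stores the `algo` attribute):
def force_native_algo(model):
    for m in model.modules():
        if isinstance(m, spconv.conv.SparseConvolution):
            m.algo = spconv.ConvAlgo.Native
```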
Same issue here. For me, it happens when part of the network is trained in fp32 while the rest is trained in fp16; the first fp32 op runs into the error.
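To locate that boundary, a small diagnostic sketch (not from the thread; `model` is a placeholder) can report each sparse conv's weight dtype:

```python
import spconv.pytorch as spconv

def report_sparse_conv_dtypes(model):
    # A mix of float16 and float32 entries marks the fp16/fp32 boundary.
    for name, m in model.named_modules():
        if isinstance(m, spconv.conv.SparseConvolution):
            print(f"{name}: weight dtype = {m.weight.dtype}")
```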
I am getting the same issue with newer versions of the library. Downgrading to version 2.2.3 made it work for me.
Details: CUDA toolkit 11.7, NVIDIA driver 515.86.01 (CUDA 11.7), RTX 3060.