imvoxelnet: Runtime Error
I ran the following command: srun -p Ai4sci_3D --gres gpu:1 bash tools/dist_train.sh configs/imvoxelnet/imvoxelnet_kitti.py 1
But I ran into the following problem:
(imvoxel) [nanuey@SH-IDC1-10-140-24-32 imvoxelnet]$ srun -p Ai4sci_3D --gres gpu:1 bash tools/dist_train.sh configs/imvoxelnet/imvoxelnet_kitti.py 1
srun: job 2563067 queued and waiting for resources
srun: job 2563067 has been allocated resources
srun: Job 2563067 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal
Traceback (most recent call last):
  File "tools/train.py", line 15, in <module>
    from mmdet3d.datasets import build_dataset
  File "/mnt/petrelfs/dengken_nerf_perception/imvoxelnet/mmdet3d/datasets/__init__.py", line 1, in <module>
    from mmdet.datasets.builder import build_dataloader
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmdet/datasets/__init__.py", line 2, in <module>
    from .cityscapes import CityscapesDataset
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmdet/datasets/cityscapes.py", line 16, in <module>
    from .coco import CocoDataset
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmdet/datasets/coco.py", line 14, in <module>
    from mmdet.core import eval_recalls
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmdet/core/__init__.py", line 2, in <module>
    from .bbox import *  # noqa: F401, F403
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmdet/core/bbox/__init__.py", line 7, in <module>
    from .samplers import (BaseSampler, CombinedSampler,
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmdet/core/bbox/samplers/__init__.py", line 9, in <module>
    from .score_hlr_sampler import ScoreHLRSampler
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmdet/core/bbox/samplers/score_hlr_sampler.py", line 2, in <module>
    from mmcv.ops import nms_match
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmcv/ops/__init__.py", line 1, in <module>
    from .bbox import bbox_overlaps
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmcv/ops/bbox.py", line 3, in <module>
    ext_module = ext_loader.load_ext('_ext', ['bbox_overlaps'])
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmcv/utils/ext_loader.py", line 11, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: /mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmcv/_ext.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor6deviceEv
/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8099) of binary: /mnt/est/nanuey/anaconda3/envs/imvoxel/bin/python3
Traceback (most recent call last):
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-11-06_19:17:25
  host      : SH-IDC1-10-140-24-88
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8099)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
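The FutureWarning in the log is only a deprecation notice from torch.distributed.launch; the child process exited with code 1 because of the ImportError above. If the training script is later started with torchrun, which exports LOCAL_RANK in the environment instead of passing --local_rank, a minimal sketch for reading the rank either way might look like this. The helper get_local_rank is hypothetical and is not part of imvoxelnet, mmdet3d, or PyTorch.

```python
import argparse
import os


def get_local_rank(argv=None):
    """Return the local rank from --local_rank if given, else from $LOCAL_RANK.

    Hypothetical helper: torch.distributed.launch passes --local_rank on the
    command line, while torchrun only sets the LOCAL_RANK environment variable.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args leaves the training script's other arguments alone
    args, _ = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # Fall back to the environment; default to 0 for a plain single-process run
    return int(os.environ.get("LOCAL_RANK", "0"))


if __name__ == "__main__":
    print("local rank:", get_local_rank())
```

With torch.distributed.launch the command-line value wins; under torchrun the environment variable is used; a plain single-process run falls back to rank 0.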
I installed everything following your instructions and changed only the PyTorch version. My environment is PyTorch 1.10.0, CUDA 11.3, mmdet3d 0.8.0, mmdet 2.10, and mmcv-full 1.2.7.
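Because the versions that actually get imported can differ from what was requested on the pip command line, it may help to print them from inside the same conda environment on the compute node. A minimal sketch, assuming each package exposes a __version__ attribute; collect_versions is a hypothetical helper name.

```python
import importlib


def collect_versions(names):
    """Map each module name to its __version__, or to the error raised on import.

    Hypothetical helper for bug reports: it shows what this interpreter really
    imports, which may not match what pip was asked to install.
    """
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # report any import-time failure instead of crashing
            report[name] = "import failed: {}".format(exc)
            continue
        report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    packages = ["torch", "torchvision", "mmcv", "mmdet", "mmdet3d"]
    for name, version in collect_versions(packages).items():
        print("{:12s} {}".format(name, version))
```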
"pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
Install MMCV
pip install mmcv-full==1.2.7+torch1.6.0+cu101 -f https://openmmlab.oss-accelerate.aliyuncs.com/mmcv/dist/index.html
pip install mmdet==2.10.0
Install MMDetection
git clone https://github.com/saic-vul/imvoxelnet.git
cd imvoxelnet
pip install -r requirements/build.txt
pip install --no-cache-dir -e .
Uninstall pycocotools installed by nuscenes-devkit and reinstall mmpycocotools
pip uninstall pycocotools --no-cache-dir -y
pip install mmpycocotools==12.0.3 --no-cache-dir --force --no-deps
Install differentiable IoU
cd …
git clone https://github.com/lilanxiao/Rotated_IoU
cp -r Rotated_IoU/cuda_op imvoxelnet/mmdet3d/ops/rotated_iou
cd imvoxelnet/mmdet3d/ops/rotated_iou/cuda_op
python setup.py install"
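The mmcv-full line in these instructions pins a wheel built for one specific PyTorch/CUDA pair (+torch1.6.0+cu101), while the environment described above runs PyTorch 1.10.0 with CUDA 11.3. A compiled extension such as mmcv/_ext.so links against the C++ symbols of the torch it was built with (c++filt shows the missing symbol _ZNK2at6Tensor6deviceEv is at::Tensor::device() const), so a mismatch like this is a plausible cause of the "undefined symbol" error. The sketch below compares the two strings, assuming the version follows the <mmcv>+torch<X>+cu<Y> pattern used in the pip command; the function names are hypothetical, and "1.10.0+cu113" is only an example of what torch.__version__ typically looks like for a pip CUDA 11.3 build.

```python
import re

# Assumed naming scheme, taken from the pip command in the instructions:
#   <mmcv version>+torch<torch version>+<cuXYZ or cpu>
WHEEL_TAG = re.compile(r"(?P<mmcv>[\d.]+)\+torch(?P<torch>[\d.]+)\+(?P<cuda>cu\d+|cpu)")


def parse_wheel_tag(tag):
    """Split e.g. '1.2.7+torch1.6.0+cu101' into its mmcv/torch/cuda parts."""
    match = WHEEL_TAG.fullmatch(tag)
    if match is None:
        raise ValueError("unrecognised mmcv-full tag: " + tag)
    return match.groupdict()


def split_torch_version(torch_version):
    """Split e.g. '1.10.0+cu113' into ('1.10.0', 'cu113'); no suffix gives None."""
    release, _, local = torch_version.partition("+")
    return release, (local or None)


def wheel_matches_torch(tag, torch_version):
    """True only if the wheel was built for this torch release (and CUDA, if known)."""
    built = parse_wheel_tag(tag)
    release, cuda = split_torch_version(torch_version)
    if release != built["torch"]:
        return False
    return cuda is None or cuda == built["cuda"]


if __name__ == "__main__":
    tag = "1.2.7+torch1.6.0+cu101"  # mmcv-full pin from the install instructions
    print(wheel_matches_torch(tag, "1.10.0+cu113"))  # prints False
```

The comparison is deliberately strict on the full release string, so it may flag patch-level differences that are in practice compatible.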
But I still get the error shown above: ImportError: /mnt/est/nanuey/anaconda3/envs/imvoxel/lib/python3.7/site-packages/mmcv/_ext.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor6deviceEv
Could you please help me?
Best wishes~
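To check whether the compiled extension loads at all, independent of srun and dist_train.sh, one can import it directly in the same way mmcv's ext_loader does in the traceback (importlib.import_module('mmcv.' + name) with name '_ext'). A minimal sketch; try_import is a hypothetical helper name.

```python
import importlib


def try_import(name):
    """Import `name`; return None on success or the error message on failure."""
    try:
        importlib.import_module(name)
    except ImportError as exc:  # an "undefined symbol" load failure is an ImportError
        return str(exc)
    return None


if __name__ == "__main__":
    # Same module that mmcv/utils/ext_loader.py tries to load in the traceback
    error = try_import("mmcv._ext")
    if error is None:
        print("mmcv._ext loaded OK")
    else:
        print("mmcv._ext failed to load:", error)
```

If this fails with the same undefined symbol, the problem most likely lies in the installed mmcv-full build rather than in the training config or the launcher.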
About this issue
- State: closed
- Created 8 months ago
- Comments: 19 (5 by maintainers)
I succeeded using torch 1.8.1, CUDA 10.1, mmcv 1.3, and mmdet 2.11. Thanks a lot!