mmaction2: Training error for 2s-agcn
I am training 2s-agcn.
The raw skeleton data were downloaded from here and converted to the mmaction2 format using gen_ntu_rgbd_raw.py, so after conversion I have two folders, xsub and xview.
Then the following command is used for training:
python tools/train.py configs/skeleton/2s-agcn/2sagcn_80e_ntu60_xsub_keypoint_3d.py --work-dir work_dirs/2sagcn_80e_ntu60_xsub_keypoint_3d --validate --seed 0 --deterministic
The full error output is shown below. What could be wrong?
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "tools/train.py", line 205, in <module>
main()
File "tools/train.py", line 201, in main
meta=meta)
File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 154, in train_step
loss, log_vars = self._parse_losses(losses)
File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 97, in _parse_losses
log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f7c8eecd8b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f7c8f11f982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7c8eeb8b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f7ccc207b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f7ccc207c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf5 (0x7f7cf6df93d5 in /lib64/libc.so.6)
Aborted (core dumped)
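The two CUDA assertion lines at the top are the real clue: `t >= 0 && t < n_classes` fails inside ClassNLLCriterion when a ground-truth label falls outside the range the classification head expects, and the later device-side assert is just the fallout (running with CUDA_LAUNCH_BLOCKING=1 gives a more precise traceback). With an ntu60 config the head expects labels in [0, 59], so any label of 60 or higher, i.e. an NTU120-only action, will trip it. Below is a minimal check; the file path and the (sample_names, labels) layout are assumptions based on the usual ST-GCN-style output of the conversion step, so adjust them to whatever gen_ntu_rgbd_raw.py actually wrote for you.

# Minimal label-range check (a sketch, not the official tooling).
# Path and pickle layout are assumptions; adapt to your converted files.
import pickle

with open('data/ntu/xsub/train_label.pkl', 'rb') as f:
    sample_names, labels = pickle.load(f)

print('samples:', len(labels))
print('label range:', min(labels), '-', max(labels))
# The ntu60 xsub config expects labels in [0, 59]; anything >= 60 means
# NTU120-only actions slipped into the training set and will trigger the
# "t >= 0 && t < n_classes" assert shown above.
print('out-of-range labels:', sum(1 for l in labels if not 0 <= l <= 59))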
Yes. NTU60 is the data from the first zip, while NTU120 comes from both zips.
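If you are unsure whether NTU120-only samples ended up in your raw data, the action class is encoded in each skeleton file name (the A### part of S###C###P###R###A###.skeleton), and classes above 60 exist only in the NTU RGB+D 120 release. A quick check follows; the directory path is an assumption, so point it at your own download location.

# Count raw skeleton files whose action id belongs only to NTU RGB+D 120.
import os
import re

skeleton_dir = 'data/ntu/nturgb+d_skeletons'  # adjust to your raw data location
pattern = re.compile(r'A(\d{3})\.skeleton$')

extra = []
for name in os.listdir(skeleton_dir):
    m = pattern.search(name)
    if m and int(m.group(1)) > 60:
        extra.append(name)

print(len(extra), 'NTU120-only samples found')
# If this is non-zero while training an ntu60 config, regenerate the data
# from the first zip only or switch to an ntu120 config.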