mmaction2: Training error for 2s-agcn

I am training 2s-agcn. The raw skeleton data were downloaded from here and converted to mmaction2 format using gen_ntu_rgbd_raw.py, so I have two folders, xsub and xview, after conversion.

Then the following command is used to train:

python tools/train.py configs/skeleton/2s-agcn/2sagcn_80e_ntu60_xsub_keypoint_3d.py --work-dir work_dirs/2sagcn_80e_ntu60_xsub_keypoint_3d --validate --seed 0 --deterministic

The full error output is as follows. What could be wrong?

/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "tools/train.py", line 205, in <module>
    main()
  File "tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 154, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 97, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f7c8eecd8b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f7c8f11f982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7c8eeb8b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f7ccc207b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f7ccc207c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf5 (0x7f7cf6df93d5 in /lib64/libc.so.6)

Aborted (core dumped)
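The kernel assertion `t >= 0 && t < n_classes` fires when a target label handed to NLLLoss/CrossEntropyLoss falls outside `[0, n_classes)`. Because CUDA kernel launches are asynchronous, the Python traceback (here ending in `_parse_losses`) can point well past the op that actually failed; rerunning with `CUDA_LAUNCH_BLOCKING=1` gives an accurate stack. A quick sanity check before training is to scan the annotation labels yourself. The sketch below is a minimal, hypothetical helper (the function name is mine, not mmaction2 API); in practice you would load the pickle produced by gen_ntu_rgbd_raw.py and pass its label list in:

```python
def find_bad_labels(labels, num_classes):
    """Return (index, label) pairs whose label falls outside [0, num_classes)."""
    return [(i, lab) for i, lab in enumerate(labels)
            if not 0 <= lab < num_classes]

# Example: a 60-class head (ntu60) chokes on any label >= 60.
labels = [0, 12, 59, 60, 119]
print(find_bad_labels(labels, 60))  # → [(3, 60), (4, 119)]
```

Any non-empty result means the converted data contains classes the config's head cannot score, which reproduces exactly this device-side assert.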

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

I remember something: I used all the files inside

nturgbd_skeletons_s001_to_s017.zip  
nturgbd_skeletons_s018_to_s032.zip

Is ntu60 only for the first one, nturgbd_skeletons_s001_to_s017.zip? Let me do it again.

Yes. Ntu60 is the data from the first zip, while ntu120 uses both.
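NTU RGB+D filenames encode setup, camera, performer, replication, and action as `SsssCcccPpppRrrrAaaa`. The second zip (s018_to_s032) contains the NTU120-only setups, which include action classes A061–A120, so mixing it into an ntu60 conversion produces labels ≥ 60 and exactly the assert above. A hedged sketch of filtering the raw files down to the NTU60 subset (helper name is mine; the filename pattern is the standard NTU release convention):

```python
import re

# Setups S001–S017 (nturgbd_skeletons_s001_to_s017.zip) make up NTU60;
# S018–S032 were added for NTU120 and include actions A061–A120.
NTU60_MAX_SETUP = 17

def is_ntu60_file(filename):
    """True if a .skeleton filename belongs to the NTU60 subset."""
    m = re.match(r"S(\d{3})C\d{3}P\d{3}R\d{3}A\d{3}", filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename}")
    return int(m.group(1)) <= NTU60_MAX_SETUP

print(is_ntu60_file("S001C001P001R001A001.skeleton"))  # → True
print(is_ntu60_file("S018C001P001R001A061.skeleton"))  # → False
```

Running gen_ntu_rgbd_raw.py on only the files this predicate accepts should keep every label inside `[0, 60)` for the ntu60 config.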