vedastr: RuntimeError: CUDA error: device-side assert triggered
When I was training on my own dataset, I modified the following in the config file:
```python
character = 'aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0123456789'
batch_max_length = 25
num_class = len(character) + 1  # num_class = 197
gpu_id = '5,7'
```
and I ran the command:

```shell
bash tools/dist_train.sh configs/stn_cstr.py 2
```
then I got the error:
```
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [24,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "/home/recognition/vedastr-cstr/tools/train.py", line 49, in <module>
    main()
  File "/home/recognition/vedastr-cstr/tools/train.py", line 45, in main
    runner()
  File "/home/recognition/vedastr-cstr/tools/…/vedastr/runners/train_runner.py", line 165, in __call__
    self._train_batch(img, label)
  File "/home/recognition/vedastr-cstr/tools/…/vedastr/runners/train_runner.py", line 118, in _train_batch
    loss.backward()
  File "/root/anaconda3/envs/vedastr/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/vedastr/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered
```
What might cause it? How do I fix it? Thanks in advance.
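For anyone debugging the same thing: this particular assertion, `t >= 0 && t < n_classes`, fires when some target label index falls outside `[0, num_class)`, e.g. when the dataset contains a character that is missing from `character`. A toy sketch (made-up sizes, not the actual vedastr code) that reproduces the same class of mistake on CPU, where it surfaces as a readable `IndexError` instead of the opaque device-side assert:

```python
import torch
import torch.nn as nn

# Toy example: 5 classes, so valid target indices are 0..4.
num_class = 5
criterion = nn.CrossEntropyLoss()
logits = torch.randn(2, num_class)

# Valid targets: every index lies in [0, num_class).
loss = criterion(logits, torch.tensor([0, 4]))
print("ok, loss =", loss.item())

# Out-of-range target: index 5 with only 5 classes.
# On CPU this raises a descriptive IndexError; on CUDA the same
# mistake shows up as "device-side assert triggered".
try:
    criterion(logits, torch.tensor([0, 5]))
except IndexError as e:
    print("bad label:", e)
```

Running with `CUDA_LAUNCH_BLOCKING=1`, or temporarily on CPU, usually pinpoints which label is out of range.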
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 15 (7 by maintainers)
@bharatsubedi @PhamLeQuangNhat Hi, sorry about these problems. I will fix some bugs and make the config file clearer today. After I test the code successfully, I will update the cstr branch.
@PhamLeQuangNhat If you add
`find_unused_parameters=True`
to the `DistributedDataParallel(...)` call in inference_runner.py, this error will not happen, but you will then hit the problem during validation that you mentioned. I don't know how to solve that error yet; we have to figure it out and share it with everyone.
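For reference, the change described above amounts to passing the flag when the model is wrapped. A generic, self-contained sketch (single-process `gloo` group for illustration; the actual wrapping code in vedastr's inference_runner.py may differ):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Minimal single-process process group so DDP can be constructed.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)  # stand-in for the recognition model

# find_unused_parameters=True tells DDP to tolerate parameters that
# receive no gradient in a given forward pass (e.g. branches that are
# skipped), at the cost of an extra graph traversal per iteration.
ddp_model = DistributedDataParallel(model, find_unused_parameters=True)

out = ddp_model(torch.randn(3, 4))
out.sum().backward()

dist.destroy_process_group()
```

Note the flag only silences the "expected to mark a variable ready" style of DDP error; it does not address the underlying out-of-range label problem.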