vedastr: RuntimeError: CUDA error: device-side assert triggered

When I was training on my own dataset, I modified the following in the config file:

```python
character = 'aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊj JkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮ ứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0123456789'
batch_max_length = 25
num_class = len(character) + 1  # num_class = 197
gpu_id = '5,7'
```
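(Editorial aside: a quick sanity check can confirm that every character appearing in the ground-truth labels is covered by `character`, since any label character outside the alphabet maps to an out-of-range class index. A minimal sketch with hypothetical stand-in values; substitute the real alphabet and real labels:)

```python
# Hypothetical stand-ins; replace with the real `character` string and labels.
character = "abc0123456789"
labels = ["abc123", "ab-1"]  # imagine "-" sneaks into a ground-truth label

# Any label character not in the alphabet would get an invalid class index.
missing = sorted({ch for label in labels for ch in label if ch not in character})
print(missing)  # ['-']
```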

and I ran the command: `bash tools/dist_train.sh configs/stn_cstr.py 2`

then I got the error:

```
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [24,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "/home/recognition/vedastr-cstr/tools/train.py", line 49, in <module>
    main()
  File "/home/recognition/vedastr-cstr/tools/train.py", line 45, in main
    runner()
  File "/home/recognition/vedastr-cstr/tools/…/vedastr/runners/train_runner.py", line 165, in __call__
    self._train_batch(img, label)
  File "/home/recognition/vedastr-cstr/tools/…/vedastr/runners/train_runner.py", line 118, in _train_batch
    loss.backward()
  File "/root/anaconda3/envs/vedastr/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/vedastr/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered
```

What might cause it? How can I fix it? Thanks in advance.
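(Editorial note: the assert `t >= 0 && t < n_classes` means some target index handed to the loss fell outside `[0, num_class)`. The same mistake on CPU raises a readable error instead of a device-side assert, which makes it easier to diagnose. A minimal sketch with hypothetical numbers, not vedastr's actual pipeline:)

```python
import torch
import torch.nn as nn

num_class = 5
log_probs = torch.log_softmax(torch.randn(2, num_class), dim=1)
targets = torch.tensor([1, 7])  # 7 >= num_class: an invalid class index

try:
    nn.NLLLoss()(log_probs, targets)
except (IndexError, RuntimeError) as err:
    # On CPU this surfaces as a clear out-of-bounds error.
    print("invalid target:", err)
```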

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

@bharatsubedi @PhamLeQuangNhat Hi, sorry for these problems. I will fix some bugs and make the config file clearer today. After I test the code successfully, I will update the cstr branch.

@PhamLeQuangNhat If you add `find_unused_parameters=True` to the `DistributedDataParallel(...)` call in inference_runner.py, this error will not happen, but you will then hit a problem during validation, as you mentioned. I don't know how to solve that error yet; we have to figure it out and share it with everyone.
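(Editorial note: for anyone trying the workaround above, the flag is passed when the model is wrapped in `DistributedDataParallel`. A minimal single-process sketch; the actual wrapping in vedastr's inference_runner.py may look different, and the model here is a stand-in:)

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process "gloo" group so DDP can be constructed without multiple GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(4, 2)  # stand-in for the recognition model
ddp_model = nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True,  # the flag suggested in the comment above
)
print(ddp_model(torch.randn(3, 4)).shape)  # torch.Size([3, 2])
dist.destroy_process_group()
```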