espnet: got a gradient error in first epoch

Have you encountered this error before? @sw005320

./run.sh --stage 4 --queue g.q --ngpu 4 --etype vggblstm --elayers 3 --eunits 1024 --eprojs 1024 --batchsize 16 --train_set train_nodev_perturb --maxlen_in 2200
0           19700       14.31       15.7234        12.8966                                                                                  0.871656                         70311         1e-08
Exception in main training loop: invalid gradient at index 0 - expected shape [2] but got [4]
Traceback (most recent call last):
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run                                                               
    update()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update                                          
    self.update_core()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
    loss.backward(loss.new_ones(self.ngpu))  # Backprop
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward                                                                       
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward                                                            
    allow_unreachable=True)  # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>                                                                                   
    main()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main                                                                                       
    train(args)
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 365, in train
    trainer.run()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run                                                               
    six.reraise(*sys.exc_info())
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run                                                               
    update()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update                                          
    self.update_core()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
    loss.backward(loss.new_ones(self.ngpu))  # Backprop
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward                                                                       
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward                                                            
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [2] but got [4]
# Accounting: time=71697 threads=1
# Finished at Thu Nov 8 13:12:38 CST 2018 with status 1

Another run hit the same error, but with a different expected shape:

Exception in main training loop: invalid gradient at index 0 - expected shape [3] but got [4]
Traceback (most recent call last):
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
    loss.backward(loss.new_ones(self.ngpu))  # Backprop
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>
    main()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main
    train(args)
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 362, in train
    trainer.run()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
    loss.backward(loss.new_ones(self.ngpu))  # Backprop
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [3] but got [4]
# Accounting: time=15618 threads=1
# Finished at Sun Oct 28 02:59:52 CST 2018 with status 1
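
The failing call is loss.backward(loss.new_ones(self.ngpu)) in asr_pytorch.py. Tensor.backward(gradient) requires the seed gradient to have the same shape as the tensor it is called on, and under torch.nn.DataParallel the gathered loss holds one entry per replica that actually received data. If a minibatch contains fewer utterances than --ngpu (easy to hit with long utterances and --maxlen_in 2200), only 2 or 3 replicas are used, the loss has shape [2] or [3], yet the code still passes a gradient of shape [4]. Below is a minimal sketch of that mismatch, assuming this DataParallel behaviour; the tensor is a stand-in, not the real ESPnet loss:

import torch

ngpu = 4                 # GPUs requested on the command line (--ngpu 4)
utts_in_batch = 2        # hypothetical minibatch with only 2 long utterances

# Stand-in for the loss gathered by DataParallel: one value per replica that
# actually got a slice of the batch, i.e. min(ngpu, utts_in_batch) entries.
loss = torch.randn(min(ngpu, utts_in_batch), requires_grad=True)

try:
    # Mirrors update_core: the seed gradient is sized by ngpu, not by loss.shape.
    loss.backward(loss.new_ones(ngpu))
except RuntimeError as e:
    print(e)             # shape-mismatch error analogous to the traceback above

# Sizing the seed gradient by the loss itself avoids the mismatch in this sketch:
loss.backward(loss.new_ones(loss.shape))
print(loss.grad)         # ones with the same shape as the loss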

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

#gpus x 2
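
Reading the comment above as a batch-sizing hint (my interpretation, not confirmed in the thread): every minibatch handed to DataParallel should contain at least as many utterances as --ngpu, and #gpus x 2 would be a safer lower bound, so that no replica is left without data and the gathered loss keeps length ngpu. A hypothetical pre-flight check along those lines, with made-up batches:

ngpu = 4
minibatches = [["utt1", "utt2"], ["utt3", "utt4", "utt5", "utt6"]]  # hypothetical utterance batches
for i, batch in enumerate(minibatches):
    if len(batch) < ngpu:
        # Such a batch would make DataParallel gather a loss of length len(batch),
        # and loss.backward(loss.new_ones(ngpu)) would fail as in the logs above.
        print("batch %d has %d utterances, fewer than ngpu=%d" % (i, len(batch), ngpu))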