espnet: gradient shape error in the first epoch
Have you met this error before? @sw005320
./run.sh --stage 4 --queue g.q --ngpu 4 --etype vggblstm --elayers 3 --eunits 1024 --eprojs 1024 --batchsize 16 --train_set train_nodev_perturb --maxlen_in 2200
0 19700 14.31 15.7234 12.8966 0.871656 70311 1e-08
Exception in main training loop: invalid gradient at index 0 - expected shape [2] but got [4]
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>
main()
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main
train(args)
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 365, in train
trainer.run()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [2] but got [4]
# Accounting: time=71697 threads=1
# Finished at Thu Nov 8 13:12:38 CST 2018 with status 1
Exception in main training loop: invalid gradient at index 0 - expected shape [3] but got [4]
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>
main()
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main
train(args)
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 362, in train
trainer.run()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [3] but got [4]
# Accounting: time=15618 threads=1
# Finished at Sun Oct 28 02:59:52 CST 2018 with status 1
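A likely cause (a hedged guess, not confirmed by the log alone): with `nn.DataParallel` over `--ngpu 4` devices, the gathered `loss` tensor has one element per GPU that actually received a chunk of the batch. If a batch is split across fewer devices than `ngpu` (for example a small final batch of 2 or 3 utterances), `loss.new_ones(self.ngpu)` no longer matches the shape of `loss`, and `backward()` raises exactly this "expected shape [2] but got [4]" error. A minimal sketch reproducing the mismatch, with an illustrative shape-agnostic alternative (variable names here are hypothetical, not from the espnet code):

```python
import torch

ngpu = 4  # what --ngpu requested

# Simulate a loss gathered from only 2 of the 4 GPUs
# (e.g. the last batch was too small to split four ways).
x = torch.ones(2, requires_grad=True)
loss = x * 3.0  # shape [2]

try:
    # This mirrors the failing call: gradient shape [4] vs loss shape [2]
    loss.backward(loss.new_ones(ngpu))
except RuntimeError as e:
    print("backward failed:", e)

# A shape-agnostic alternative: size the gradient from the loss itself,
# so it works however many devices the batch was split across.
x2 = torch.ones(2, requires_grad=True)
loss2 = x2 * 3.0
loss2.backward(loss2.new_ones(loss2.shape))
print(x2.grad)  # each element gets gradient 3.0
```

Equivalently, calling `loss.sum().backward()` (or `loss.mean().backward()`) sidesteps the shape question entirely; which reduction is appropriate depends on how the per-GPU losses are meant to be combined.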
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (15 by maintainers)
#gpus x 2 (the expected gradient shape tracks the number of GPUs that actually received data)