espnet: Multi-GPU problem with the PyTorch backend.
I am using PyTorch 0.4 with 4 GTX 1080Ti GPUs. When I run training with the PyTorch backend on multiple GPUs, I get the following error:
```
# asr_train.py --ngpu 4 --backend pytorch --outdir exp/tr_en_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/results --debugmode 1 --dict data/lang_1char/tr_en_units.txt --debugdir exp/tr_en_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150 --minibatches 0 --verbose 0 --resume --train-json dump/tr_en/deltafalse/data.json --valid-json dump/dt_en/deltafalse/data.json --etype vggblstmp --elayers 4 --eunits 320 --eprojs 320 --subsample 1_2_2_1_1 --dlayers 1 --dunits 300 --atype location --aconv-chans 10 --aconv-filts 100 --mtlalpha 0.5 --batch-size 30 --maxlen-in 800 --maxlen-out 150 --opt adadelta --epochs 100
Started at Fri Jul 6 10:05:31 CST 2018
2018-07-06 10:05:31,582 (asr_train:146) WARNING: Skip DEBUG/INFO messages
2018-07-06 10:05:31,587 (asr_train:186) WARNING: CUDA_VISIBLE_DEVICES is not set.
2018-07-06 10:05:35,803 (e2e_asr_attctc_th:198) WARNING: Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.
Exception in main training loop: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion `output_nr == 0` failed.
Traceback (most recent call last):
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 122, in update_core
    loss = 1. / self.num_gpu * self.model(x)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/lvzhuoran/code/espnet-master/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 224, in <module>
    main()
  File "/home/lvzhuoran/code/espnet-master/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 218, in main
    train(args)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 377, in train
    trainer.run()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 122, in update_core
    loss = 1. / self.num_gpu * self.model(x)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
RuntimeError: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion `output_nr == 0` failed.
```
I googled the error and found this link to be useful.
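For context, the failing pattern in `update_core` boils down to something like the toy sketch below (this is not ESPnet's actual E2E model; it just mirrors the multi-GPU loss computation, and it assumes 4 visible GPUs):

```python
# Toy sketch of the multi-GPU loss pattern from update_core
# (placeholder model, not ESPnet's E2E network; assumes 4 visible GPUs).
import torch
import torch.nn as nn


class ToyLossModel(nn.Module):
    """Stands in for the E2E model: forward() returns a scalar loss."""

    def __init__(self):
        super(ToyLossModel, self).__init__()
        self.rnn = nn.LSTM(80, 320, batch_first=True)

    def forward(self, xs):
        ys, _ = self.rnn(xs)
        return ys.mean()  # pretend this is the CTC/attention loss


num_gpu = 4
model = nn.DataParallel(ToyLossModel().cuda(), device_ids=list(range(num_gpu)))

xs = torch.randn(30, 100, 80).cuda()   # (batch, time, feature)
losses = model(xs)                      # one loss per replica, gathered to GPU 0
loss = 1. / num_gpu * losses.sum()      # mirrors: loss = 1. / self.num_gpu * self.model(x)
loss.backward()
```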
Thanks, George.
About this issue
- State: closed
- Created 6 years ago
- Comments: 21 (14 by maintainers)
DistributedDataParallel will be much better for RNNs. Please use it if possible, even on a single node.
Have a look at our launch utility documentation, which clearly describes how to use DistributedDataParallel: https://pytorch.org/docs/stable/distributed.html#launch-utility
You can write your training script as if it does not receive a split input, which also simplifies your code a lot; see the sketch below.
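A rough single-node sketch of what that looks like (the model, dataset, and script name `train_ddp.py` are placeholders, not ESPnet code); each process only ever sees its own shard of the data, so the script never splits a batch itself:

```python
# Rough single-node DistributedDataParallel sketch (placeholder model/data, not ESPnet code).
# Launch with: python -m torch.distributed.launch --nproc_per_node=4 train_ddp.py
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by the launch utility
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = nn.Linear(80, 10).cuda()
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

dataset = TensorDataset(torch.randn(1024, 80), torch.randint(0, 10, (1024,)))
sampler = DistributedSampler(dataset)                 # shards data across processes
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.Adadelta(model.parameters())
criterion = nn.CrossEntropyLoss()

for epoch in range(2):
    sampler.set_epoch(epoch)                          # reshuffle shards each epoch
    for xs, ys in loader:
        xs, ys = xs.cuda(), ys.cuda()
        optimizer.zero_grad()
        loss = criterion(model(xs), ys)               # full local batch, no manual split
        loss.backward()                               # gradients are all-reduced here
        optimizer.step()
```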
@bobchennan DistributedDataParallel works on a single node as well, and it has been shown to perform much better than DataParallel. Check this and this.
Shouldn’t we be using DistributedDataParallel instead of DataParallel?