espnet: ASR training hangs in epoch 0 after a few iterations
Hi,
I am using ESPnet commit 18ed8b0d76ae4bb32ce901152fdb35d1fc7484e4 (Tue Aug 28 10:56:46 2018 -0400) with PyTorch 0.4.1.
I am trying out LibriSpeech. Training just stops (hangs) in epoch 0 after a few iterations.
I am using the PyTorch backend with ngpus=4. There is no error in the log.
$ tail -f train.log
0  300  288.4  324.985  251.815  0.343726  456.825  1e-08
     total [#.................................................]  3.62%
this epoch [###########################.......................] 54.35%
       300 iter, 0 epoch / 15 epochs
   0.69902 iters/sec. Estimated time to finish: 3:10:15.971187.
Output of nvidia-smi: GPU utilization remains at zero after a few iterations.
I am using cuda-8.0.61 and cudnn-6.
Any comments on this?
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (11 by maintainers)
Commits related to this issue
- use spawn in multiprocessing to fix #404 default multiprocessing has a problem with not fork-safe libraries (e.g. MKL) and as a result the worker process is dead locked without any error messages. Us... — committed to kan-bayashi/espnet by kan-bayashi 5 years ago (see the sketch after this list)
- Merge pull request #1251 from kan-bayashi/fix/multiprocessing Use spawn in multiprocessing to fix #404 — committed to espnet/espnet by ShigekiKarita 5 years ago
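For reference, here is a minimal, self-contained sketch of the idea behind the fix in the first commit above: forcing Python's multiprocessing to use the "spawn" start method so that worker processes start from a fresh interpreter instead of inheriting the parent's thread and lock state from non-fork-safe libraries such as MKL. This is a simplified illustration, not the actual ESPnet code.

```python
import multiprocessing as mp


def worker(i):
    # Stand-in for heavy numerical work that may touch non-fork-safe
    # libraries (e.g. MKL / OpenMP) in the parent process.
    return i * i


if __name__ == "__main__":
    # "fork" (the default on Linux) copies the parent's thread/lock state,
    # which can silently deadlock workers that use non-fork-safe libraries.
    # "spawn" starts a clean interpreter per worker and avoids that failure mode.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(worker, range(8)))
```

The trade-off is that "spawn" is slower to start workers and requires all objects passed to them to be picklable, but it sidesteps the silent deadlock described in the commit.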
The same problem still occurs. We are considering rewriting the IO part with the PyTorch DataLoader.
https://docs.chainer.org/en/stable/reference/generated/chainer.iterators.MultiprocessIterator.html
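For illustration, a rough sketch of what an IO rewrite around `torch.utils.data.DataLoader` could look like in place of Chainer's MultiprocessIterator. The `UtteranceDataset` and `collate` function below are hypothetical stand-ins using random tensors; the real ESPnet loader reads features from JSON manifests and Kaldi archives, and the dimensions, batch size, and padding value here are made up for the example.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class UtteranceDataset(Dataset):
    """Hypothetical dataset yielding variable-length (feature, label) pairs."""

    def __init__(self, num_utts=100, feat_dim=83, max_len=500):
        self.data = [
            (torch.randn(torch.randint(100, max_len, (1,)).item(), feat_dim),
             torch.randint(0, 30, (torch.randint(5, 50, (1,)).item(),)))
            for _ in range(num_utts)
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def collate(batch):
    # Pad variable-length utterances/labels to the longest item in the batch.
    feats, labels = zip(*batch)
    feats = torch.nn.utils.rnn.pad_sequence(feats, batch_first=True)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True,
                                             padding_value=-1)
    return feats, labels


if __name__ == "__main__":
    # num_workers > 0 moves IO into separate worker processes,
    # which is the role MultiprocessIterator plays in Chainer.
    loader = DataLoader(UtteranceDataset(), batch_size=8, shuffle=True,
                        num_workers=4, collate_fn=collate)
    for feats, labels in loader:
        print(feats.shape, labels.shape)
        break
```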