espnet: ASR training hangs in epoch 0 after a few iterations
Hi,
I am using ESPnet commit 18ed8b0d76ae4bb32ce901152fdb35d1fc7484e4 (Tue Aug 28 10:56:46 2018 -0400) with PyTorch 0.4.1.
I am trying out LibriSpeech. Training just stops (hangs) in epoch 0 after a few iterations.
I am using the PyTorch backend with ngpus=4. There is no error in the log.
$ tail -f train.log
0  300  288.4  324.985  251.815  0.343726  456.825  1e-08
     total [#.................................................]  3.62%
this epoch [###########################.......................] 54.35%
       300 iter, 0 epoch / 15 epochs
   0.69902 iters/sec. Estimated time to finish: 3:10:15.971187.
Output of nvidia-smi: GPU utilization remains at zero after a few iterations.
I am using cuda-8.0.61 and cudnn-6.
Any comments on this?
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (11 by maintainers)
Commits related to this issue
- use spawn in multiprocessing to fix #404 default multiprocessing has a problem with not fork-safe libraries (e.g. MKL) and as a result the worker process is dead locked without any error messages. Us... — committed to kan-bayashi/espnet by kan-bayashi 5 years ago (see the sketch after this list)
- Merge pull request #1251 from kan-bayashi/fix/multiprocessing Use spawn in multiprocessing to fix #404 — committed to espnet/espnet by ShigekiKarita 5 years ago
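For reference, here is a minimal, self-contained sketch of the idea behind the fix in the first commit above: forcing Python's multiprocessing to use the "spawn" start method so that worker processes start from a fresh interpreter instead of inheriting the parent's thread and lock state from non-fork-safe libraries such as MKL. This is a simplified illustration, not the actual ESPnet code.

```python
import multiprocessing as mp


def worker(i):
    # Stand-in for heavy numerical work that may touch non-fork-safe
    # libraries (e.g. MKL / OpenMP) in the parent process.
    return i * i


if __name__ == "__main__":
    # "fork" (the default on Linux) copies the parent's thread/lock state,
    # which can silently deadlock workers that use non-fork-safe libraries.
    # "spawn" starts a clean interpreter per worker and avoids that failure mode.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(worker, range(8)))
```

The trade-off is that "spawn" is slower to start workers and requires all objects passed to them to be picklable, but it sidesteps the silent deadlock described in the commit.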
The same problem still occurs. We are considering rewriting the IO part with the PyTorch DataLoader.
https://docs.chainer.org/en/stable/reference/generated/chainer.iterators.MultiprocessIterator.html
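For illustration, a rough sketch of what an IO rewrite around `torch.utils.data.DataLoader` could look like in place of Chainer's MultiprocessIterator. The `UtteranceDataset` and `collate` function below are hypothetical stand-ins using random tensors; the real ESPnet loader reads features from JSON manifests and Kaldi archives, and the dimensions, batch size, and padding value here are made up for the example.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class UtteranceDataset(Dataset):
    """Hypothetical dataset yielding variable-length (feature, label) pairs."""

    def __init__(self, num_utts=100, feat_dim=83, max_len=500):
        self.data = [
            (torch.randn(torch.randint(100, max_len, (1,)).item(), feat_dim),
             torch.randint(0, 30, (torch.randint(5, 50, (1,)).item(),)))
            for _ in range(num_utts)
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def collate(batch):
    # Pad variable-length utterances/labels to the longest item in the batch.
    feats, labels = zip(*batch)
    feats = torch.nn.utils.rnn.pad_sequence(feats, batch_first=True)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True,
                                             padding_value=-1)
    return feats, labels


if __name__ == "__main__":
    # num_workers > 0 moves IO into separate worker processes,
    # which is the role MultiprocessIterator plays in Chainer.
    loader = DataLoader(UtteranceDataset(), batch_size=8, shuffle=True,
                        num_workers=4, collate_fn=collate)
    for feats, labels in loader:
        print(feats.shape, labels.shape)
        break
```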