espnet: ASR training hangs in epoch 0 after a few iterations

Hi,

Using ESPnet commit 18ed8b0d76ae4bb32ce901152fdb35d1fc7484e4 (Tue Aug 28 10:56:46 2018 -0400) with PyTorch 0.4.1.

I am trying out the LibriSpeech recipe. Training just stops (hangs) in epoch 0 after a few iterations.

I am using the PyTorch backend with ngpus=4. There is no error in the log.

tail -f train.log

0 300 288.4 324.985 251.815 0.343726 456.825 1e-08
     total [#.................................................]  3.62%
this epoch [###########################.......................] 54.35%
       300 iter, 0 epoch / 15 epochs
   0.69902 iters/sec. Estimated time to finish: 3:10:15.971187.

Output of nvidia-smi: GPU utilization remains at zero after a few iterations.

[Screenshot of nvidia-smi output, 2018-09-01 3:48 PM, showing all GPUs at 0% utilization]

I am using CUDA 8.0.61 and cuDNN 6.
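
To see where the training process is actually blocked while it hangs, here is a minimal sketch using only the Python standard library (not ESPnet code); adding it near the top of the training script lets you request a stack trace from the hung process:

import faulthandler
import signal

# After this call, `kill -USR1 <pid>` prints the traceback of every thread
# to stderr, showing whether the process is stuck in data loading, a GPU
# collective, or somewhere else.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optionally, dump tracebacks automatically if the process makes no
# progress for ten minutes.
faulthandler.dump_traceback_later(timeout=600, repeat=True)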

Any comments on this?

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (11 by maintainers)


Most upvoted comments

Still the same problem. We are considering rewriting the I/O part with the PyTorch DataLoader.
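
As a rough illustration, a rewrite along those lines could wrap the existing list of minibatches in a torch.utils.data.Dataset; the `batches` list and `load_batch` function below are hypothetical stand-ins for ESPnet's own batch descriptors and feature reader, not its actual API:

# Sketch of reading minibatches through torch.utils.data instead of
# chainer.iterators.MultiprocessIterator. `batches` (a list of precomputed
# minibatch descriptors) and `load_batch` (which reads the features for one
# descriptor) are hypothetical stand-ins for ESPnet's real data pipeline.
from torch.utils.data import DataLoader, Dataset


class MinibatchDataset(Dataset):
    """Wraps a list of minibatch descriptors so each item is one full minibatch."""

    def __init__(self, batches, load_batch):
        self.batches = batches
        self.load_batch = load_batch

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, idx):
        return self.load_batch(self.batches[idx])


def make_loader(batches, load_batch, num_workers=4):
    # batch_size=1 and collate_fn=lambda x: x[0] because each dataset item
    # is already a complete minibatch.
    return DataLoader(
        MinibatchDataset(batches, load_batch),
        batch_size=1,
        shuffle=True,
        num_workers=num_workers,
        collate_fn=lambda x: x[0],
    )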

chainer/iterators/multiprocess_iterator.py:28: TimeoutWarning: Stalled dataset is detected. 
See the documentation of MultiprocessIterator for common causes and workarounds:

https://docs.chainer.org/en/stable/reference/generated/chainer.iterators.MultiprocessIterator.html
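
Until such a rewrite lands, one common workaround for a stalled MultiprocessIterator is to fall back to an iterator that does not fork worker processes; a minimal sketch, assuming `dataset` and `batch_size` are already defined (an illustration, not what ESPnet ships):

# Workaround sketch: avoid forked worker processes entirely.
# `dataset` and `batch_size` are assumed to be defined elsewhere.
import chainer

# Single-process loading: slower, but immune to fork-related deadlocks.
train_iter = chainer.iterators.SerialIterator(dataset, batch_size, shuffle=True)

# Or keep parallel prefetching, but with threads instead of processes:
# train_iter = chainer.iterators.MultithreadIterator(
#     dataset, batch_size, shuffle=True, n_threads=4)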