speechbrain: I tried to run a LibriSpeech transformer recipe with 8 GPUs, but the word error rate remains very large.
I tried to run the LibriSpeech transformer recipe with 8 GPUs using DDP (https://github.com/speechbrain/speechbrain/blob/develop/recipes/LibriSpeech/ASR/transformer/train.py), but the word error rate remains very large (around 100%) even after 10 epochs.
epoch: 1, lr: 1.00e+00, steps: 4056, optimizer: Adam - train loss: 2.50e+02 - valid loss: 1.32e+02, valid ACC: 1.97e-01
epoch: 2, lr: 1.17e-04, steps: 12845, optimizer: Adam - train loss: 2.11e+02 - valid loss: 1.27e+02, valid ACC: 2.24e-01
epoch: 3, lr: 1.97e-04, steps: 21634, optimizer: Adam - train loss: 2.04e+02 - valid loss: 1.25e+02, valid ACC: 2.43e-01
epoch: 4, lr: 2.07e-04, steps: 30423, optimizer: Adam - train loss: 1.98e+02 - valid loss: 1.23e+02, valid ACC: 2.58e-01
epoch: 5, lr: 1.82e-04, steps: 39212, optimizer: Adam - train loss: 1.93e+02 - valid loss: 1.21e+02, valid ACC: 2.67e-01
epoch: 6, lr: 1.65e-04, steps: 48001, optimizer: Adam - train loss: 1.89e+02 - valid loss: 1.21e+02, valid ACC: 2.71e-01
epoch: 7, lr: 1.51e-04, steps: 56790, optimizer: Adam - train loss: 1.85e+02 - valid loss: 1.21e+02, valid ACC: 2.70e-01
epoch: 8, lr: 1.41e-04, steps: 65579, optimizer: Adam - train loss: 1.82e+02 - valid loss: 1.22e+02, valid ACC: 2.67e-01
epoch: 9, lr: 1.32e-04, steps: 74368, optimizer: Adam - train loss: 1.79e+02 - valid loss: 1.23e+02, valid ACC: 2.64e-01
epoch: 10, lr: 1.25e-04, steps: 83157, optimizer: Adam - train loss: 1.76e+02 - valid loss: 1.24e+02, valid ACC: 2.61e-01, valid WER: 96.31
I ran the following command.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train.py hparams/transformer.yaml --distributed_launch --distributed_backend='nccl'
I reduced batch_size from 16 to 4 to avoid an out-of-memory error, and changed gradient_accumulation from 4 to 1 following https://github.com/speechbrain/speechbrain/issues/899. I also tried training with gradient_accumulation set to 4 and 2, but the results were no different. My environment is as follows.
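For what it's worth, reducing batch_size and gradient_accumulation together changes the effective (global) batch size per optimizer step, which can affect convergence under DDP. A minimal sketch of the arithmetic (the "default" values below are assumptions based on the numbers in this report, not values read from the recipe's hparams file):

```python
# Hedged sketch: effective (global) batch size per optimizer step under DDP.
# Each GPU processes per_gpu_batch utterances per forward pass; gradients are
# accumulated for grad_accumulation passes and averaged across num_gpus ranks.

def effective_batch_size(per_gpu_batch: int, grad_accumulation: int, num_gpus: int) -> int:
    """Utterances contributing to each optimizer step."""
    return per_gpu_batch * grad_accumulation * num_gpus

# Assumed single-GPU recipe default: batch_size 16, gradient_accumulation 4
default = effective_batch_size(16, 4, 1)
# This run: 8 GPUs, batch_size 4, gradient_accumulation 1
this_run = effective_batch_size(4, 1, 8)

print(default, this_run)  # 64 32
```

If the transformer's Noam-style learning-rate schedule was tuned for a particular effective batch size, halving it like this may interact badly with the warmup, so it can be worth adjusting gradient_accumulation to keep the product constant.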
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.27
Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-1063-aws-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB
Nvidia driver version: 460.106.00
The SpeechBrain commit hash is d6bfe13. Could you give me any hints? Thanks for your help.
About this issue
- State: closed
- Created 2 years ago
- Comments: 15
Hey folks, we updated the whole LibriSpeech recipe. The model should now be (1) better and (2) much smaller, and therefore easier to train with fewer GPUs 😃
Thank you for your very kind support. We were able to train models using the new script, so I will close this issue.