STT: Bug: Batches are not evaluated in parallel
Describe the bug
I tried running STT on a system with 8 NVIDIA A100 GPUs. I noticed that running on up to 4 devices does not scale well: each batch seems to be serialized across the GPUs. In the first steps everything looks OK, but then parallel execution gets worse over time. I attached traces generated with NVIDIA Nsight Systems.
I checked my environment using a simple MNIST example, where this problem does not occur.
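Independently of the MNIST run, here is a quick sketch of a check that the container actually sees all 8 GPUs (nvidia-smi query fields and TF's device_lib are standard, but this is illustrative, not the exact check I ran):
nvidia-smi --query-gpu=index,name,memory.total --format=csv
python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU'])"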
Environment (please complete the following information):
I used NVIDIA's Docker container nvcr.io/nvidia/tensorflow:22.02-tf1-py3,
as it is used in Dockerfile.train
python -m coqui_stt_training.train \
--train_files $CLIPSDIR/train.csv \
--train_batch_size 128 \
--n_hidden 2048 \
--learning_rate 0.0001 \
--dropout_rate 0.40 \
--epochs 2 \
--cache_for_epochs 10 \
--read_buffer 1G \
--shuffle_batches true \
--lm_alpha 0.931289039105002 \
--lm_beta 1.1834137581510284 \
--log_level 1 \
--checkpoint_dir ${OUTDIR}/checkpoints \
--summary_dir ${OUTDIR}/tensorboard \
--train_cudnn true \
--alphabet_config_path=$CLIPSDIR/alphabet.txt \
--skip_batch_test true \
--automatic_mixed_precision true \
--show_progressbar false
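For anyone wanting to reproduce such a trace, wrapping the same command in Nsight Systems works roughly like this (the profiler options shown are illustrative, not necessarily the ones used for the attached traces; the remaining training flags stay as above):
nsys profile \
--trace=cuda,nvtx,osrt \
--output=stt_train_trace \
--force-overwrite=true \
python -m coqui_stt_training.train \
--train_files $CLIPSDIR/train.csv \
--train_batch_size 128 \
--train_cudnn true \
--alphabet_config_path=$CLIPSDIR/alphabet.txt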
About this issue
- State: open
- Created 2 years ago
- Comments: 24 (4 by maintainers)
Hi,
It was a transfer-learning issue: I had forgotten to add alphabet.txt correctly. I am now running on the Coqui Docker image 1.0.0 with a batch size of 16 and using all 8 GPUs.
On newer Coqui Docker images this does not work.
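For anyone hitting the same problem, a minimal sketch of the transfer-learning call with the alphabet supplied explicitly; the checkpoint paths are placeholders and the flag names assume the 1.0.0 training module, so double-check them against your version:
python -m coqui_stt_training.train \
--train_files $CLIPSDIR/train.csv \
--train_batch_size 16 \
--epochs 2 \
--load_checkpoint_dir /path/to/pretrained/checkpoints \
--save_checkpoint_dir ${OUTDIR}/checkpoints \
--drop_source_layers 1 \
--alphabet_config_path $CLIPSDIR/alphabet.txt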
No offense, but not everybody has 3+ GPU systems, let alone 8xA100’s 😃 I know I don’t have them 😄
I think with 4 GB of GPU memory you should be able to use batch sizes larger than 1. Using larger batch sizes will reduce training time.
Nevertheless, in my opinion this is not related to this issue. Maybe open a separate issue.
Can you give more details on which information you need, @SuperKogito? I tried running on a system with 8 x NVIDIA A100-SXM4 (40 GB) and 2 x AMD EPYC 7352 CPUs (24 cores) @ 2.3 GHz. As I already said, it did not scale with more than 4 GPUs, with a speedup close to 2 if I remember correctly. Kernels are simply not evaluated in parallel.
One of the problems in my setup was caused by Python's multiprocessing. I could not figure out the specific cause, but using MPI instead of a multiprocessing pool helped. Moreover, I found that running without
--normalize_sample_rate false --audio_sample_rate 16000
i.e. doing the resampling on the CPU during training, caused problems keeping 8 GPUs supplied with data. Moreover, I switched to Horovod for multi-GPU usage, but it is not supported in the official STT.
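For reference, a sketch of how such a run can be launched; horovodrun is the standard Horovod launcher, but coqui_stt_training.train has to be patched for Horovod first, since the official training code does not support it, and the resampling flags only make sense if the clips are already at 16 kHz:
horovodrun -np 8 \
python -m coqui_stt_training.train \
--train_files $CLIPSDIR/train.csv \
--train_batch_size 128 \
--normalize_sample_rate false \
--audio_sample_rate 16000 \
--train_cudnn true \
--alphabet_config_path=$CLIPSDIR/alphabet.txt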
To sum up, from my perspective you will not benefit from running on more than 2 A100 devices, since this seems to be of no interest to the devs at the moment.