NeMo: ASR finetuning/training 7x slower on v1.8.2
Describe the bug
I recently updated from NeMo 1.7.2 to 1.8.2.
When running a training session, the speed is now 4.04 s/it, whereas it used to be 2.97 it/s.
Note that it's now taking ~4 seconds per iteration whereas it used to be ~3 iterations per second!
Here’s my trainer config, since it’s the only thing that’s different from the NeMo examples config for CitriNet-1024 with SPE tokenizer:
trainer:
  devices: 8
  accelerator: gpu
  max_epochs: 100
  max_steps: -1
  num_nodes: 1
  strategy: ddp
  accumulate_grad_batches: 4
  enable_checkpointing: false # Provided by exp_manager
  logger: false # Provided by exp_manager
  log_every_n_steps: 100
  val_check_interval: 1.0
  precision: 16
Steps/Code to reproduce bug
Run any training example script.
Expected behavior
Speed should not be significantly affected.
Environment overview (please complete the following information)
- Environment location: Docker (tried nvcr.io/nvidia/pytorch:22.03-py3 and nvcr.io/nvidia/pytorch:22.02-py3)
- Method of NeMo install: pip install nemo_toolkit[all]==1.8.2
- If method of install is [Docker], provide docker pull & docker run commands used:
  docker run --rm -it --gpus all --ipc=host --env-file .env train
Additional context
Running this on an AWS p3dn.24xlarge instance (8xV100)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 37 (18 by maintainers)
Validation and test datasets are never tarred. We actually explicitly do not support it, because tarring drops a few samples and fair academic comparisons would therefore be impossible. Validation and test datasets are usually run infrequently - once per epoch - and are orders of magnitude smaller than the train set, so it makes sense to leave them as loose files.
Re the incomplete epoch: no, this is not expected. We see full epoch completion with tarred datasets the same as with normal datasets; even the steps per epoch should be roughly the same (± a little due to file dropping). Do you have val_check_interval < 1.0?
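For context, a fractional val_check_interval follows standard PyTorch Lightning semantics: a value below 1.0 triggers validation partway through each epoch. A minimal sketch, with 0.25 purely as an illustrative placeholder:

trainer:
  # A float < 1.0 is a fraction of the training epoch:
  # 0.25 runs validation four times per epoch instead of once at the end.
  val_check_interval: 0.25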
Agreed with your point regarding documentation, will update next week.
Reading 100k files vs. 1024 tar files is very different behaviour for networked IO - cloud providers don’t physically attach storage to your VMs, they use mounted network storage. This is the cost: managing and manipulating 100k individual files is much harder for a network file system than managing 1024 tar files.
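For anyone landing here later, a minimal sketch of what switching the train set to a tarred dataset can look like. The field names follow NeMo's ASR dataset configs as I understand them (is_tarred, tarred_audio_filepaths, shuffle_n); the paths, shard count, and buffer size are placeholders, and the shards plus tarred_audio_manifest.json are assumed to have been produced beforehand with NeMo's tarred-dataset conversion script:

model:
  train_ds:
    # Manifest written alongside the tar shards during conversion (placeholder path)
    manifest_filepath: /data/train_tarred/tarred_audio_manifest.json
    # Shard path pattern; exact brace-expansion syntax depends on your setup
    tarred_audio_filepaths: /data/train_tarred/audio_{0..1023}.tar
    is_tarred: true
    # Per-worker shuffle buffer size for the shard stream
    shuffle_n: 2048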
Wow… what a PITA this whole investigation was, only to realize it was caused by a single flag and an undocumented change (nothing in the release notes and no logged warnings…).
I just ran this with benchmark: false and got the performance back to what was expected… (slightly faster, too). And yes, if you could update the default configs in the repo, that would be appreciated.
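For reference, the fix as a minimal sketch against the trainer config quoted in the issue (NeMo forwards these keys to the PyTorch Lightning Trainer, where benchmark toggles torch.backends.cudnn.benchmark):

trainer:
  devices: 8
  accelerator: gpu
  precision: 16
  # Disable cuDNN autotuning: with highly variable audio lengths, every new
  # input shape triggers re-benchmarking, which costs more than it saves.
  benchmark: false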
Ok, it looks like tarred datasets definitely speed up training as suggested. The first epoch is still slower (for me it went from ~40 min/epoch to 6 min/epoch), but the following epochs are significantly faster (from 20-25 min/epoch down to 2 min/epoch).
Thanks again for helping me debug this.
Two things before closing the issue though:
I see… I did not know that! Thank you for clarifying. I will try another training run with a tarred dataset and see if that improves performance. We do have ~500k individual files, so what you’re suggesting makes sense.
If this is indeed the case, would it be worth pointing out in the docs that JSON manifests may incur a slowdown beyond ~100k files, especially when using a cloud provider / network storage? (Maybe here?)