NeMo: ASR finetuning/training 7x slower on v1.8.2
Describe the bug
I recently updated from NeMo 1.7.2 to 1.8.2.
When running a training session, the speed is now 4.04 s/it, whereas it used to be 2.97 it/s.
Note that it's now taking ~4 seconds per iteration whereas it used to be ~3 iterations per second!
Here’s my trainer config, since it’s the only thing that’s different from the NeMo examples config for CitriNet-1024 with SPE tokenizer:
trainer:
  devices: 8
  accelerator: gpu
  max_epochs: 100
  max_steps: -1
  num_nodes: 1
  strategy: ddp
  accumulate_grad_batches: 4
  enable_checkpointing: false # Provided by exp_manager
  logger: false # Provided by exp_manager
  log_every_n_steps: 100
  val_check_interval: 1.0
  precision: 16
Steps/Code to reproduce bug
Run any training example script.
Expected behavior
Speed should not be significantly affected.
Environment overview (please complete the following information)
- Environment location: Docker (tried nvcr.io/nvidia/pytorch:22.03-py3 and nvcr.io/nvidia/pytorch:22.02-py3)
- Method of NeMo install: pip install nemo_toolkit[all]==1.8.2
- If method of install is [Docker], provide docker pull & docker run commands used:
  docker run --rm -it --gpus all --ipc=host --env-file .env train
Additional context
Running this on an AWS p3dn.24xlarge instance (8xV100)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 37 (18 by maintainers)
Validation and test datasets are never tarred. We actually explicitly do not support it, because tarring drops a few samples and fair academic comparisons would therefore be impossible. Validation and test datasets are usually run infrequently - once per epoch - and are orders of magnitude smaller than the train set, so it makes sense to leave them as loose files.
Re the incomplete epoch: no, this is not expected. We see full epoch completion with tarred datasets the same as with normal datasets; even the steps per epoch should be roughly the same (± a little due to file dropping). Do you have val_check_interval < 1.0?
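For context, a fractional val_check_interval follows standard PyTorch Lightning semantics: a value below 1.0 triggers validation partway through each epoch. A minimal sketch, with 0.25 purely as an illustrative placeholder:

trainer:
  # A float < 1.0 is a fraction of the training epoch:
  # 0.25 runs validation four times per epoch instead of once at the end.
  val_check_interval: 0.25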
Agreed with your point regarding documentation, will update next week.
Reading 100k files vs. 1024 tar files is very different behaviour for networked IO - cloud providers don’t physically attach storage to your VMs, they use mounted network storage. This is the cost: managing and manipulating 100k individual files is much harder for a network file system than managing 1024 tar files.
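For anyone landing here later, a minimal sketch of what switching the train set to a tarred dataset can look like. The field names follow NeMo's ASR dataset configs as I understand them (is_tarred, tarred_audio_filepaths, shuffle_n); the paths, shard count, and buffer size are placeholders, and the shards plus tarred_audio_manifest.json are assumed to have been produced beforehand with NeMo's tarred-dataset conversion script:

model:
  train_ds:
    # Manifest written alongside the tar shards during conversion (placeholder path)
    manifest_filepath: /data/train_tarred/tarred_audio_manifest.json
    # Shard path pattern; exact brace-expansion syntax depends on your setup
    tarred_audio_filepaths: /data/train_tarred/audio_{0..1023}.tar
    is_tarred: true
    # Per-worker shuffle buffer size for the shard stream
    shuffle_n: 2048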
Wow… what a PITA this whole investigation was, only to realize it was caused by a single flag and an undocumented change (nothing in the release notes and no logged warnings…).
I just ran this with benchmark: false and got the performance back to what was expected… (slightly faster, too). And yes, if you could update the default configs in the repo, that would be appreciated.
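For reference, the fix as a minimal sketch against the trainer config quoted in the issue (NeMo forwards these keys to the PyTorch Lightning Trainer, where benchmark toggles torch.backends.cudnn.benchmark):

trainer:
  devices: 8
  accelerator: gpu
  precision: 16
  # Disable cuDNN autotuning: with highly variable audio lengths, every new
  # input shape triggers re-benchmarking, which costs more than it saves.
  benchmark: false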
Ok, it looks like tarred datasets definitely speed up training as suggested. The first epoch is still slower (for me it went from ~40 min/epoch to 6 min/epoch), but the following epochs are significantly faster (from 20-25 min/epoch down to 2 min/epoch).
Thanks again for helping me debug this.
Two things before closing the issue though:
I see… I did not know that! Thank you for clarifying. I will try another training run with a tarred dataset and see if that improves performance. We do have ~500k individual files, so what you’re suggesting makes sense.
If this is indeed the case, would it be worth pointing out in the docs that JSON manifests may incur a slowdown beyond ~100k files, especially when using a cloud provider / network storage? (Maybe here?)