NeMo: Can't train ASR conformer-transducer
Describe the bug
Hello! I have a problem: I tried to train a Conformer-Transducer model, but I got stuck right at the start. I think it's because of the trainer parameters or some CUDA error, but I'm not sure…
[NeMo W 2023-10-17 16:18:53 optimizers:54] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-10-17 16:18:58 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
device: cuda
[NeMo I 2023-10-17 16:19:03 mixins:170] Tokenizer SentencePieceTokenizer initialized with 1024 tokens
[NeMo W 2023-10-17 16:19:05 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: true
num_workers: 8
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 20.0
min_duration: 0.1
is_tarred: false
tarred_audio_filepaths: null
shuffle_n: 2048
bucketing_strategy: synced_randomized
bucketing_batch_size: null
bucketing_weights: ''
[NeMo W 2023-10-17 16:19:05 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: true
use_start_end_token: false
[NeMo W 2023-10-17 16:19:05 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
sample_rate: 16000
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: true
use_start_end_token: false
[NeMo I 2023-10-17 16:19:05 features:287] PADDING: 0
[NeMo W 2023-10-17 16:19:07 nemo_logging:349] /root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/torch/nn/modules/rnn.py:67: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.1 and num_layers=1
warnings.warn("dropout option adds dropout after all but last "
[NeMo I 2023-10-17 16:19:07 rnnt_models:206] Using RNNT Loss : warprnnt_numba
Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0}
[NeMo I 2023-10-17 16:19:16 save_restore_connector:249] Model EncDecRNNTBPEModel was successfully restored from /root/.cache/huggingface/hub/models--nvidia--stt_ru_conformer_transducer_large/snapshots/687d02db291e931455cf321abd625ef2b7f0b1a9/stt_ru_conformer_transducer_large.nemo.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2023-10-17 16:19:17 collections:193] Dataset loaded with 26757 files totalling 31.35 hours
[NeMo I 2023-10-17 16:19:17 collections:194] 0 files were filtered totalling 0.00 hours
[NeMo I 2023-10-17 16:19:19 collections:193] Dataset loaded with 7135 files totalling 8.24 hours
[NeMo I 2023-10-17 16:19:19 collections:194] 0 files were filtered totalling 0.00 hours
[NeMo W 2023-10-17 16:19:19 audio_to_text_dataset:675] Could not load dataset as `manifest_filepath` was None. Provided config : {'manifest_filepath': None, 'sample_rate': 16000, 'batch_size': 16, 'shuffle': False, 'num_workers': 8, 'pin_memory': True, 'use_start_end_token': False}
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo I 2023-10-17 16:19:20 modelPT:722] Optimizer config = Novograd (
Parameter Group 0
amsgrad: False
betas: [0.9, 0.98]
eps: 1e-08
grad_averaging: False
lr: 0.0001
weight_decay: 0.001
)
[NeMo I 2023-10-17 16:19:20 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7fa5600805b0>"
will be used during training (effective maximum steps = 22400) -
Parameters :
(warmup_steps: 10000
warmup_ratio: null
min_lr: 1.0e-06
max_steps: 22400
)
| Name | Type | Params
------------------------------------------------------------------------
0 | preprocessor | AudioToMelSpectrogramPreprocessor | 0
1 | encoder | ConformerEncoder | 115 M
2 | decoder | RNNTDecoder | 3.9 M
3 | joint | RNNTJoint | 1.4 M
4 | loss | RNNTLoss | 0
5 | spec_augmentation | SpectrogramAugmentation | 0
6 | wer | RNNTBPEWER | 0
------------------------------------------------------------------------
5.4 M Trainable params
115 M Non-trainable params
120 M Total params
481.780 Total estimated model params size (MB)
Epoch 0: 0%| | 0/1130 [00:00<?, ?it/s]Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/sdb/nemo/stt_conformer/src/models/train_model.py", line 156, in <module>
main()
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/src/models/train_model.py", line 150, in main
trainer.fit(model)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 239, in _run_optimization
closure()
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
self._result = self.closure(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
step_output = self._step_fn()
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
return self.model.training_step(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/utils/model_utils.py", line 380, in wrap_training_step
output_dict = wrapped(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/models/rnnt_models.py", line 712, in training_step
loss_value, wer, _, _ = self.joint(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/core/classes/common.py", line 1087, in __call__
outputs = wrapped(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/modules/rnnt.py", line 1335, in forward
loss_batch = self.loss(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/core/classes/common.py", line 1087, in __call__
outputs = wrapped(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/losses/rnnt.py", line 361, in forward
loss = self._loss(acts=log_probs, labels=targets, act_lens=input_lengths, label_lens=target_lengths)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_pytorch.py", line 281, in forward
return self.loss(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/rnnt_pytorch.py", line 62, in forward
loss_func(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/rnnt.py", line 223, in rnnt_loss_gpu
status = wrapper.cost_and_grad(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/gpu_rnnt.py", line 249, in cost_and_grad
return self.compute_cost_and_score(acts, grads, costs, pad_labels, label_lengths, input_lengths)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/gpu_rnnt.py", line 158, in compute_cost_and_score
self.log_softmax(acts, denom)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/gpu_rnnt.py", line 104, in log_softmax
reduce.reduce_max(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/reduce.py", line 353, in reduce_max
return ReduceHelper(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/reduce.py", line 294, in ReduceHelper
_reduce_rows[grid_size, CTA_REDUCE_SIZE, stream, 0](I_opid, R_opid, acts, output, num_rows)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/cuda/dispatcher.py", line 542, in __call__
return self.dispatcher.call(args, self.griddim, self.blockdim,
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/cuda/dispatcher.py", line 676, in call
kernel = _dispatcher.Dispatcher._cuda_call(self, *args)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/cuda/dispatcher.py", line 684, in _compile_for_args
return self.compile(tuple(argtypes))
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/cuda/dispatcher.py", line 927, in compile
kernel = _Kernel(self.py_func, argtypes, **self.targetoptions)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
return func(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/cuda/dispatcher.py", line 84, in __init__
cres = compile_cuda(self.py_func, types.void, self.argtypes,
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
return func(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/cuda/compiler.py", line 230, in compile_cuda
cres = compiler.compile_extra(typingctx=typingctx,
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler.py", line 742, in compile_extra
return pipeline.compile_extra(func)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler.py", line 460, in compile_extra
return self._compile_bytecode()
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler.py", line 528, in _compile_bytecode
return self._compile_core()
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler.py", line 507, in _compile_core
raise e
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler.py", line 494, in _compile_core
pm.run(self.state)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler_machinery.py", line 368, in run
raise patched_exception
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler_machinery.py", line 356, in run
self._runPass(idx, pass_inst, state)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
return func(*args, **kwargs)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler_machinery.py", line 311, in _runPass
mutated |= check(pss.run_pass, internal_state)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/compiler_machinery.py", line 273, in check
mangled = func(compiler_state)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/typed_passes.py", line 110, in run_pass
typemap, return_type, calltypes, errs = type_inference_stage(
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/typed_passes.py", line 88, in type_inference_stage
errs = infer.propagate(raise_errors=raise_errors)
File "/root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/numba/core/typeinfer.py", line 1086, in propagate
raise errors[0]
numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Internal error at <numba.core.typeinfer.CallConstraint object at 0x7fa59609cbb0>.
libNVVM cannot be found. Do `conda install cudatoolkit`:
[Errno 2] No such file or directory: '/usr/local/cuda/nvvm/lib64'
During: resolving callee type: type(CUDADispatcher(<function exponential at 0x7fa56b25c430>))
During: typing of call at /root/sdb/nemo/stt_conformer/new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/reduce.py (158)
Enable logging at debug level for details.
File "new_venv/lib/python3.8/site-packages/nemo/collections/asr/parts/numba/rnnt_loss/utils/cuda_utils/reduce.py", line 158:
def _reduce_rows(I_opid: int, R_opid: int, acts, output, num_rows: int):
<source elided>
if I_opid == 0:
curr = rnnt_helper.exponential(curr)
^
Epoch 0: 0%| | 0/1130 [00:11<?, ?it/s]
I installed the environment using `pip install -r requirements.txt`. The requirements file contains the following:
pandas==1.5.3
torch==1.13.0
pytorch_lightning==1.8.6
omegaconf==2.2.3
nemo_toolkit==1.18.1
optuna==3.1.0
pyctcdecode==0.5.0
swifter==1.3.4
openpyxl==3.1.2
torchvision==0.14.0
torchmetrics==0.11.4
torchaudio==0.13.0
nemo-text-processing==0.1.7rc0
jiwer==3.0.1
hydra-core==1.3.2
librosa==0.10.1
sentencepiece==0.1.99
youtokentome==1.0.6
braceexpand==0.1.7
webdataset==0.1.62
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
editdistance==0.6.2
Here is the `nvidia-smi` output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID V100DX-32Q On | 00000000:02:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 32768MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
OS: Ubuntu 18.04.6 LTS
Python version: 3.8.16
What really surprised me is that training a plain Conformer model works fine.
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 34 (2 by maintainers)
Quite odd, cudatoolkit is part and parcel of PyTorch. Can you try doing `conda install cudatoolkit=11.8` and see if it solves this problem?
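As a quick sanity check after installing cudatoolkit, you can ask numba directly whether it now finds libNVVM and the GPU (a minimal sketch; the exact report format varies between numba versions):

```python
# Verify that numba can resolve the CUDA libraries (nvvm, libdevice) and see the device.
from numba import cuda
from numba.cuda.cudadrv import libs

libs.test()                 # prints which library paths numba resolved, including nvvm
print(cuda.is_available())  # should be True once the driver and toolkit libraries are usable
```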
You can train it, but it's harder. You'll need an aggressive grid search over the tuning parameters to see how to min-max results on small data. Or you can use adapters if you're not changing the tokenizer of the model.
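For reference, the adapter route looks roughly like this with NeMo's adapter mixin API (a sketch only; the adapter name and dimension are arbitrary examples, so check the adapters tutorial for your NeMo version):

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.common.parts.adapter_modules import LinearAdapterConfig

# Restore the pretrained Russian Conformer-Transducer and attach a small adapter.
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("stt_ru_conformer_transducer_large")

adapter_cfg = LinearAdapterConfig(in_features=model.cfg.encoder.d_model, dim=64)
model.add_adapter(name="my_adapter", cfg=adapter_cfg)
model.set_enabled_adapters(enabled=False)                    # disable any previously added adapters
model.set_enabled_adapters(name="my_adapter", enabled=True)  # enable only the new one

model.freeze()                     # freeze the base model weights
model.unfreeze_enabled_adapters()  # leave only the adapter parameters trainable
```

With only the adapter parameters trainable, a small dataset goes a lot further and the pretrained weights stay intact.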
RNNT loss and joint computation are super expensive on memory, so we disable eval loss calculation for RNNT and instead use val_wer as the metric.
We still have the flag `compute_eval_loss` to compute the eval loss if you really need it, but it wastes a lot of memory.
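If you really do need it, the flag is set through the model config, e.g. (a sketch; the config path below is just an example):

```python
from omegaconf import OmegaConf, open_dict

cfg = OmegaConf.load("conf/conformer_transducer_bpe.yaml")  # hypothetical path to your config
with open_dict(cfg.model):
    cfg.model.compute_eval_loss = True  # also compute RNNT loss during validation (costs extra memory)
```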
Oh that is perfectly fine, do not worry about the performance warning.
The CUDA kernel for RNNT is designed so that it's optimal at large batch sizes, but RNNT at large batch sizes will exhaust memory. It doesn't matter much; the CUDA kernel is still 300x faster than a hand-written PyTorch loop with autograd, and it computes the loss in 20-50 ms per step anyway.
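Related to that trade-off, the public Conformer-Transducer configs expose a fused joint mode that evaluates the joint, loss, and WER in sub-batches to cap peak memory; a hedged sketch of those overrides (key names from the standard configs, values purely illustrative):

```python
from omegaconf import OmegaConf, open_dict

cfg = OmegaConf.load("conf/conformer_transducer_bpe.yaml")  # hypothetical path to your config
with open_dict(cfg.model.joint):
    cfg.model.joint.fuse_loss_wer = True   # fuse joint + loss + WER computation
    cfg.model.joint.fused_batch_size = 4   # process the joint in sub-batches of 4 utterances
```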
It's finally working!
Installation was fine. Now it’s NeMo time 👍
-------------------------
Conda problems
-------------------------
Numba installation
you are here ------> NeMo installation
-------------------------
Tests
-------------------------
Successful ASR model training
Your numba, CUDA, and PyTorch install seems botched somehow. I'd start from a fresh conda environment, stick to installing everything using conda only, and maybe that will work.
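Once the fresh environment is up, compiling a trivial numba CUDA kernel is a quick way to confirm the whole numba -> NVVM -> PTX path works before involving NeMo (a minimal sketch):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

arr = cuda.to_device(np.zeros(8, dtype=np.float32))
add_one[1, 8](arr)         # compiling and launching this kernel exercises NVVM
print(arr.copy_to_host())  # expect an array of ones if the toolchain is wired up correctly
```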