NeMo: Unable to train fastpitch_finetune with nemo:1.8.0
When I try to run examples/tts/fastpitch_finetune.py with fastpitch_align_v1.05.yml using DDP on multiple GPUs, training crashes immediately with the following trace:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
results = self._run_stage()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
return self._run_train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
self.fit_loop.run()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 231, in advance
self.trainer._call_callback_hooks("on_train_batch_end", batch_end_outputs, batch, batch_idx, **extra_kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1628, in _call_callback_hooks
self._on_train_batch_end(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1660, in _on_train_batch_end
callback.on_train_batch_end(self, self.lightning_module, outputs, batch, batch_idx)
File "/workspace/nemo/nemo/utils/exp_manager.py", line 144, in on_train_batch_end
self._on_batch_end("train_step_timing", pl_module)
File "/workspace/nemo/nemo/utils/exp_manager.py", line 138, in _on_batch_end
pl_module.log(name, self.timer[name], on_step=True, on_epoch=False)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 381, in log
value = apply_to_collection(value, numbers.Number, self.__to_tensor)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
return function(data, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 515, in __to_tensor
return torch.tensor(value, device=self.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "examples/tts/fastpitch_finetune.py", line 41, in <module>
main() # noqa pylint: disable=no-value-for-parameter
File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
_run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "examples/tts/fastpitch_finetune.py", line 37, in main
trainer.fit(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
self._call_and_handle_interrupt(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
self._teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
self.strategy.teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 447, in teardown
super().teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/parallel.py", line 134, in teardown
super().teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 444, in teardown
optimizers_to_device(self.optimizers, torch.device("cpu"))
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/optimizer.py", line 27, in optimizers_to_device
optimizer_to_device(opt, device)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/optimizer.py", line 33, in optimizer_to_device
optimizer.state[p] = apply_to_collection(v, torch.Tensor, move_data_to_device, device)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 107, in apply_to_collection
v = apply_to_collection(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
return function(data, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 354, in move_data_to_device
return apply_to_collection(batch, dtype=dtype, function=batch_to)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
return function(data, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 347, in batch_to
data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
This does not happen if I run on a single GPU.
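Roughly, the two runs differ only in the trainer configuration. The sketch below is shorthand for that difference; the device count and flag names are assumptions based on the PyTorch Lightning 1.6 API, not the literal Hydra config values.

```python
import pytorch_lightning as pl

# Multi-GPU DDP run -- the setup that crashes. The flag names and device count
# here are illustrative shorthand, not the exact trainer overrides from the config.
crashing_trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")

# Single-GPU run -- this trains without any error.
working_trainer = pl.Trainer(accelerator="gpu", devices=1)
```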
About this issue
- State: closed
- Created 2 years ago
- Comments: 25 (14 by maintainers)
After some discussion, we’ve concluded that adam makes more sense as a default optimizer for FastPitch anyway, so I’d suggest sticking with that for now. I’ll push a change to switch it from lamb in the config soon.
We’ll still try to resolve this bug since it shouldn’t be crashing with lamb anyway. I’ve been able to reproduce the error locally.
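Until that config change lands, one possible workaround is to override the optimizer yourself before training. This is only a sketch: it assumes the usual NeMo layout where the optimizer is configured under model.optim.name, and the config path shown is an assumption.

```python
from omegaconf import OmegaConf

# Sketch of a workaround: switch the FastPitch optimizer from lamb to adam before
# the config is handed to the model. Both the config path and the model.optim.name
# key are assumptions based on the usual NeMo config layout.
cfg = OmegaConf.load("examples/tts/conf/fastpitch_align_v1.05.yml")  # path is an assumption
OmegaConf.update(cfg, "model.optim.name", "adam")
print(OmegaConf.to_yaml(cfg.model.optim))
```

The same override should also be expressible as a Hydra command-line argument (model.optim.name=adam) when launching fastpitch_finetune.py, if that is more convenient.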