algorithmic-efficiency: PyTorch Conformer OOMs sometimes
PyTorch Conformer occasionally OOMs.
Description
Traceback:
```
Traceback (most recent call last):
  File "submission_runner.py", line 624, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "submission_runner.py", line 595, in main
    score = score_submission_on_workload(
  File "submission_runner.py", line 520, in score_submission_on_workload
    timing, metrics = train_once(workload, global_batch_size,
  File "submission_runner.py", line 299, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/algorithmic-efficiency/baselines/adamw/pytorch/submission.py", line 99, in update_params
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.51 GiB. GPU 5 has a total capacty of 15.78 GiB of which 765.44 MiB is free. Process 11976 has 15.03 GiB memory in use. Of the allocated memory 6.25 GiB is allocated by PyTorch, and 6.96 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
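The allocator hint at the end of the traceback refers to the PYTORCH_CUDA_ALLOC_CONF environment variable. As a rough illustration only (the 128 MiB threshold below is an arbitrary example value, not something recommended in this issue), the suggested max_split_size_mb setting can be applied like this:

```python
import os

# Must be in the environment before the first CUDA allocation in the process.
# The 128 MiB threshold is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable so the caching allocator sees it

x = torch.randn(1024, 1024, device="cuda")  # first CUDA allocation uses the configured allocator
```

The same setting can equivalently be exported in the shell before launching the torchrun command below.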
Steps to Reproduce
PyTorch version: torch.dev08202023
```
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
  submission_runner.py \
  --framework=pytorch \
  --workload=librispeech_conformer \
  --submission_path=baselines/adamw/pytorch/submission.py \
  --tuning_search_space=baselines/adamw/tuning_search_space.json \
  --data_dir=/data/librispeech \
  --num_tuning_trials=1 \
  --experiment_dir=/experiment_runs \
  --experiment_name=tests/regression_tests/adamw \
  --overwrite=True \
  --save_checkpoints=False \
  --max_global_steps=10 \
  --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab \
  --torch_compile=true
```
About this issue
- State: closed
- Created 10 months ago
- Comments: 18 (14 by maintainers)
Upgrading the GPU driver to 535.104.05 seems to resolve the CUDA OOM, so we will upgrade the drivers on the competition hardware and mark this as resolved.
We also confirmed, per a recommendation from @lessw2020, that on PyTorch 2.1.0 setting the following option resolves the OOM:
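The exact option from the original comment is not captured in this excerpt. As an assumption only, one allocator setting frequently suggested for fragmentation-driven OOMs on PyTorch 2.1.x is expandable_segments; a minimal sketch of enabling it would be:

```python
import os

# Assumption: this is the kind of allocator option the comment refers to; the
# original recommendation is not shown in this excerpt. Requires PyTorch >= 2.1.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # import after setting the variable so the allocator picks it up
```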
We won’t have to use this flag after all with the driver update, but we want to document it here in case we run into issues in the future.
I ran the following script, which trains the Conformer model for 1000 steps and repeats this 10 times to see how often it errors out. All 10 times the model trained successfully without any error. I ran this inside a Docker container built from the Dockerfile on the main branch. @priyakasimbeg, could you please let me know if the following script errors out on the VM where you got the above error?
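The script itself is not included in this excerpt. A minimal sketch of the described procedure, reusing the flags from the torchrun command in "Steps to Reproduce" with --max_global_steps raised to 1000 and repeating the run 10 times, could look like this:

```python
import subprocess

# Sketch only: the exact script from the comment is not shown in this excerpt.
# Flags mirror the torchrun command above, with max_global_steps set to 1000.
CMD = [
    "torchrun", "--standalone", "--nnodes=1", "--nproc_per_node=8",
    "submission_runner.py",
    "--framework=pytorch",
    "--workload=librispeech_conformer",
    "--submission_path=baselines/adamw/pytorch/submission.py",
    "--tuning_search_space=baselines/adamw/tuning_search_space.json",
    "--data_dir=/data/librispeech",
    "--num_tuning_trials=1",
    "--experiment_dir=/experiment_runs",
    "--experiment_name=tests/regression_tests/adamw",
    "--overwrite=True",
    "--save_checkpoints=False",
    "--max_global_steps=1000",
    "--librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab",
    "--torch_compile=true",
]

for run_idx in range(10):
    print(f"Run {run_idx + 1}/10")
    subprocess.run(CMD, check=True)  # raises CalledProcessError if any run fails (e.g. OOMs)
```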
This doesn’t look like it’s OOMing in the optimizer but rather in the backward pass, no?
The fact that it’s nondeterministic is definitely weird… meaning the extra memory can come from anywhere. Otherwise I would guess that activation checkpointing would help here.
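For reference, activation checkpointing in PyTorch trades compute for memory by recomputing a block's intermediate activations during the backward pass instead of storing them. A minimal sketch follows; the wrapper module below is hypothetical and not the workload's actual Conformer implementation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(nn.Module):
    """Hypothetical wrapper that recomputes a block's activations in backward."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activations inside `self.block` are not kept for the backward pass;
        # they are recomputed when gradients are needed, reducing peak memory.
        return checkpoint(self.block, x, use_reentrant=False)
```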