algorithmic-efficiency: PyTorch Conformer OOMs sometimes

The PyTorch Conformer workload occasionally OOMs.

Description

Traceback:

Traceback (most recent call last):
  File "submission_runner.py", line 624, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "submission_runner.py", line 595, in main
    score = score_submission_on_workload(
  File "submission_runner.py", line 520, in score_submission_on_workload
    timing, metrics = train_once(workload, global_batch_size,
  File "submission_runner.py", line 299, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/algorithmic-efficiency/baselines/adamw/pytorch/submission.py", line 99, in update_params
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.51 GiB. GPU 5 has a total capacty of 15.78 GiB of which 765.44 MiB is free. Process 11976 has 15.03 GiB memory in use. Of the allocated memory 6.25 GiB is allocated by PyTorch, and 6.96 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
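
For reference, the allocator hint at the end of the error message can be applied through the PYTORCH_CUDA_ALLOC_CONF environment variable before the process first touches the GPU. The sketch below sets it from Python; the 128 MiB value is only an illustrative choice, not something recommended in this issue.

import os

# The variable is read when the CUDA caching allocator initializes, so it must be
# set before the first CUDA allocation. max_split_size_mb limits how large a cached
# block the allocator will split, which can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var, just to be safe

if torch.cuda.is_available():
    _ = torch.randn(1024, 1024, device="cuda")  # first allocation picks up the setting
    print(torch.cuda.memory_summary(abbreviated=True))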

Steps to Reproduce

Pytorch version: torch.dev08202023

torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 submission_runner.py --framework=pytorch --workload=librispeech_conformer --submission_path=baselines/adamw/pytorch/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --data_dir=/data/librispeech --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=tests/regression_tests/adamw --overwrite=True --save_checkpoints=False --max_global_steps=10 --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab --torch_compile=true

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Comments: 18 (14 by maintainers)

Most upvoted comments

Upgrading the GPU driver to 535.104.05 seems to resolve the CUDA OOM, so we will upgrade the drivers on the competition hardware and mark this as resolved.

We also confirmed, per a recommendation from @lessw2020, that on PyTorch 2.1.0 setting the following option resolves the OOM:

import torch

if torch.cuda.is_available():
    # Use expandable segments in the CUDA caching allocator to reduce fragmentation.
    torch.cuda.memory._set_allocator_settings('expandable_segments:True')
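
For what it's worth, on PyTorch 2.1 the same behavior can also be requested through the allocator environment variable (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), which avoids the private _set_allocator_settings call; either way, it has to take effect before the first CUDA allocation.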

We won't need this flag after all with the driver update, but we want to document it in case we run into issues in the future.

I ran the following script, which trains the Conformer model for 1000 steps and repeats that 10 times to see how often it errors out. All 10 times, the model trained successfully without any error. I ran this inside a Docker container built from the Dockerfile on the main branch. @priyakasimbeg, could you please let me know if the following script errors out on the VM where you got the above error?

import os

command = """torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 submission_runner.py --framework=pytorch --workload=librispeech_conformer --submission_path=baselines/adamw/pytorch/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --data_dir=/data/work_dir/data/ --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=tests/regression_tests/adamw --overwrite=True --save_checkpoints=False --max_global_steps=1000 --librispeech_tokenizer_vocab_path=/data/spm_model.vocab --torch_compile=true"""

# Launch training 10 times and print each exit code (0 means the run finished without error).
for i in range(10):
    code = os.system(command)
    print(code)

This doesn't look like it's OOMing in the optimizer but rather in the backward pass, no?

The fact that it's nondeterministic is definitely weird, meaning the extra memory can come from anywhere. Otherwise I would guess that activation checkpointing would help here.
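
On the activation-checkpointing suggestion, here is a minimal sketch with torch.utils.checkpoint, assuming a generic stack of blocks; the CheckpointedEncoder module and sizes below are illustrative and not taken from the algorithmic-efficiency codebase.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedEncoder(nn.Module):
    """Toy encoder that recomputes each block's activations during backward."""

    def __init__(self, num_blocks=4, dim=512):
        super().__init__()
        # Stand-ins for conformer blocks; the real workload's modules would go here.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_blocks)])

    def forward(self, x):
        for block in self.blocks:
            # Intermediate activations inside `block` are not stored; they are
            # recomputed in the backward pass, trading compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x


model = CheckpointedEncoder()
out = model(torch.randn(8, 512, requires_grad=True))
out.sum().backward()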