algorithmic-efficiency: Pytorch Criteo CUDA error

Branch: dev
Test link: https://github.com/mlcommons/algorithmic-efficiency/actions/runs/5416731116/jobs/9846848568
For details, expand criteo_pytorch and then expand Run Containerized Workload.

Description

The Criteo 1TB PyTorch workload OOMs during training.

Traceback (truncated):

ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 3; 15.78 GiB total capacity; 12.14 GiB already allocated; 307.44 MiB free; 14.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
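
As a first mitigation, the allocator hint from the error message itself can be tried. Below is a minimal sketch of setting it from Python; the 256 MiB split size is an arbitrary example value, and exporting PYTORCH_CUDA_ALLOC_CONF in the shell or launch command is equivalent.

import os

# The CUDA caching allocator reads this variable when it is initialized, i.e.
# at the first CUDA allocation, so it must be set before any tensor is placed
# on the GPU. The 256 MiB value is only an example, not a tuned setting.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

import torch

x = torch.zeros(1, device="cuda")  # first CUDA allocation; the config is now in effect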

Steps to Reproduce

On kasimbeg-3

docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_dev

docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_dev -d criteo1tb -f pytorch -s baselines/adamw/pytorch/submission.py -w criteo1tb -t baselines/adamw/tuning_search_space.json -e test_today/adamw -m 10 -c False -o True -r false

About this issue

  • State: closed
  • Created a year ago
  • Comments: 25 (19 by maintainers)

Most upvoted comments

I tracked down the issue and will send a fix soon (with a detailed explanation of why we had this OOM issue).

I think we should do some simple ablations first, since it did work recently without issues. @pomonam Have you tried running the code from the dev branch with PyTorch 1.13.1 instead of 2.0.1? That would tell us whether the issue is really just due to the PyTorch update.

Hi @janeyx99, thank you for the comment! We did try setting fused=False, and this solved the OOM issue for some other workloads, but not for this one. In my previous comment I thought I had solved the issue, but after creating another Docker container and rerunning the code, the issue persists. I am still debugging, but I thought it would be useful to share what I have found so far and get feedback (in case you have any).
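
For reference, this is roughly how the AdamW implementation path is selected at construction time; a minimal sketch with placeholder hyperparameters, not the actual baseline submission code.

import torch

model = torch.nn.Linear(16, 16).cuda()  # stand-in for the real workload model

# foreach=False forces the per-parameter _single_tensor_adamw path that shows
# up in the traceback below; the default multi-tensor path goes through
# torch._foreach_* ops and materializes extra parameter-sized temporaries
# (e.g. the denom buffer in the first traceback). fused=False disables the
# fused CUDA kernel, which is the setting mentioned above. lr and weight_decay
# here are placeholders, not values from the tuning search space.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2,
    foreach=False,
    fused=False,
)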

The code runs fine for three update steps, but we hit OOM at the fourth update step:

Traceback (most recent call last):
  File "submission_runner.py", line 644, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "submission_runner.py", line 615, in main
    score = score_submission_on_workload(
  File "submission_runner.py", line 538, in score_submission_on_workload
    timing, metrics = train_once(workload, global_batch_size,
  File "submission_runner.py", line 300, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/algorithmic-efficiency/baselines/adamw/pytorch/submission.py", line 115, in update_params
    optimizer_state['optimizer'].step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 440, in _single_tensor_adamw
    denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 15.78 GiB total capacity; 12.10 GiB already allocated; 285.44 MiB free; 14.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  1. Adding torch.cuda.empty_cache() after each gradient step solves the OOM issue. However, I am guessing this is not the ideal solution, as it can significantly slow down training (a minimal sketch of this instrumentation follows the list).
  2. Here is the output of torch.cuda.memory_summary() just before the third optimizer_state['optimizer'].step(), which does not raise OOM:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  10344 MiB |  14441 MiB |  45674 MiB |  35330 MiB |
|       from large pool |  10338 MiB |  14435 MiB |  45632 MiB |  35294 MiB |
|       from small pool |      5 MiB |      6 MiB |     42 MiB |     36 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  10344 MiB |  14441 MiB |  45674 MiB |  35330 MiB |
|       from large pool |  10338 MiB |  14435 MiB |  45632 MiB |  35294 MiB |
|       from small pool |      5 MiB |      6 MiB |     42 MiB |     36 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  10342 MiB |  14438 MiB |  45661 MiB |  35319 MiB |
|       from large pool |  10336 MiB |  14432 MiB |  45619 MiB |  35283 MiB |
|       from small pool |      5 MiB |      6 MiB |     42 MiB |     36 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  14526 MiB |  14526 MiB |  18218 MiB |   3692 MiB |
|       from large pool |  14518 MiB |  14518 MiB |  18210 MiB |   3692 MiB |
|       from small pool |      8 MiB |      8 MiB |      8 MiB |      0 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  85786 KiB |   3622 MiB |  23010 MiB |  22926 MiB |
|       from large pool |  85699 KiB |   3620 MiB |  22965 MiB |  22881 MiB |
|       from small pool |     86 KiB |      3 MiB |     44 MiB |     44 MiB |
|---------------------------------------------------------------------------|
| Allocations           |      80    |      98    |     606    |     526    |
|       from large pool |      22    |      41    |     263    |     241    |
|       from small pool |      58    |      76    |     343    |     285    |
|---------------------------------------------------------------------------|
| Active allocs         |      80    |      98    |     606    |     526    |
|       from large pool |      22    |      41    |     263    |     241    |
|       from small pool |      58    |      76    |     343    |     285    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      15    |      28    |      31    |      16    |
|       from large pool |      11    |      25    |      27    |      16    |
|       from small pool |       4    |       4    |       4    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       9    |      19    |     204    |     195    |
|       from large pool |       6    |      12    |      96    |      90    |
|       from small pool |       3    |       8    |     108    |     105    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

And this is the output just before the fourth optimizer_state['optimizer'].step(), which raises the error message above:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  10343 MiB |  14441 MiB |  58834 MiB |  48490 MiB |
|       from large pool |  10337 MiB |  14435 MiB |  58778 MiB |  48440 MiB |
|       from small pool |      5 MiB |      6 MiB |     55 MiB |     50 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  10343 MiB |  14441 MiB |  58834 MiB |  48490 MiB |
|       from large pool |  10337 MiB |  14435 MiB |  58778 MiB |  48440 MiB |
|       from small pool |      5 MiB |      6 MiB |     55 MiB |     50 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  10342 MiB |  14438 MiB |  58816 MiB |  48474 MiB |
|       from large pool |  10336 MiB |  14432 MiB |  58761 MiB |  48424 MiB |
|       from small pool |      5 MiB |      6 MiB |     55 MiB |     49 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  14526 MiB |  14526 MiB |  18218 MiB |   3692 MiB |
|       from large pool |  14518 MiB |  14518 MiB |  18210 MiB |   3692 MiB |
|       from small pool |      8 MiB |      8 MiB |      8 MiB |      0 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory |   2134 MiB |   3623 MiB |  33192 MiB |  31057 MiB |
|       from large pool |   2132 MiB |   3620 MiB |  33129 MiB |  30996 MiB |
|       from small pool |      2 MiB |      3 MiB |     63 MiB |     60 MiB |
|---------------------------------------------------------------------------|
| Allocations           |      80    |      98    |     806    |     726    |
|       from large pool |      22    |      41    |     347    |     325    |
|       from small pool |      58    |      76    |     459    |     401    |
|---------------------------------------------------------------------------|
| Active allocs         |      80    |      98    |     806    |     726    |
|       from large pool |      22    |      41    |     347    |     325    |
|       from small pool |      58    |      76    |     459    |     401    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      15    |      28    |      31    |      16    |
|       from large pool |      11    |      25    |      27    |      16    |
|       from small pool |       4    |       4    |       4    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      13    |      20    |     282    |     269    |
|       from large pool |       8    |      13    |     128    |     120    |
|       from small pool |       5    |       8    |     154    |     149    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

One thing I am confused about is that Allocated memory is smaller (although Non-releasable allocs increased).
  3. The issue persists with and without torch.compile (note that we are using backend=aot_eager). The difference is that with torch.compile we get OOM after the third update (only on RANK=0), whereas without torch.compile we get OOM after the first update (on several RANKs). The line that causes the OOM is the same.
  4. When I was testing yesterday, I did not get OOM for the first ten update steps (so we might really be right at the boundary).
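
A minimal sketch of the instrumentation behind points 1 and 2 above, using a hypothetical instrumented_step helper (in the actual submission the step happens inside update_params):

import torch
import torch.distributed as dist

def instrumented_step(optimizer, step_idx):
    """Optimizer step wrapped with the two work-arounds discussed above."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        # Point 2: dump the allocator state right before the update.
        print(f"step {step_idx}:\n{torch.cuda.memory_summary(device=0)}")
    optimizer.step()
    # Point 1: release cached-but-unused blocks after every step. This avoids
    # the OOM here, but forces extra cudaMalloc/cudaFree traffic later on.
    torch.cuda.empty_cache()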

I will continue looking at this, but if you have any recommendations, it would be helpful! Thank you again.

Update: I checked 1) that I can reproduce the same error with reference_submission_tests.py, and 2) the largest batch size with which I can successfully run the workload. It turns out to be 32768, exactly 1/8th of the previous training batch size, which is a bit suspicious since we are using 8 GPUs.

Also, the OOM error occurs during the 2nd optimizer update step.
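
For context, the batch-size arithmetic behind that suspicion; the 262144 global batch size is inferred from the 1/8th statement above, and the closing comment is only one possible reading, not a confirmed diagnosis.

# Numbers inferred from the comment above, not read out of the code.
previous_global_batch_size = 262_144   # 8 * 32768
largest_working_batch_size = 32_768    # largest size that ran without OOM
num_gpus = 8

per_device_share = previous_global_batch_size // num_gpus
assert per_device_share == largest_working_batch_size
# If each of the 8 GPUs were handed the full global batch instead of a
# 1/num_gpus shard, the per-device memory footprint would be roughly 8x
# larger, which would be consistent with the observed OOM.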

Just updating here that Jane has access to our VMs now.

To quickly reproduce the Criteo bug, I recommend using one of our pre-built Docker images:

  1. Pull the Docker container with torch nightly (torch.dev08202023): docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_pytorch_diagnosing
  2. Run the container in the background: docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_pytorch_diagnosing -a true (this will print out a container ID).
  3. Bash into the container: docker exec -it <container_id> /bin/bash
  4. Run the submission runner on Criteo: torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 submission_runner.py --framework=pytorch --workload=criteo1tb --submission_path=baselines/adamw/pytorch/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --data_dir=/data/criteo1tb --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=criteo_pytorch_oom_debugging --overwrite=True --save_checkpoints=False --max_global_steps=10 --torch_compile=true

@janeyx99 I think our Criteo data download and setup fixes are still in progress. In the meantime, I can add you to our external GCP project and set you up with a VM to help debug this OOM. Let me know if that sounds like a good plan to you.