algorithmic-efficiency: Pytorch Criteo CUDA error
Branch: dev
Test link: https://github.com/mlcommons/algorithmic-efficiency/actions/runs/5416731116/jobs/9846848568
For details, expand criteo_pytorch and then expand Run Containerized Workload.
Description
Criteo Pytorch OOMs.
Traceback:
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 171, in step
adamw(
File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 321, in adamw
func(
File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 3; 15.78 GiB total capacity; 12.14 GiB already allocated; 307.44 MiB free; 14.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
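The allocator message above suggests tuning `max_split_size_mb` when reserved memory far exceeds allocated memory. A minimal sketch of one way to try that, assuming the setting is applied before the first CUDA allocation; the 512 MiB value is an arbitrary example, not a recommendation from this issue:

```python
import os

# Assumption: the caching allocator reads PYTORCH_CUDA_ALLOC_CONF when it is
# first initialized, so this must run before any CUDA tensors are created.
# max_split_size_mb:512 is only an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after the env var is set

x = torch.randn(1024, 1024, device="cuda")  # first CUDA allocation picks up the setting
print(torch.cuda.memory_summary())
```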
Steps to Reproduce
On kasimbeg-3
docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_dev
docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_dev -d criteo1tb -f pytorch -s baselines/adamw/pytorch/submission.py -w criteo1tb -t baselines/adamw/tuning_search_space.json -e test_today/adamw -m 10 -c False -o True -r false
About this issue
- State: closed
- Created a year ago
- Comments: 25 (19 by maintainers)
I tracked down the issue and will send a fix soon (with a detailed explanation of why we had this OOM issue).
I think we should do some simple ablations first, since it did work recently without issues. @pomonam Have you tried running the code from the `dev` branch with PyTorch 1.13.1 instead of 2.0.1, to check whether the issue is actually just due to the PyTorch update?

Hi @janeyx99, thank you for the comment! We did try setting `fused=False`, and this solved the OOM issue for some other workloads, but not for this one. In my previous comment I thought I had solved the issue, but when creating another Docker container and rerunning the code, it seems like the issue persists. I am still debugging this, but I thought it would be useful to share what I have found so far and get feedback (in case you have any). The code runs fine for three update steps, but we are facing OOM at the fourth update step:

1. Calling `torch.cuda.empty_cache()` after each gradient step solves the OOM issue. However, I am guessing this is not the ideal solution, as it can significantly slow down the training.
2. I compared the output of `torch.cuda.memory_summary()` just before the third `optimizer_state['optimizer'].step()` (which does not raise OOM) with the output just before the fourth `optimizer_state['optimizer'].step()` (which gives the error message above). One thing I am confused about is that `Allocated memory` is smaller before the fourth step (although `Non-releasable allocs` increased).
3. The issue persists with and without `torch.compile` (note that we are using `backend=aot_eager`). The difference is that when we do `torch.compile`, we get OOM after the third update (only on RANK=0), and when we don't do `torch.compile`, we get OOM after the first update (on several RANKs). The line that causes the OOM is the same.
4. When I was testing yesterday, I did not get OOM for the first ten update steps (so we might be really at the boundary).

I will continue looking at this, but if you have any recommendations, it would be helpful! Thank you again.
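A minimal sketch of the diagnostic pattern described above, assuming a generic PyTorch training loop; `model`, `optimizer`, and `batch` are placeholders, not the actual AlgoPerf submission code:

```python
import torch

def debug_step(model, optimizer, batch, step, inspect_steps=(2, 3)):
    """One training step with the memory instrumentation described above."""
    loss = model(batch).sum()  # placeholder forward pass / loss
    loss.backward()

    if step in inspect_steps:
        # Snapshot allocator state just before the optimizer update,
        # mirroring the memory_summary() comparison in the comment above.
        print(f"--- memory summary before optimizer step {step} ---")
        print(torch.cuda.memory_summary())

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # Workaround mentioned above: releasing cached blocks after each update
    # avoids the OOM here, but can slow training down noticeably.
    torch.cuda.empty_cache()
```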
Update: I checked 1) that I can reproduce the same error with `reference_submission_tests.py`, and 2) what the largest batch size is that I can successfully run the workload with. It turns out to be `32768`, which is exactly 1/8th of the previous training batch size – a bit suspicious since we are using 8 GPUs. Also, the OOM error occurs during the 2nd optimizer update step.
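The arithmetic behind the "exactly 1/8th" observation above, as a quick sanity check; the full batch size here is inferred from the comment (32768 × 8), not read from the workload config:

```python
# Assumed numbers, taken from the comment above rather than the codebase.
largest_working_batch = 32768
num_gpus = 8

full_batch_size = largest_working_batch * num_gpus  # 262144, the previous training batch size
per_device_share = full_batch_size // num_gpus
print(per_device_share)  # 32768 -- exactly the largest batch that still fits on one GPU
```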
Just updating here that Jane has access to our VMs now.
To quickly reproduce the Criteo bug I recommend using one of our pre-built docker images:
docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_pytorch_diagnosing

docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_pytorch_diagnosing -a true

This will print out a container ID.

docker exec -it <container_id> /bin/bash

torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 submission_runner.py --framework=pytorch --workload=criteo1tb --submission_path=baselines/adamw/pytorch/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --data_dir=/data/criteo1tb --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=criteo_pytorch_oom_debugging --overwrite=True --save_checkpoints=False --max_global_steps=10 --torch_compile=true

@janeyx99 I think our criteo data download and setup fixes are still in progress. In the meantime I can add you to our external GCP project and set you up with a VM to help debug this OOM. Let me know if that sounds like a good idea to you.