pixelsplat: Weird error: out-of-memory error when running on V100 GPUs with a smaller batch size
Hi! I ran into an error that I cannot understand. I can run the code on an A10 (22 GB) and an A100 (40 GB) with a smaller batch size, but I cannot run it on a V100 (32 GB).
The error is weird:
Error executing job with overrides: ['+experiment=re10k']
Traceback (most recent call last):
File "/group/30042/ozhengchen/scene_gen/pixelsplat/src/main.py", line 123, in train
trainer.fit(model_wrapper, datamodule=data_module, ckpt_path=checkpoint_path)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage self._run_sanity_check()
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1062, in _run_sanity_check
val_loop.run()
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 134, in run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 391, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_args)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 402, in validation_step
return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 633, in __call__
wrapper_output = wrapper_module(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 626, in wrapped_forward
out = method(*_args, **_kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
return fn(*args, **kwargs)
File "/group/30042/ozhengchen/scene_gen/pixelsplat/./src/model/model_wrapper.py", line 212, in validation_step
output_probabilistic = self.decoder.forward(
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/jaxtyping/_decorator.py", line 409, in wrapped_fn
out = fn(*args, **kwargs)
File "/group/30042/ozhengchen/scene_gen/pixelsplat/./src/model/decoder/decoder_splatting_cuda.py", line 46, in forward
color = render_cuda(
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/jaxtyping/_decorator.py", line 409, in wrapped_fn
out = fn(*args, **kwargs)
File "/group/30042/ozhengchen/scene_gen/pixelsplat/./src/model/decoder/cuda_splatting.py", line 117, in render_cuda
image, radii = rasterizer(
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/diff_gaussian_rasterization/__init__.py", line 210, in forward
return rasterize_gaussians(
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/diff_gaussian_rasterization/__init__.py", line 32, in rasterize_gaussians
return _RasterizeGaussians.apply(
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/group/30042/ozhengchen/ft_local/anaconda3/envs/pixelsplat/lib/python3.10/site-packages/diff_gaussian_rasterization/__init__.py", line 92, in forward
num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer = _C.rasterize_gaussians(*args)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 131071.75 GiB. GPU 0 has a total capacty of 31.75 GiB of which 27.95 GiB is free. Process 88117 has 3.79 GiB memory in use. Of the allocated memory 1.66 GiB is allocated by PyTorch, and 340.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Amazingly, it says that 131071.75 GiB needs to be allocated. I do not know what kind of bug causes this when running the code on a V100 GPU.
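The requested size is suspiciously close to a power of two, so I suspect the size handed to the allocator is garbage rather than a genuine memory requirement. A rough back-of-the-envelope check (illustrative only):

# Illustration: convert the reported request into bytes.
# 131071.75 GiB is essentially 2**17 GiB (= 2**47 bytes, roughly 140 TB),
# far beyond any single GPU, so the size itself looks bogus rather than a
# real allocation demand.
requested_gib = 131071.75
print(f"{requested_gib * 2**30:.3e} bytes")       # ~1.407e+14 bytes
print(f"2**17 GiB = {2**17} GiB = {2**47} bytes")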
About this issue
- State: closed
- Created 6 months ago
- Reactions: 1
- Comments: 15 (4 by maintainers)
@kevinYitshak Thanks for your advice! torch=2.1.2+cu118 works in my V100 environment.
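For anyone hitting the same thing, a quick way to confirm which torch build and GPU the environment actually picks up (just the standard torch calls, nothing pixelsplat-specific):

import torch

print(torch.__version__)                    # expect something like "2.1.2+cu118"
print(torch.version.cuda)                   # CUDA toolkit the wheel was built against
print(torch.cuda.get_device_name(0))        # e.g. "Tesla V100-SXM2-32GB"
print(torch.cuda.get_device_capability(0))  # V100 is compute capability (7, 0)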
I only modified the batch size to 2 (or 1) and the dataset path to my own path. The command line is unchanged:
python src/main.py +experiment=re10k
I currently cannot figure out an appropriate configuration for the V100.
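The batch size can probably also be overridden directly on the command line via Hydra instead of editing the config file. Assuming the key is data_loader.train.batch_size (please check the config files in your checkout for the exact name), something like the following should work:

python src/main.py +experiment=re10k data_loader.train.batch_size=1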