stable-dreamfusion: 3D object generation is slower and produces NaNs with the --cuda_ray flag
Description
Hi, I am using an RTX 2080 16 GB (laptop version), and while generating a 3D object I get NaNs almost immediately (during the first epoch). Training is also slower: with --cuda_ray it runs at around 2.8 it/s, while with PyTorch raymarching it is around 3.1 it/s.
I installed everything following the description in the readme.md
and had no issues.
Steps to Reproduce
Execute the script:
python main.py --text "a beef hamburger on a ceramic plate" --workspace trial -O
Then I get the following in the console:
==> Start Training trial Epoch 1, lr=0.050000 ...
0% 0/100 [00:00<?, ?it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 1% 1/100 [00:00<01:24, 1.17it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 2% 2/100 [00:01<01:00, 1.62it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 3% 3/100 [00:01<00:53, 1.83it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 4% 4/100 [00:02<00:49, 1.94it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 5% 5/100 [00:02<00:47, 2.01it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 6% 6/100 [00:03<00:45, 2.05it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 7% 7/100 [00:03<00:44, 2.08it/s]NaN or Inf found in input tensor.
After changing only line 92 of main.py to:
opt.cuda_ray = False
I get the following in the console:
==> Start Training trial Epoch 1, lr=0.050000 ...
loss=0.0000 (0.0000), lr=0.050000: : 100% 100/100 [00:31<00:00, 3.14it/s]
==> Finished Epoch 1.
0% 0/5 [00:00<?, ?it/s]++> Evaluate trial_sd_xffa at epoch 1 ...
loss=0.0000 (0.0000): : 100% 5/5 [00:02<00:00, 1.76it/s]
++> Evaluate epoch 1 Finished.
==> Start Training trial_sd_xffa Epoch 2, lr=0.050000 ...
loss=0.0000 (0.0000), lr=0.050000: : 100% 100/100 [00:32<00:00, 3.12it/s]
==> Finished Epoch 2.
With this change I am able to generate a nice 3D object.
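For reference, a minimal NaN tripwire of the kind sketched below (generic PyTorch, not taken from main.py; loader, model and optimizer are placeholders) would make the first non-finite loss fail loudly instead of letting the progress bar fill up with nan:

import torch

# Generic sketch, not stable-dreamfusion code: abort on the first non-finite loss
# so the failing step is obvious. set_detect_anomaly slows training down but
# reports which backward op produced the NaN/Inf.
torch.autograd.set_detect_anomaly(True)

for step, data in enumerate(loader):   # placeholder data loader
    optimizer.zero_grad()
    loss = model(data)                 # placeholder: assumes the model returns a scalar loss
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss at step {step}: {loss.item()}")
    loss.backward()
    optimizer.step()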
Expected Behavior
No NaNs during training, and a training speedup when using --cuda_ray.
Environment
Ubuntu 20.04, conda environment, Python 3.10, PyTorch 1.13.1, CUDA 11.7.1, cudnn 8.5.0
About this issue
- State: closed
- Created a year ago
- Comments: 16 (6 by maintainers)
The speed issue can be partly explained: during training, NeRF rendering is not the main speed bottleneck (the stable-diffusion denoising step is). I get similar training speed (5 it/s) on a V100 with cuda_ray on or off, but the cuda-ray mode should be faster in rendering alone (5 it/s vs 1 it/s). Besides, the non-cuda-ray mode currently samples only 64+32 points per ray, which is relatively few compared to cuda-ray mode (at most 1024, but on average ~100 points per ray).

I also updated the NVIDIA drivers to the latest 525.x.x and the NaN issue is gone. It no longer occurs in either mode, --cuda_ray or PyTorch raymarching. Moreover, training is now faster with --cuda_ray: I achieve 3.5 it/s (vs. 2.8 it/s on the older driver), which is similar to PyTorch raymarching.

On my side at least, the NaNs were caused in run_cuda because raymarching.composite_rays_train(sigmas, rgbs, ts, rays, T_thresh) returned an empty weights tensor. I'm fairly sure it is caused by the old NVIDIA 510.x driver installed on that machine. I'm now trying to debug the non-CUDA, FP32 code path, since it doesn't produce the results I expect in my scenario. What NVIDIA driver do you have in each environment? I think older versions might be the root cause: it works fine for me on 525.x but fails on 510.x.
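For anyone hitting the same symptom, a small guard like the sketch below can turn the silent NaN into an explicit error. It is hypothetical debugging code, not part of the repo: the exact return values of composite_rays_train differ between versions, so I only assume that a weights tensor is among the outputs and that sigmas, rgbs, ts, rays and T_thresh are already in scope at the run_cuda call site.

import torch
import raymarching  # the CUDA extension bundled with stable-dreamfusion

# Hypothetical guard around the call that returned an empty weights tensor in my case.
outputs = raymarching.composite_rays_train(sigmas, rgbs, ts, rays, T_thresh)
weights = outputs[0]  # assumption: weights is the first element of the returned tuple

if weights.numel() == 0 or not torch.isfinite(weights).all():
    raise RuntimeError(
        f"composite_rays_train produced a degenerate weights tensor "
        f"(numel={weights.numel()}); this was seen with NVIDIA driver 510.x "
        "and went away after upgrading to 525.x."
    )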
I updated the A100's venv to use the exact same versions of Python (3.9.13) and pip packages as my local RTX 4090, but unfortunately that didn't resolve the problem. I'm starting to wonder whether the problem is specific to the NVIDIA driver (510.73.08).
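Since the driver version keeps coming up, here is a quick way to record it per environment (a standard nvidia-smi query wrapped in Python; nothing repo-specific):

import subprocess

# Print the installed NVIDIA driver version (one line per GPU), e.g. "510.73.08".
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"NVIDIA driver: {driver}")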