stable-dreamfusion: 3D object generation is slower and produces NaNs with the --cuda_ray flag
Description
Hi, I am using an RTX 2080 16 GB (laptop version), and while generating a 3D object I get NaNs almost immediately (during the first epoch). Training is also slower: with --cuda_ray it runs at around 2.8 it/s, while with PyTorch raymarching it is around 3.1 it/s.
I installed everything following the description in the readme.md
and had no issues.
Steps to Reproduce
Execute the script:
python main.py --text "a beef hamburger on a ceramic plate" --workspace trial -O
Then I get the following in the console:
==> Start Training trial Epoch 1, lr=0.050000 ...
0% 0/100 [00:00<?, ?it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 1% 1/100 [00:00<01:24, 1.17it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 2% 2/100 [00:01<01:00, 1.62it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 3% 3/100 [00:01<00:53, 1.83it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 4% 4/100 [00:02<00:49, 1.94it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 5% 5/100 [00:02<00:47, 2.01it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 6% 6/100 [00:03<00:45, 2.05it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 7% 7/100 [00:03<00:44, 2.08it/s]NaN or Inf found in input tensor.
After changing only line 92 of main.py to:
opt.cuda_ray = False
I get the following in the console:
==> Start Training trial Epoch 1, lr=0.050000 ...
loss=0.0000 (0.0000), lr=0.050000: : 100% 100/100 [00:31<00:00, 3.14it/s]
==> Finished Epoch 1.
0% 0/5 [00:00<?, ?it/s]++> Evaluate trial_sd_xffa at epoch 1 ...
loss=0.0000 (0.0000): : 100% 5/5 [00:02<00:00, 1.76it/s]
++> Evaluate epoch 1 Finished.
==> Start Training trial_sd_xffa Epoch 2, lr=0.050000 ...
loss=0.0000 (0.0000), lr=0.050000: : 100% 100/100 [00:32<00:00, 3.12it/s]
==> Finished Epoch 2.
With this change I am able to generate a nice 3D object.
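For reference, a minimal NaN tripwire of the kind sketched below (generic PyTorch, not taken from main.py; loader, model and optimizer are placeholders) would make the first non-finite loss fail loudly instead of letting the progress bar fill up with nan:

import torch

# Generic sketch, not stable-dreamfusion code: abort on the first non-finite loss
# so the failing step is obvious. set_detect_anomaly slows training down but
# reports which backward op produced the NaN/Inf.
torch.autograd.set_detect_anomaly(True)

for step, data in enumerate(loader):   # placeholder data loader
    optimizer.zero_grad()
    loss = model(data)                 # placeholder: assumes the model returns a scalar loss
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss at step {step}: {loss.item()}")
    loss.backward()
    optimizer.step()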
Expected Behavior
No NaNs during training, and a training speedup when using --cuda_ray.
Environment
Ubuntu 20.04, conda environment, Python 3.10, PyTorch 1.13.1, CUDA 11.7.1, cudnn 8.5.0
About this issue
- State: closed
- Created a year ago
- Comments: 16 (6 by maintainers)
The speed issue can be partly explained: during training, NeRF rendering is not the main speed bottleneck (the stable-diffusion denoising step is). I get similar training speed (5 it/s) on a V100 with cuda_ray on or off, but the cuda-ray mode should be faster in rendering alone (5 it/s vs 1 it/s). Besides, the non-cuda-ray mode currently samples only 64+32 points per ray, which is relatively few compared to cuda-ray mode (at most 1024, but on average ~100 points per ray).

I also updated the NVIDIA drivers to the latest 525.x.x and the NaN issue is gone. It no longer occurs in either mode, --cuda_ray or PyTorch raymarching. Moreover, training is now faster with --cuda_ray: I achieve 3.5 it/s (vs. 2.8 it/s on the older driver), which is similar to PyTorch raymarching.

On my side at least, the NaNs were caused in run_cuda because raymarching.composite_rays_train(sigmas, rgbs, ts, rays, T_thresh) returned an empty weights tensor. I'm fairly sure it is caused by the old NVIDIA 510.x driver installed on that machine. I'm now trying to debug the non-CUDA, FP32 code path, since it doesn't produce the results I expect in my scenario. What NVIDIA driver do you have in each environment? I think older versions might be the root cause: it works fine for me on 525.x but fails on 510.x.
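For anyone hitting the same symptom, a small guard like the sketch below can turn the silent NaN into an explicit error. It is hypothetical debugging code, not part of the repo: the exact return values of composite_rays_train differ between versions, so I only assume that a weights tensor is among the outputs and that sigmas, rgbs, ts, rays and T_thresh are already in scope at the run_cuda call site.

import torch
import raymarching  # the CUDA extension bundled with stable-dreamfusion

# Hypothetical guard around the call that returned an empty weights tensor in my case.
outputs = raymarching.composite_rays_train(sigmas, rgbs, ts, rays, T_thresh)
weights = outputs[0]  # assumption: weights is the first element of the returned tuple

if weights.numel() == 0 or not torch.isfinite(weights).all():
    raise RuntimeError(
        f"composite_rays_train produced a degenerate weights tensor "
        f"(numel={weights.numel()}); this was seen with NVIDIA driver 510.x "
        "and went away after upgrading to 525.x."
    )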
I updated the A100's venv to use the exact same versions of Python (3.9.13) and pip packages as my local RTX 4090, but unfortunately that didn't resolve the problem. I'm starting to wonder whether the problem is specific to the NVIDIA driver (510.73.08).
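Since the driver version keeps coming up, here is a quick way to record it per environment (a standard nvidia-smi query wrapped in Python; nothing repo-specific):

import subprocess

# Print the installed NVIDIA driver version (one line per GPU), e.g. "510.73.08".
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"NVIDIA driver: {driver}")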