KPConv: Problems during training.

Hi, thanks for sharing your code. I have tried it on my own dataset. Initially everything goes well, but after several epochs the training suddenly breaks down (the accuracy becomes 1 and the loss becomes 0). I use TF 1.12.0 with CUDA 9.0 and cuDNN 7.1.4.

# conda list | grep tensorflow
tensorflow-estimator      1.13.0                     py_0    anaconda
tensorflow-gpu            1.12.0                   pypi_0    pypi
tensorflow-tensorboard    0.4.0                    pypi_0    pypi

Have you met this kind of problem? Another potential issue is that the training sometimes takes 4400 MB of GPU memory (as reported by nvidia-smi), but sometimes more than 7000 MB, even though I do not change the batch size or the network architecture. I am pretty confused by these problems. Could you give me some advice?
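In case it clarifies the memory question: the only setting I could think of checking is how the TF 1.x session allocates GPU memory. Below is a minimal sketch of what I mean; it is a generic TF 1.x hint, not anything specific to the KPConv code.

import tensorflow as tf

# By default TF 1.x reserves (almost) all free GPU memory up front, so the
# number reported by nvidia-smi does not necessarily track what the network
# really needs. With allow_growth the allocation grows with the actual
# batches, which can also make the footprint vary between runs on
# variable-size point clouds.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # ... build and run the training graph as usual ...
    pass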

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments

Hi,

I had some time to dig into this problem and it seems that CUDA10 is not working correctly with RTX 2080Ti GPUs. Here is what I found:

Tested configurations

  • CUDA9-TF1.12 / GTX 1080Ti => No bug
  • CUDA10-TF1.13 / GTX 1080Ti => No bug
  • CUDA9-TF1.12 / RTX 2080Ti => No bug
  • CUDA10-TF1.13 / RTX 2080Ti => Bug appears only in this configuration

Origin of the bug

I tracked down the NaN values in my code and found that they appear after a tf.matmul operation: https://github.com/HuguesTHOMAS/KPConv/blob/5f9cecae72c7a22cc4247b7852eee707ecab8fcd/kernels/convolution_ops.py#L240

Before the NaNs appear, I noticed some weird values higher than 1e10. If you print the two matrices that are multiplied and the result matrix, you will see that the result is completely wrong. This seems to be caused by an internal CUDA bug. At some point one of these mistakes leads to a value so high that it becomes NaN and the network crashes.
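If you want to catch this earlier in your own runs, here is a minimal sketch of the kind of instrumentation I mean. The tensor names below are placeholders, not the actual variables from convolution_ops.py.

import tensorflow as tf

# Placeholder operands: in the real code these would be the two tensors
# multiplied around line 240 of kernels/convolution_ops.py.
all_weights = tf.placeholder(tf.float32, [None, None, None])
neighborhood_features = tf.placeholder(tf.float32, [None, None, None])

weighted_features = tf.matmul(all_weights, neighborhood_features)

# Fail fast as soon as a NaN/Inf shows up, instead of letting it propagate
# through the rest of the network.
weighted_features = tf.check_numerics(weighted_features,
                                      "NaN/Inf after kernel-point matmul")

# Print the operand/result magnitudes so the suspicious >1e10 values are
# visible before the crash.
weighted_features = tf.Print(weighted_features,
                             [tf.reduce_max(tf.abs(all_weights)),
                              tf.reduce_max(tf.abs(neighborhood_features)),
                              tf.reduce_max(tf.abs(weighted_features))],
                             message="max |A|, |B|, |A@B| = ")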

For now I would just advise avoiding CUDA10 with an RTX 2080Ti.