tensorflow: NaN problem on tutorials_example_trainer with tensorflow 0.8

I worked for several months on versions of tensorflow installed from sources. Today, I tried to upgrade it to 0.8. The installation seemed to go fine. “bazel-bin/tensorflow/cc/tutorials_example_trainer” gave the expected result. However, the output of “bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu” is unexpected: the first lines seem to be fine but then NaN values appear, the proportion of NaN values increases until there is nothing but NaN values (see the first half of the output here http://pastebin.com/EhuD2Z5S , the remaining lines are only “nan” lines which I omitted because the total output exceeded pastebin limit).

I expected this tutorials_example_trainer to either work seamlessly or not at all (if I provided wrong paths for CUDA and CuDNN), but this in-between puzzles me.

I tried to debug it with my own programs, but I can not import tensorflow :

import tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "tensorflow/python/__init__.py", line 45, in <module>
    from tensorflow.python import pywrap_tensorflow
ImportError: cannot import name pywrap_tensorflow

I am not sure whether the two problems are related.

In the case I made a mistake during configuration, I typed:

matthieu@gpu2:~/external/tensorflow$ ./configure 
Please specify the location of python. [Default is /usr/bin/python]: 
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: 
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 7.0
Please specify the location where CUDA 7.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 4
Please specify the location where cuDNN 4 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 
Setting up Cuda include
Setting up Cuda lib64
Setting up Cuda bin
Setting up Cuda nvvm
Setting up CUPTI include
Setting up CUPTI lib64
Configuration finished"

I hope this issue belongs here, else I will try stackoverflow.

Thanks in advance,

Environment info

I work on ubuntu 14.04

With CUDA 7.0 and cuDNN 4

ls /usr/local/cuda-7.0/lib/libcud*        
/usr/local/cuda-7.0/lib/libcudadevrt.a  /usr/local/cuda-7.0/lib/libcudart.so.7.0     /usr/local/cuda-7.0/lib/libcudart_static.a
/usr/local/cuda-7.0/lib/libcudart.so    /usr/local/cuda-7.0/lib/libcudart.so.7.0.28

Commit hash: e1c017624fd6bd8562c1da27e0014b32f389b524

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 30 (12 by maintainers)

Most upvoted comments

Found the issue when I searched the same problem. I git pulled the latest source and got this same issue: tutorials_example_trainer output some NaNs. Here are the last few lines of the output.

000004/000005 lambda =      nan x = [     nan      nan] y = [     nan      nan]
000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000004/000003 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000001/000007 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000006/000009 lambda =      nan x = [     nan      nan] y = [     nan      nan]

I’m on Ubuntu 14.04 LTS 64 bit, gcc 4.8.4, CUDA 7.5.