tensorflow: TensorFlow programs using GPU freeze instance

Environment info

Operating System: Ubuntu 14.04 on AWS g2.2xlarge

Installed version of CUDA and cuDNN: cuda-7.5 cudnn-4.0.7

Output of ls -l /path/to/cuda/lib/libcud*:

lib/libcudart.so
lib/libcudart.so.7.5
lib/libcudart.so.7.5.18
lib/libcudart_static.a

If installed from sources, provide the commit hash: bc5e961e1988fdefff8e8aa062f4ab3066c3a9e5 (TensorFlow 0.8.0). I also tried 0.7.1 (commit 028d0b46004c921acd48fdd0ec18128d79e18bf4).

Steps to reproduce

All larger TensorFlow scripts freeze. I can run simple examples such as a matmul on the GPU, but all larger programs, whether my own or from the source tree (for example tensorflow/tensorflow/models/image/cifar10_train.py), freeze after a short time: there is no more output, and Ctrl-C and Ctrl-Z have no effect. The time until the freeze also varies; I once made it through 2 epochs of training my own NN before it froze.

Example output:

python cifar10_train.py

I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Downloading cifar-10-binary.tar.gz 100.0%
Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
^C ^Z ^C

After that, nothing happens (I waited much longer than a few minutes before pressing Ctrl-C).
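When a script can hang like this, it may help to launch it from a small wrapper that enforces a time budget, so a freeze does not tie up the terminal. The following is a minimal sketch, not part of the original report: `run_with_timeout` and its parameters are hypothetical names, and it assumes the wrapper itself runs under Python 3.3 or later (for `Popen.wait(timeout=...)`), even if the training script uses a different interpreter.

```python
import subprocess


def run_with_timeout(cmd, timeout_s, reap_s=5):
    """Run cmd as a child process.

    Returns the child's exit code, or None if it was still running
    after timeout_s seconds and had to be killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Send SIGKILL. A process blocked inside a kernel or driver call
        # may not exit until that call returns, so only wait briefly.
        proc.kill()
        try:
            proc.wait(timeout=reap_s)
        except subprocess.TimeoutExpired:
            pass  # child is still stuck; give up waiting for it
        return None
```

For example, `run_with_timeout(["python", "cifar10_train.py"], 600)` would return `None` if the script has not exited after ten minutes. A `None` result only means the time budget ran out; it does not by itself prove the process was frozen, and a process that ignores Ctrl-C as described above may also survive the kill.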

However, this script works and executes on the GPU:

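import tensorflow as tf

# 1x2 and 2x1 constant matrices; their product is a 1x1 matrix.
a = tf.constant([[3., 3.]])
b = tf.constant([[2.], [2.]])
c = tf.matmul(a, b)

# log_device_placement=True logs which device each op is assigned to.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))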

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 30 (8 by maintainers)

Most upvoted comments

Just a follow up on this. I finally tracked this down to a faulty motherboard. Sorry for the false alarm. In case anyone else runs into an error case similar to mine and you happen to be using an Asrock X99 WS-E motherboard, it seems to be a common quality control problem they have.

I solved this problem by installing a new driver from the NVIDIA site. Just download the latest driver for your card. The solution is from here: https://groups.google.com/d/msg/torch7/kLusyLEj4oc/MLRvcCy_FAAJ

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run

Then I installed everything except the samples and the driver:

chmod o+x cuda_7.5.18_linux.run
sudo ./cuda_7.5.18_linux.run

Then I downloaded and installed the new driver from http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/361.28/NVIDIA-Linux-x86_64-361.28.run&lang=us&type=GeForce:

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/361.28/NVIDIA-Linux-x86_64-361.28.run
chmod o+x NVIDIA-Linux-x86_64-361.28.run
sudo ./NVIDIA-Linux-x86_64-361.28.run
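After an install like the one above, it can be worth confirming which driver version the kernel has actually loaded. The sketch below is an illustration rather than something from the thread: the function names are hypothetical, and it assumes the NVIDIA kernel module exposes `/proc/driver/nvidia/version` with a line of the form `NVRM version: NVIDIA UNIX x86_64 Kernel Module  361.28  ...`. Other driver builds may format this line differently, in which case the function returns `None`.

```python
import re


def parse_driver_version(text):
    """Return the version string that follows 'Kernel Module', or None.

    Assumes a line such as (illustrative):
    'NVRM version: NVIDIA UNIX x86_64 Kernel Module  361.28  ...'
    """
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def read_driver_version(path="/proc/driver/nvidia/version"):
    """Read and parse the driver version file; None if it is unavailable."""
    try:
        with open(path) as f:
            return parse_driver_version(f.read())
    except IOError:
        # File missing or unreadable, e.g. the kernel module is not loaded.
        return None
```

Calling `read_driver_version()` on the machine in question and comparing the result with the installer's version (361.28 in the commands above) gives a quick check that the new driver is the one in use.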