tensorflow: TensorFlow programs using the GPU freeze the instance

Environment info

Operating System: Ubuntu 14.04 on AWS g2.2xlarge

Installed version of CUDA and cuDNN: cuda-7.5 cudnn-4.0.7 (please attach the output of ls -l /path/to/cuda/lib/libcud*):

lib/libcudart.so
lib/libcudart.so.7.5
lib/libcudart.so.7.5.18
lib/libcudart_static.a

If installed from sources, provide the commit hash: commit bc5e961e1988fdefff8e8aa062f4ab3066c3a9e5 (TensorFlow 0.8.0); I also tried 0.7.1 (commit 028d0b46004c921acd48fdd0ec18128d79e18bf4).

Steps to reproduce

All larger TensorFlow scripts freeze. I can run simple examples such as a matmul on the GPU, but all larger programs, whether my own or from the source tree (for example tensorflow/tensorflow/models/image/cifar10_train.py), freeze after a short time: there is no more output, and Ctrl-C and Ctrl-Z have no effect. The time until the freeze also varies; I once made it through 2 epochs of training my own NN before it froze.

Example output:

python cifar10_train.py

I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Downloading cifar-10-binary.tar.gz 100.0%
Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
^C
^Z
^C

After that, nothing happens (I also waited a lot longer than a few minutes before pressing Ctrl-C).

However, this script works and executes on the GPU:

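import tensorflow as tf

# 1x2 and 2x1 constant matrices; their product is a 1x1 matrix.
a = tf.constant([[3., 3.]])
b = tf.constant([[2.], [2.]])
c = tf.matmul(a, b)

# log_device_placement=True logs which device each op is assigned to.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))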
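
Because the time until the freeze varies, it can help to run the larger script under a watchdog that reports when a run exceeds a time budget. The following is only a minimal sketch, assuming Python 3.3 or later for the watchdog itself (the script it launches can use any interpreter); run_with_timeout is a hypothetical helper name, not part of TensorFlow. Note that a process stuck inside a driver call may not exit even after it is killed, which would match the unresponsive Ctrl-C described above.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # kill() sends SIGKILL on POSIX. A process blocked in uninterruptible
        # sleep (for example inside a driver call) may still not go away.
        proc.kill()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            pass  # still stuck; report the timeout anyway
        return None
```

For example, run_with_timeout(["python", "cifar10_train.py"], 600) would return None if the training script is still running after ten minutes.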

About this issue

State: closed
Created 8 years ago
Comments: 30 (8 by maintainers)

Most upvoted comments

Just a follow-up on this. I finally tracked this down to a faulty motherboard. Sorry for the false alarm. In case anyone else runs into an error similar to mine and you happen to be using an ASRock X99 WS-E motherboard, this seems to be a common quality control problem with that board.

dojoteef on Nov 3, 2016

I solved this problem by installing a new driver from the NVIDIA site: just download the latest driver for your card. The solution comes from here: https://groups.google.com/d/msg/torch7/kLusyLEj4oc/MLRvcCy_FAAJ

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run

Then I installed everything except the samples and the driver:

chmod o+x cuda_7.5.18_linux.run
sudo ./cuda_7.5.18_linux.run

Then I downloaded and installed the new driver from http://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/XFree86/Linux-x86_64/361.28/NVIDIA-Linux-x86_64-361.28.run&lang=us&type=GeForce:

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/361.28/NVIDIA-Linux-x86_64-361.28.run
chmod o+x NVIDIA-Linux-x86_64-361.28.run
sudo ./NVIDIA-Linux-x86_64-361.28.run

mbektimirov on Aug 31, 2016
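
To confirm which driver is actually loaded after an update like the one above, the version can be read back from nvidia-smi. This is a rough sketch, not something taken from the thread: it assumes nvidia-smi is on the PATH and supports the --query-gpu option, and the helper names (parse_version, installed_driver_version, is_at_least) are made up for illustration. The 361.28 threshold is simply the version mentioned in the comment above.

```python
import subprocess


def parse_version(text):
    """Turn a driver version string such as '361.28' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        )
        lines = out.decode().splitlines()
        # With several GPUs there is one line per GPU; use the first.
        return parse_version(lines[0]) if lines else None
    except (OSError, subprocess.CalledProcessError, ValueError):
        # nvidia-smi missing, failing, or printing something unexpected.
        return None


def is_at_least(version, minimum):
    """True if version is known and not older than minimum."""
    return version is not None and version >= minimum


if __name__ == "__main__":
    version = installed_driver_version()
    print("driver version:", version)
    print("at least 361.28:", is_at_least(version, (361, 28)))
```

On a machine without an NVIDIA driver the script reports None rather than raising, so it is safe to run before and after the driver installation.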