tensorflow: tensor flow programs using gpu freeze instance
Environment info
Operating System: Ubuntu 14.04 on AWS g2.2xlarge
Installed version of CUDA and cuDNN: cuda-7.15 cudnn-4.0.7
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
lib/libcudart.so lib/libcudart.so.7.5 lib/libcudart.so.7.5.18 lib/libcudart_static.a
If installed from sources, provide the commit hash: commit bc5e961e1988fdefff8e8aa062f4ab3066c3a9e5 so tensorflow 0.8.0 but also tried 0.7.1 commit 028d0b46004c921acd48fdd0ec18128d79e18bf4
Steps to reproduce
all larger tensor flow scripts freeze. I can run simple examples like doing a matmul on the GPU but all larger programs, either my own or from the source (for example tensorflow/tensorflow/models/image/cifar10_train.py) freeze after a short time (no more output and not able to ctrl-C or ctrl-Z). Also the time of freeze seems to vary - I once made it through 2 epochs of training of my own NN before it froze.
example output:
python cifar10_train.py
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally Downloading cifar-10-binary.tar.gz 100.0% Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes. Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes. ^C ^Z ^C
and nothing happening (I did wait a lot longer than a few minutes before ctrl-C as well)
but this script here works and executes on GPU:
import tensorflow as tf
a = tf.constant([[3.,3.]])
b = tf.constant([[2.],[2.]])
c = tf.matmul(a,b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print sess.run(c)
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 30 (8 by maintainers)
Just a follow up on this. I finally tracked this down to a faulty motherboard. Sorry for the false alarm. In case anyone else runs into an error case similar to mine and you happen to be using an Asrock X99 WS-E motherboard, it seems to be a common quality control problem they have.
Solved this problem by installing new driver from Nvidia site. Just download the latest driver for your card. The solution is from here: https://groups.google.com/d/msg/torch7/kLusyLEj4oc/MLRvcCy_FAAJ