tensorflow: TensorFlow crashes when using a large image with a 3D convolutional network

I’m trying to implement a 3D fully convolutional network on my GPU, but for some reason I get a crash.

Environment info

Operating System: Ubuntu 14.04 LTS
GPU: GeForce Titan X

Installed version of CUDA and cuDNN: 8.0 and 5 (output of `ls -l /path/to/cuda/lib/libcud*` attached as cud.filelist.txt)

I installed TensorFlow version 0.11.0rc2, and the problem also reproduces in the Docker installation (gcr.io/tensorflow/tensorflow:latest-gpu).

Example code

The following code reproduces the problem:

import numpy as np
import tensorflow as tf

graph = tf.Graph()

with graph.as_default():
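    # Each 1x512x512x512x1 float32 tensor is 512**3 * 4 bytes = 512 MiB.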
    tf_dataset = tf.placeholder(tf.float32, shape=(1, 512, 512, 512, 1))
    tf_label = tf.placeholder(tf.float32, shape=(1, 512, 512, 512, 1))

    layer1_weights = tf.Variable(tf.truncated_normal((2, 2, 2, 1, 1), stddev=0.1))
    layer1_bias = tf.Variable(tf.zeros(1))

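    # Single 2x2x2 filter, 1 input channel -> 1 output channel, stride 1;
    # SAME padding keeps the full 512^3 spatial shape.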
    conv = tf.nn.conv3d(tf_dataset, layer1_weights, (1, 1, 1, 1, 1), padding='SAME')
    logits = tf.nn.relu(conv+layer1_bias)

    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_label))
    optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    batchData = np.random.rand(1, 512, 512, 512, 1).astype(np.float32)
    batchLabels = (np.random.rand(1, 512, 512, 512, 1)>0.5).astype(np.float32)
    feed_dict = {tf_dataset : batchData, tf_label : batchLabels}
    _ = session.run((optimizer, ), feed_dict=feed_dict)

with the following output:

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:01:00.0
Total memory: 11.92GiB
Free memory: 11.68GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0)
F tensorflow/stream_executor/cuda/cuda_dnn.cc:2440] failed to enqueue convolution on stream: CUDNN_STATUS_NOT_SUPPORTED

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 48 (8 by maintainers)

Most upvoted comments

Can confirm this issue as well. I’m having exactly the same problem! Would be great if this could be resolved 😃

@prb12, @zheng-xq any updates on this issue?

Can confirm a similar issue with large 3D convolutions:

import numpy as np
import tensorflow as tf

inshape = (2,258,258,34,1)  # tf img-order
filter_shape = (3,3,3,1,32)
x = tf.placeholder('float32', shape=inshape, name='X')
f = tf.placeholder('float32', shape=filter_shape, name='filter')
c = tf.nn.conv3d(x, f, padding='VALID', strides=[1,1,1,1,1])
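# building the gradient w.r.t. the filter is what seems to trigger the failure (see below)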
grads = tf.gradients(c,f)[0]

xx = np.random.rand(*inshape)
ff = np.random.rand(*filter_shape)

with tf.Session().as_default():
    Q = c.eval(feed_dict={x:xx, f:ff})
    gradQ = grads.eval(feed_dict={x:xx, f:ff})

will yield tensorflow/stream_executor/cuda/cuda_dnn.cc:2674] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED. It runs fine if I set inshape = (2,257,257,34,1) instead, so there’s some issue with the sizes. However, I don’t think it’s the usual GPU out-of-memory issue:

  1. The same convolution runs perfectly fine in Theano (there I can even increase the shape to (2,300,300,34,1) without trouble).
  2. It doesn’t respond to all dimensions the same way: decreasing the batch size from 2 to 1 will still crash even though we just halved the input.

It turns out that it’s somehow related to the gradient computation: if I comment out the gradQ = ... line and just evaluate the result of the convolution, it works (see the sketch below)!

Code was run on TensorFlow 0.11.0rc2 (the same happens with 0.12.1), Titan X, CUDA 8.0, cuDNN 5.
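For reference, a minimal self-contained sketch of the forward-only variant that does run (same shapes as above, just without building the gradient op):

import numpy as np
import tensorflow as tf

inshape = (2, 258, 258, 34, 1)
filter_shape = (3, 3, 3, 1, 32)

x = tf.placeholder('float32', shape=inshape, name='X')
f = tf.placeholder('float32', shape=filter_shape, name='filter')
c = tf.nn.conv3d(x, f, padding='VALID', strides=[1, 1, 1, 1, 1])
# no tf.gradients(c, f) here, so only the forward convolution is enqueued

xx = np.random.rand(*inshape).astype('float32')
ff = np.random.rand(*filter_shape).astype('float32')

with tf.Session().as_default():
    Q = c.eval(feed_dict={x: xx, f: ff})  # completes without the cuDNN failure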

Similar problem here with a deep 2D convolutional network and a huge batch size (because my input is [seq_length * batch_size, height, width]).

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:02:00.0
Total memory: 11.90GiB
Free memory: 11.75GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x3a9e500
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties: 
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 4.99GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
F tensorflow/stream_executor/cuda/cuda_dnn.cc:2684] failed to enqueue convolution on stream: CUDNN_STATUS_NOT_SUPPORTED
Aborted (core dumped)

Changing the batch size to a smaller one solves the issue. It’d be great if it could throw a Python exception instead of aborting :)
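For reference, a rough sketch of that workaround: split the oversized batch along its first dimension and run the graph once per chunk. The names sess, fetch, inputs_ph and the chunk size here are only illustrative, not from this thread:

import numpy as np

def run_in_chunks(sess, fetch, inputs_ph, big_batch, chunk_size=64):
    # Run the same fetch repeatedly on slices of the batch dimension
    # and stitch the partial results back together.
    parts = []
    for start in range(0, big_batch.shape[0], chunk_size):
        chunk = big_batch[start:start + chunk_size]
        parts.append(sess.run(fetch, feed_dict={inputs_ph: chunk}))
    return np.concatenate(parts, axis=0)

This only works if the placeholder’s batch dimension is left as None, and scalar fetches such as a loss would need to be averaged rather than concatenated.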

@deepak09027 Unfortunately I don’t have a solution yet. I was hoping a Googler would pick it up here 😃

@HggsHntr I’m just a simple computer scientist. Your questions are reasonable, but it’s hard for me to think beyond testing whether a smaller image works or not.

I am having the same issue as well