tensorflow: Backprop through conv2d with large tensors fails
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- TensorFlow installed from (source or binary): source
- TensorFlow version: 1.1.0
- Python version: 3.5.2
- Bazel version (if compiling from source): 0.4.5
- CUDA/cuDNN version: CUDA Version 8.0.61
- GPU model and memory: GeForce GTX 1080, 8GB
- Exact command to reproduce:
```python
import tensorflow as tf

# Input: batch of 720 single-channel 400 x 600 images (NCHW layout)
proj = tf.Variable(tf.random_normal([720, 1, 400, 600], stddev=2))
# 1-D filter of width 401, one input channel, one output channel
kernel = tf.Variable(tf.random_normal([1, 401, 1, 1], stddev=.5), trainable=True)

proj = tf.nn.conv2d(
    input=proj,
    filter=kernel,
    strides=[1, 1, 1, 1],
    padding='SAME',
    data_format='NCHW',
    name='ramlak-filter'
)

# Backprop through the convolution, feeding the output itself in as the
# upstream gradient
grad = tf.gradients(proj, kernel, proj)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))
```
Describe the problem
Backpropagation through conv2d fails for large tensors with the error shown in the log below; the code above reproduces it. Lowering the batch size in proj from 720 to 100 eliminates the error. The issue has been confirmed by another user on Stack Overflow.
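Until this is fixed, a workaround consistent with the batch-size observation is to compute the filter gradient over smaller batch chunks and sum the results (the filter gradient is a sum over the batch anyway). Below is a minimal sketch of that idea against the shapes from the repro above; the chunk count of 8 is an arbitrary choice, not something verified in this thread:

```python
import tensorflow as tf

inp = tf.Variable(tf.random_normal([720, 1, 400, 600], stddev=2))
kernel = tf.Variable(tf.random_normal([1, 401, 1, 1], stddev=.5), trainable=True)

# Split the batch into 8 chunks of 90 so each Conv2DBackpropFilter launch
# needs less scratch space, then sum the per-chunk filter gradients.
chunk_grads = []
for chunk in tf.split(inp, num_or_size_splits=8, axis=0):
    out = tf.nn.conv2d(chunk, kernel, strides=[1, 1, 1, 1],
                       padding='SAME', data_format='NCHW')
    chunk_grads.append(tf.gradients(out, kernel, out)[0])
grad = tf.add_n(chunk_grads)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))
```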
Source code / logs
```
NotFoundError (see above for traceback): No algorithm without scratch worked!
	 [[Node: gradients/ramlak-filter_grad/Conv2DBackpropFilter = Conv2DBackpropFilter[T=DT_FLOAT, data_format="NCHW", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable/read, gradients/ramlak-filter_grad/Shape_1, ramlak-filter)]]
```
Please find the full log attached: [log.txt](https://github.com/tensorflow/tensorflow/files/1128268/log.txt)
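One knob that may be relevant here (an untested assumption on my part, not something confirmed in this thread): TensorFlow bounds the cuDNN scratch space its GPU convolution kernels may allocate via the TF_CUDNN_WORKSPACE_LIMIT_IN_MB environment variable, and the error above says cuDNN found no algorithm that works within the scratch it was allowed. Raising the limit may or may not help for this particular failure, but trying it would look like this:

```python
import os

# TF_CUDNN_WORKSPACE_LIMIT_IN_MB is read by TensorFlow's GPU convolution
# kernels; set it before the first convolution runs. 4096 is an arbitrary
# example value, not a recommendation from this thread.
os.environ['TF_CUDNN_WORKSPACE_LIMIT_IN_MB'] = '4096'

import tensorflow as tf  # import after setting the variable, to be safe
```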
About this issue
- State: closed
- Created 7 years ago
- Comments: 29 (1 by maintainers)
Hi @ma0ho, some updates from our side:
Just updated to TF 1.2.1 (built from source), but the issue persists.
The same error occurs with conv3d when one dimension of the filter is set to 1. The real error for me is out of memory; when I change the filter size, the error disappears.
@yzhwang: No, exactly as given in the example: I'm using a 1-D filter of width 401 with one input and one output channel… I'm sure this is not an OOM problem.
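For reference, a sketch of the conv3d variant described two comments up; every shape below is an assumption for illustration (the comment only says that one dimension of the filter is set to 1):

```python
import tensorflow as tf

# Hypothetical conv3d repro: NDHWC input, filter depth set to 1.
# All shapes here are guesses, not taken from the comment above.
vol = tf.Variable(tf.random_normal([32, 16, 128, 128, 1], stddev=2))
kern = tf.Variable(tf.random_normal([1, 9, 9, 1, 1], stddev=.5), trainable=True)
out = tf.nn.conv3d(vol, kern, strides=[1, 1, 1, 1, 1], padding='SAME')
grad = tf.gradients(out, kern, out)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))
```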