keras: Crash when using multi_gpu_model and n_sample is not a multiple of batch_size

I had this error when trying to fit a multi_gpu_model that fits just fine on a single GPU:

F tensorflow/stream_executor/cuda/cuda_dnn.cc:522] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128  value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}

After some investigation (because I was trying to file a decent bug report), it turns out this happens when I try to fit it with a batch_size that is not a divisor of the number of samples in my dataset.

I originally asked for help on SO here, where you can read more details. Maybe this cannot be changed, but my suggestion would be to have a more intelligible error message for a regular human like me. ;o)
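For context, the failing setup looks roughly like the sketch below. The model and array shapes are assumptions reconstructed from the error log (a 3D convolution with 16 filters on a 128x128x128 volume), not the reporter's actual code; the key point is that the sample count is not a multiple of the batch size, so the last batch holds a single sample and one GPU replica receives an empty slice (the "count: 0" in the log).

import numpy as np
from keras.models import Sequential
from keras.layers import Conv3D, GlobalAveragePooling3D, Dense
from keras.utils import multi_gpu_model

# Assumed toy model; 'same' padding keeps the 128x128x128 spatial size seen
# in the log, and the 16 filters match its feature_map_count.
model = Sequential([
    Conv3D(16, 3, padding='same', activation='relu',
           input_shape=(128, 128, 128, 1)),
    GlobalAveragePooling3D(),
    Dense(1, activation='sigmoid'),
])

parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='binary_crossentropy')

# 65 samples with batch_size=64: the last batch has 1 sample, which is split
# across 2 GPUs, so one replica gets a batch of size 0 and cuDNN aborts.
X = np.random.rand(65, 128, 128, 128, 1).astype('float32')
y = np.random.randint(0, 2, size=65)
parallel_model.fit(X, y, batch_size=64)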

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 8
  • Comments: 21

Most upvoted comments

I think it’s worth noting, since this issue pops up first when googling this error: it also happens in the general case when one passes spatial input that is too small, presumably for the pooling layers.
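A back-of-the-envelope sketch (the kernel size and stride are assumptions, not taken from any particular model) of how a small input collapses to a zero spatial dimension after a few downsampling layers, at which point the cuDNN tensor descriptor becomes invalid:

def out_size(size, kernel=3, stride=2):
    # Output spatial size of one 'valid' convolution/pooling step.
    return max(0, (size - kernel) // stride + 1)

size = 41
for layer in range(1, 7):
    size = out_size(size)
    print('after downsampling layer %d: %dx%d' % (layer, size, size))
# 41 -> 20 -> 9 -> 4 -> 1 -> 0 -> 0: once a dimension hits 0, the process
# aborts with the cuDNN check failure instead of raising a Python exception.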

I am also having this problem, if anyone can help. When my code reaches train_on_batch(X, Y), I get this error:

2019-03-17 19:16:58.883468: F tensorflow/stream_executor/cuda/cuda_dnn.cc:542] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 2 feature_map_count: 76 spatial: 0 120 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}

Then my code crashes.

I believe the solution to my multi-GPU issue was to feed a number of samples that is a multiple of the number of GPUs I was using, so that the data is fed into the model in pairs for a 2-GPU setup.
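One workaround along these lines (a sketch with assumed shapes and a hypothetical helper, not code from the thread) is to trim the training set so its length is a multiple of batch_size * n_gpus, which removes the under-filled last batch entirely:

import numpy as np

def trim_to_multiple(X, y, batch_size, n_gpus):
    # Drop the trailing samples that would form a partial last batch.
    usable = (len(X) // (batch_size * n_gpus)) * (batch_size * n_gpus)
    return X[:usable], y[:usable]

X = np.random.rand(500, 128, 128, 3)
y = np.random.randint(0, 2, size=500)
X_trim, y_trim = trim_to_multiple(X, y, batch_size=64, n_gpus=2)
print(len(X_trim))  # 384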

I did not debug too much, but I got this error when my input images had too low a resolution (in particular 41 x 41 pixels). I would guess all the down-pooling leads to a 0x0 dimension, which crashes Python. A proper error would be nicer, though.

Minimal failing colab: https://colab.research.google.com/drive/11wClV9iD1IVcu09zCn6jA-d1_FhsVvtl

from keras.applications.inception_v3 import InceptionV3
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D

import numpy
from sklearn.model_selection import train_test_split

# InceptionV3 without a fixed input_shape, so the spatial size is only
# checked at run time.
base_model = InceptionV3(weights='imagenet', include_top=False)

# Small classification head on top of the frozen base.
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(2, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=predictions)

for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# 41 x 41 images are too small for InceptionV3's downsampling stack.
X = numpy.random.rand(500, 41, 41, 3)
y = numpy.random.rand(500) > .5

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Crashes the process with the cuDNN descriptor error below.
model.fit(X_train[:2], y_train[:2])

# Log error:
# Jul 8, 2019, 12:01:43 PM	WARNING	2019-07-08 10:01:43.055132: F tensorflow/stream_executor/cuda/cuda_dnn.cc:516] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 2 feature_map_count: 288 spatial: %d 0%d 0 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
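One way to turn this into a readable error rather than a hard crash (a suggestion of mine, not from the thread): pass an explicit input_shape so Keras can validate the minimum spatial size when the model is built, instead of failing inside cuDNN at fit time.

from keras.applications.inception_v3 import InceptionV3

# With an explicit input_shape, construction fails with a ValueError along
# the lines of "Input size must be at least 75x75" instead of aborting the
# whole process during fit().
base_model = InceptionV3(weights='imagenet', include_top=False,
                         input_shape=(41, 41, 3))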

I can confirm this issue on my machine as well. After looking at this post, I ran with the single-GPU configuration and it started working. With multi-GPU (8x Nvidia K80), it doesn’t work.

Update: my multi-GPU issue was fixed. I was using a batch size of 64, and the last batch contained only 38 images. However, my generator was yielding 64 images (in each batch the images were cast into a new array whose first dimension was 64) but only 38 labels. Once the generator yielded 38 images to match the 38 labels, multi-GPU training worked.

The weird thing is that single-GPU training was still working, and I only figured out the issue when the first epoch was about to end and the last batch was processed. This implies that when multi-GPU support is enabled, there is some additional logic that performs these ‘forward’ checks before the first epoch even begins.

My configuration: Keras 2.2.4, Python 3.6.8, machine: Google VM, CPU platform: Intel Haswell, GPU: 8x Nvidia K80, RAM: 208 GB.
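A minimal sketch (hypothetical class and names) of the generator fix described above: slicing images and labels with the same bounds keeps the final, smaller batch consistent instead of pairing 64 images with 38 labels.

import numpy as np
from keras.utils import Sequence

class ImageSequence(Sequence):
    def __init__(self, X, y, batch_size=64):
        self.X, self.y, self.batch_size = X, y, batch_size

    def __len__(self):
        # Ceiling division so the final, partial batch is still served.
        return int(np.ceil(len(self.X) / float(self.batch_size)))

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        # Same slice bounds for images and labels, so the last batch yields
        # 38 of each rather than 64 images and 38 labels.
        return self.X[lo:hi], self.y[lo:hi]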

What’s more, it will also happen if the validation sample size is not a multiple of the batch size. I hadn’t run into this before (which could be luck); it only appeared after upgrading to the latest cuDNN version.
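The trimming workaround from earlier in the thread can be applied to the validation split as well (again a sketch with assumed shapes):

import numpy as np

batch_size, n_gpus = 64, 2
X_val = np.random.rand(300, 128, 128, 3)
y_val = np.random.randint(0, 2, size=300)

# Keep only as many validation samples as fill whole per-GPU batches.
usable = (len(X_val) // (batch_size * n_gpus)) * (batch_size * n_gpus)
X_val, y_val = X_val[:usable], y_val[:usable]  # keeps the first 256 samples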