keras: Crash when using multi_gpu_model and n_samples is not a multiple of batch_size
I had this error when trying to fit a multi_gpu_model
that fits just fine on a single GPU:
F tensorflow/stream_executor/cuda/cuda_dnn.cc:522] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 0 feature_map_count: 16 spatial: 128 128 128 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
After some investigation (because I was trying to file a decent bug report), it turns out this happens when I try to fit it using a batch_size that does not evenly divide the number of samples in my dataset.
I originally asked for help on SO here, where you can read more details. Maybe this cannot be changed, but my suggestion would be to have a more intelligible error message for a regular human like me. ;o)
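For anyone hitting this, here is a minimal workaround sketch; the variable names and array shapes below are placeholders, not taken from the original report. The idea is simply to drop the remainder so every batch is full before calling fit():

```python
import numpy as np

batch_size = 32
x_train = np.random.rand(1000, 64, 64, 1)          # stand-in for the real images
y_train = np.random.randint(0, 2, size=(1000, 1))  # stand-in labels

# keep only the largest multiple of batch_size (here 1000 -> 992 samples)
n = (len(x_train) // batch_size) * batch_size
x_train, y_train = x_train[:n], y_train[:n]
# then call fit() on the multi_gpu_model with the trimmed arrays as usual
```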
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 8
- Comments: 21
I think it’s worth noting, since this issue pops up first when googling this error: it also happens in the general case, when one passes an input whose spatial dimensions are too small, presumably for the pooling layers.
I am also having this problem, if anyone can help. When my code reaches train_on_batch(X, Y) I get this error:
2019-03-17 19:16:58.883468: F tensorflow/stream_executor/cuda/cuda_dnn.cc:542] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 2 feature_map_count: 76 spatial: 0 120 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
Then my code crashes.
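In case it helps narrow this down, a quick diagnostic sketch: print every layer's output shape and look for a 0, which would match the `spatial: 0 120` in the log above. The two-layer model here is only a stand-in for the real network:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# stand-in network; substitute the real model here
model = Sequential([
    Conv2D(76, (3, 3), padding='same', input_shape=(2, 240, 1)),
    MaxPooling2D((2, 2)),
])

# scan every layer's output shape for a 0
for layer in model.layers:
    print(layer.name, layer.output_shape)
```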
I believe the solution to my multi-GPU issue was to feed a number of samples that is a multiple of the number of GPUs I was using, so the data is fed into the model in pairs on a 2-GPU setup.
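A sketch of that fix, assuming the Keras 2.x API (the layer sizes here are made up): multi_gpu_model splits each incoming batch evenly across the GPUs, so the batch size should be a multiple of the GPU count.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

n_gpus = 2
model = Sequential([Dense(1, activation='sigmoid', input_shape=(16,))])

# each batch is split into n_gpus equal sub-batches, one per GPU
parallel_model = multi_gpu_model(model, gpus=n_gpus)
parallel_model.compile(loss='binary_crossentropy', optimizer='adam')

batch_size = 32 * n_gpus   # 32 samples per GPU per step
```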
I did not debug too much, but I got this error when my input images had too low a resolution (in particular 41 x 41 pixels). I would guess all the down-pooling leads to a 0x0 dimension, which crashes Python. A proper error message would be nicer, though.
Minimal failing colab: https://colab.research.google.com/drive/11wClV9iD1IVcu09zCn6jA-d1_FhsVvtl
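The shrinkage is easy to check with the standard output-size formula (assuming pool_size=2, stride=2, and 'valid' padding): six poolings take 41 all the way down to 0.

```python
def pooled(size, pool=2, stride=2):
    # standard 'valid' pooling output length
    return (size - pool) // stride + 1

size = 41
for step in range(6):
    size = pooled(size)
    print('after pooling %d: %d' % (step + 1, size))
# 41 -> 20 -> 10 -> 5 -> 2 -> 1 -> 0: the 0 is what cuDNN chokes on
```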
Update: my multi-GPU issue is fixed. I was using a batch size of 64, and the last batch had only 38 images. However, my generator was yielding 64 images (in each batch, the images were cast into a new variable whose first dimension was 64) but only 38 labels. Once I made the generator yield 38 images to match the 38 labels, the multi-GPU run worked.
The weird thing is that the single-GPU run still worked, and I only spotted the issue when the first epoch was about to end and the last batch was being processed. This implies that when multi-GPU support is enabled, there is some additional logic that performs these 'forward' checks before the first epoch even begins.
My configuration:
- Keras: 2.2.4
- Python: 3.6.8
- Machine: Google VM
- CPU platform: Intel Haswell
- GPU: 8x Nvidia K80
- RAM: 208 GB
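For the generator fix described above, here is a sketch using keras.utils.Sequence (the class and variable names are illustrative): slicing X and y with the same indices guarantees that the final partial batch has matching lengths.

```python
import numpy as np
from keras.utils import Sequence

class MatchedBatches(Sequence):
    def __init__(self, x, y, batch_size=64):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # ceil, so the final partial batch (e.g. 38 of 64) is still served
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size      # numpy clips slices past the end
        # identical slices keep images and labels the same length
        return self.x[lo:hi], self.y[lo:hi]
```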
What's more, it will also happen if the validation sample size is not a multiple of the batch size. I hadn't seen this before (could be luck); it only appeared after upgrading to the latest cuDNN version.
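A hedged sketch of the same trimming applied to the validation split (all names and shapes are placeholders):

```python
import numpy as np

batch_size = 32
x_val = np.random.rand(150, 64, 64, 1)           # stand-in validation data
y_val = np.random.randint(0, 2, size=(150, 1))

# 150 -> 128: keep only the largest multiple of batch_size
n_val = (len(x_val) // batch_size) * batch_size
x_val, y_val = x_val[:n_val], y_val[:n_val]
# pass (x_val, y_val) as validation_data to fit()
```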