tensorflow: Tensorflow crashes python when using Keras Convolution/Max Pooling layers on rtx 3070

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No.
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Pro
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary):
TensorFlow version (use command below): pip install tensorflow-gpu==2.4.0-rc0
Python version: 3.8
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 11.1/8
GPU model and memory: RTX 3070 (8gb)

Current Behavior Running “Code 1” [in standalone code sect.] works fine and engages the GPU, obviously the results are garbage but it is an illustrative example.

By adding either a convolutional layer, max pooling layer, or both and running “Code 2” [in standalone code sect.] causes python to crash and yields the terminal output in the “other info” section. The error comes when the model.fit(...) call is made; the model is able to compile successfully but crashes on the first epoch of training.

Expected behavior Since the standard sequential ANN trains fine on the GPU i would expect the CNN to be able to train as well.

Standalone code to reproduce the issue

Code 1 - works fine

from tensorflow.keras import datasets, layers, models

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

train_images, test_images = train_images / 255.0, test_images / 255.0

print(tf.config.list_physical_devices('GPU'))

model = models.Sequential()
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.compile(optimizer='Adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

history = model.fit(train_images, train_labels, batch_size=1, epochs=100)

Code 2 - crashes when .fit(…) is called on model

from tensorflow.keras import datasets, layers, models

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

train_images, test_images = train_images / 255.0, test_images / 255.0

print(tf.config.list_physical_devices('GPU'))

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.compile(optimizer='Adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

history = model.fit(train_images, train_labels, batch_size=1, epochs=100)

Other info / logs

2020-12-11 14:28:08.841765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-11 14:28:11.095172: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-12-11 14:28:11.096076: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2020-12-11 14:28:11.124073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 3070 computeCapability: 8.6
coreClock: 1.77GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-12-11 14:28:11.124461: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-11 14:28:11.136398: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-11 14:28:11.136622: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-11 14:28:11.139581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-11 14:28:11.140535: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-11 14:28:11.147342: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-11 14:28:11.149809: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-11 14:28:11.150352: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-11 14:28:11.150569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2020-12-11 14:28:11.153569: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-11 14:28:11.154591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 3070 computeCapability: 8.6
coreClock: 1.77GHz coreCount: 46 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-12-11 14:28:11.154972: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-11 14:28:11.155203: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-11 14:28:11.155383: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-11 14:28:11.155567: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-11 14:28:11.155753: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-11 14:28:11.155934: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-11 14:28:11.156112: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-11 14:28:11.156288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-11 14:28:11.156488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-11 14:28:11.595698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-11 14:28:11.595908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2020-12-11 14:28:11.596028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2020-12-11 14:28:11.596248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6177 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3070, pci bus id: 0000:01:00.0, compute capability: 8.6)
2020-12-11 14:28:11.597093: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-12-11 14:28:12.139175: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/100
2020-12-11 14:28:12.386035: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-11 14:28:13.019914: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-11 14:28:13.024711: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
Process finished with exit code -1073740791 (0xC0000409)

About this issue

Original URL
State: closed
Created 4 years ago
Comments: 16 (3 by maintainers)

Most upvoted comments

@ymodak UPDATE:

For whatever reason when run in the IDE terminal an error message was being suppressed and Process finished with exit code -1073740791 (0xC0000409) was logged as the error message.

When run from the command line the below error messages were displayed instead of logging the exit code error.

Could not load library cudnn_ops_infer64_8.dll. Error code 126
Please make sure cudnn_ops_infer64_8.dll is in your library path!

I recognized this was a package included in the cudnn library and copy and pasted it from the bin folder in cudnn to NVIDIA GPU computing toolkit > CUDA > V11.0 > bin. This process was repeated for the below packages and the issue was resolved.

cudnn_adv_infer64_8.dll
cudnn_adv_train64_8.dll
cudnn_cnn_infer64_8.dll
cudnn_cnn_train64_8.dll
cudnn_ops_infer64_8.dll
cudnn_ops_train64_8.dll

+21

taylrcawte on Jan 10, 2021

The problem is caused due to OOM issues My GPU RTX 3060 jumped to 12 GB as soon as a started training a simple CNN-

The only thing that helped in my case was: pip install tf-nightly-gpu

Resolved all the problems

AnelMusic on Dec 9, 2021

@ymodak UPDATE:

For whatever reason when run in the IDE terminal an error message was being suppressed and Process finished with exit code -1073740791 (0xC0000409) was logged as the error message.

When run from the command line the below error messages were displayed instead of logging the exit code error.
Could not load library cudnn_ops_infer64_8.dll. Error code 126
Please make sure cudnn_ops_infer64_8.dll is in your library path!
I recognized this was a package included in the cudnn library and copy and pasted it from the bin folder in cudnn to NVIDIA GPU computing toolkit > CUDA > V11.0 > bin. This process was repeated for the below packages and the issue was resolved.
cudnn_adv_infer64_8.dll
cudnn_adv_train64_8.dll
cudnn_cnn_infer64_8.dll
cudnn_cnn_train64_8.dll
cudnn_ops_infer64_8.dll
cudnn_ops_train64_8.dll

After struggling to get this work for the last 3 weeks, Installing different versions of CUDA and Tensorflow over and over again finally it worked.

I can’t believe I was making such a silly mistake over and over again the solution was right in front of me but I chose to ignore it.

Thank you @taylrcawte for sharing the update.

siddharth2395 on Aug 16, 2021