tensorflow: Tutorial code freezes indefinitely on TF 2.4 with tf.function


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Education 1909 64 bit

  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Not used

  • TensorFlow installed from (source or binary): binary (pip install tensorflow==2.4.0rc1)

  • TensorFlow version (use command below): 2.4.0-rc1 (v2.4.0-rc0-30-gef82f4c66c)

  • Python version: 3.7.9

  • Bazel version (if compiling from source):

  • GCC/Compiler version (if compiling from source):

  • CUDA/cuDNN version: CUDA 11.0.3/cuDNN 8.0.2

  • GPU model and memory: RTX 2080ti 11 GB

Describe the current behavior When running the CycleGAN tutorial on a local machine, the program freezes during the training loop. It successfully executes two train_step calls before freezing indefinitely during the third. This does not happen with TF 2.3.1, and it also does not happen if the @tf.function decorator is removed from train_step.
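
For reference, the part of the tutorial involved is the @tf.function-decorated train_step. The snippet below is only a heavily condensed sketch of that pattern, not the tutorial code itself: the tiny Dense models, optimizer settings, and losses here are placeholders standing in for the tutorial's pix2pix generators and discriminators, just to show the structure that triggers the freeze.

import tensorflow as tf

# Stand-in models; the tutorial uses pix2pix U-Net generators and PatchGAN
# discriminators. These tiny Dense models only illustrate the structure.
generator_g = tf.keras.Sequential([tf.keras.layers.Dense(3)])
discriminator_y = tf.keras.Sequential([tf.keras.layers.Dense(1)])

gen_g_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
disc_y_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
loss_obj = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function  # removing this decorator avoids the freeze on TF 2.4.0rc1
def train_step(real_x, real_y):
    # Persistent tape, because gradients are taken for both networks.
    with tf.GradientTape(persistent=True) as tape:
        fake_y = generator_g(real_x, training=True)
        disc_real_y = discriminator_y(real_y, training=True)
        disc_fake_y = discriminator_y(fake_y, training=True)

        gen_g_loss = loss_obj(tf.ones_like(disc_fake_y), disc_fake_y)
        disc_y_loss = (loss_obj(tf.ones_like(disc_real_y), disc_real_y)
                       + loss_obj(tf.zeros_like(disc_fake_y), disc_fake_y))

    gen_grads = tape.gradient(gen_g_loss, generator_g.trainable_variables)
    disc_grads = tape.gradient(disc_y_loss, discriminator_y.trainable_variables)
    gen_g_optimizer.apply_gradients(zip(gen_grads, generator_g.trainable_variables))
    disc_y_optimizer.apply_gradients(zip(disc_grads, discriminator_y.trainable_variables))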

Describe the expected behavior The program should not freeze.

Standalone code to reproduce the issue Download the notebook from https://www.tensorflow.org/tutorials/generative/cyclegan and run it in a Jupyter notebook with TF 2.4 on Windows.

Other info / logs I also tried Python 3.6 and 3.8, cuDNN 8.0.5, and TensorFlow versions 2.4.0rc0 and 2.5.0-dev20201029, all with the same results.

No errors are printed when the program halts; this is the complete log:

2020-11-12 13:23:14.333154: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-11-12 13:23:22.692189: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-11-12 13:23:22.695520: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2020-11-12 13:23:22.764684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:21:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-11-12 13:23:22.770470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: pciBusID: 0000:4a:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-11-12 13:23:22.776596: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-11-12 13:23:23.156041: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-12 13:23:23.159089: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-11-12 13:23:23.197670: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-11-12 13:23:23.224469: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-11-12 13:23:23.423940: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-11-12 13:23:23.606110: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-11-12 13:23:24.736751: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-12 13:23:24.739994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2020-11-12 13:23:24.742445: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-12 13:23:25.086293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:21:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-11-12 13:23:25.091969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: pciBusID: 0000:4a:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5 coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2020-11-12 13:23:25.098150: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-11-12 13:23:25.101109: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-12 13:23:25.104096: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-11-12 13:23:25.107660: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-11-12 13:23:25.110613: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-11-12 13:23:25.114070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-11-12 13:23:25.117107: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-11-12 13:23:25.120321: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-12 13:23:25.124087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
2020-11-12 13:23:25.855306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-12 13:23:25.858740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0 1
2020-11-12 13:23:25.860826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N N
2020-11-12 13:23:25.862739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1: N N
2020-11-12 13:23:25.865414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8581 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:21:00.0, compute capability: 7.5)
2020-11-12 13:23:25.871513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8581 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:4a:00.0, compute capability: 7.5)
2020-11-12 13:23:25.877802: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-11-12 13:23:26.305762: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2020-11-12 13:23:26.553589: W tensorflow/core/kernels/data/cache_dataset_ops.cc:757] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
2020-11-12 13:23:26.563266: W tensorflow/core/kernels/data/cache_dataset_ops.cc:757] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
2020-11-12 13:23:26.940273: W tensorflow/core/kernels/data/cache_dataset_ops.cc:757] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
2020-11-12 13:23:26.951226: W tensorflow/core/kernels/data/cache_dataset_ops.cc:757] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
2020-11-12 13:23:30.401325: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-11-12 13:23:33.790176: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0
2020-11-12 13:23:33.839252: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0
2020-11-12 13:23:33.852673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-11-12 13:23:34.345317: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments


@zetez I’ve tried setting the CUDA_LAUNCH_BLOCKING environment variable to '1', and that seems to have resolved (or at least helped) the issue. With it set, the model has so far run for ~5k batches, versus 2 batches without it. I’ll still have to run the model overnight to confirm that it’s stable, though, as I’ve also had another GAN model (without instance normalization) run for a few hours before freezing. Try setting that environment variable on your end to see if it helps. Here’s the code to add at the top of your Python script, if you’re not sure how to do it:

import os
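# Note: set this before importing TensorFlow so the CUDA runtime picks it up
# when the GPU is initialized.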
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

Have a good day o/

Fixed this issue by switching to PyTorch. Closing the issue since I no longer have any interest in finding a solution.