tensorflow: MirroredStrategy Keras Example Hangs
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): Jupyter Notebook of an official Kubeflow image gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0 on my Kubeflow platform.
- TensorFlow version (use command below): 2.1.0-gpu
- Python version: 3.6.9
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
- CUDA/cuDNN version: release 10.1, V10.1.243, but I can't find the cuDNN libraries using the command cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
- GPU model and memory: T4 / 16 GB
You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
- TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
- TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Output: v1.12.1-30591-g2d3828de27 2.2.0-dev20200427
Describe the current behavior
In my notebook, which has 4 GPUs, I ran the MirroredStrategy Keras example documented here. Memory on all GPUs is allocated, but after printing the logs below, the example hangs.
- logs
mkc_choi@hbseo-m$ kubectl -nhanbae-seo logs -cdemo02 demo02-0
[W 09:04:55.882 NotebookApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
[I 09:04:56.129 NotebookApp] JupyterLab extension loaded from /usr/local/lib/python3.6/dist-packages/jupyterlab
[I 09:04:56.129 NotebookApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 09:04:56.349 NotebookApp] Serving notebooks from local directory: /home/jovyan
[I 09:04:56.349 NotebookApp] The Jupyter Notebook is running at:
[I 09:04:56.349 NotebookApp] http://demo02-0:8888/notebook/hanbae-seo/demo02/
[I 09:04:56.349 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 09:05:07.112 NotebookApp] 302 GET /notebook/hanbae-seo/demo02/ (127.0.0.1) 0.85ms
[I 09:05:12.368 NotebookApp] Creating new notebook in
[I 09:05:13.255 NotebookApp] Kernel started: 85970d11-8ffb-4259-a0f9-29614d194712
[I 09:07:14.315 NotebookApp] Saving file at /Untitled1.ipynb
[I 09:07:31.203 NotebookApp] Starting buffering for 85970d11-8ffb-4259-a0f9-29614d194712:f7966791845643a9bcb0bc02a3b60f8c
2020-04-28 09:07:46.689575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-28 09:07:54.871181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-28 09:07:55.100937: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.101951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 09:07:55.102094: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.103013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 09:07:55.103126: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.104070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 09:07:55.104191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.105138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 09:07:55.105182: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-28 09:07:55.108010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-28 09:07:55.110338: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-28 09:07:55.111236: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-28 09:07:55.113969: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-28 09:07:55.115693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-28 09:07:55.120921: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-28 09:07:55.121047: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.122079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.123059: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.124069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.125118: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.126190: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.127161: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.128182: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:55.129220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1686] Adding visible gpu devices: 0, 1, 2, 3
2020-04-28 09:07:55.130294: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-04-28 09:07:55.138816: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2200000000 Hz
2020-04-28 09:07:55.139503: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x58629a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-28 09:07:55.139544: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-28 09:07:56.141806: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.157822: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.171327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.191060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.193694: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x208f270 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-28 09:07:56.193722: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-04-28 09:07:56.193728: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2020-04-28 09:07:56.193733: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2020-04-28 09:07:56.193738: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2020-04-28 09:07:56.199679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.201632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 09:07:56.201722: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.203759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 09:07:56.203839: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.205793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 09:07:56.205876: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.207830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 09:07:56.207873: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-28 09:07:56.207902: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-28 09:07:56.207918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-28 09:07:56.207933: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-28 09:07:56.207942: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-28 09:07:56.207955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-28 09:07:56.207965: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-28 09:07:56.208030: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.209924: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.211728: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.212713: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.213698: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.214697: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.215598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.216512: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:56.217462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1686] Adding visible gpu devices: 0, 1, 2, 3
2020-04-28 09:07:56.217520: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-28 09:07:59.142161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1085] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-28 09:07:59.142208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1091] 0 1 2 3
2020-04-28 09:07:59.142216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] 0: N Y N N
2020-04-28 09:07:59.142221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] 1: Y N N N
2020-04-28 09:07:59.142226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] 2: N N N Y
2020-04-28 09:07:59.142230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] 3: N N Y N
2020-04-28 09:07:59.142584: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:59.143574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:59.144696: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:59.145672: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:59.146611: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:59.147517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1230] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13969 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2020-04-28 09:07:59.148310: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:59.149214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1230] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 13969 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5)
2020-04-28 09:07:59.149788: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:59.150683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1230] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 13969 MB memory) -> physical GPU (device: 2, name: Tesla T4, pci bus id: 0000:00:06.0, compute capability: 7.5)
2020-04-28 09:07:59.151243: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 09:07:59.152188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1230] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 13969 MB memory) -> physical GPU (device: 3, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5)
2020-04-28 09:07:59.157773: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
[I 09:09:33.647 NotebookApp] Saving file at /Untitled1.ipynb
[I 09:10:40.103 NotebookApp] 302 GET /notebook/hanbae-seo/demo02/ (127.0.0.1) 0.62ms
[I 09:12:43.364 NotebookApp] Saving file at /Untitled1.ipynb
[I 09:14:42.569 NotebookApp] Kernel interrupted: 85970d11-8ffb-4259-a0f9-29614d194712
[I 09:14:44.638 NotebookApp] Kernel interrupted: 85970d11-8ffb-4259-a0f9-29614d194712
[I 09:14:46.183 NotebookApp] Kernel interrupted: 85970d11-8ffb-4259-a0f9-29614d194712
[I 09:14:50.198 NotebookApp] Kernel interrupted: 85970d11-8ffb-4259-a0f9-29614d194712
[I 09:14:52.650 NotebookApp] Kernel interrupted: 85970d11-8ffb-4259-a0f9-29614d194712
[I 09:14:54.470 NotebookApp] Kernel interrupted: 85970d11-8ffb-4259-a0f9-29614d194712
[I 09:14:55.575 NotebookApp] Kernel interrupted: 85970d11-8ffb-4259-a0f9-29614d194712
[I 09:14:58.246 NotebookApp] Kernel interrupted: 85970d11-8ffb-4259-a0f9-29614d194712
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python3.6/logging/__init__.py", line 1945, in shutdown
h.flush()
File "/usr/local/lib/python3.6/dist-packages/absl/logging/__init__.py", line 892, in flush
self._current_handler.flush()
File "/usr/local/lib/python3.6/dist-packages/absl/logging/__init__.py", line 785, in flush
self.stream.flush()
File "/usr/local/lib/python3.6/dist-packages/ipykernel/iostream.py", line 341, in flush
if self.pub_thread.thread.is_alive():
AttributeError: 'NoneType' object has no attribute 'thread'
[I 09:15:06.715 NotebookApp] Kernel restarted: 85970d11-8ffb-4259-a0f9-29614d194712
2020-04-28 09:15:12.082666: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-28 09:15:42.166324: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-28 09:15:42.402363: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
...
2020-04-28 11:50:42.127628: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.130778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 11:50:42.130944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.135680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 11:50:42.135781: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.143098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 11:50:42.143189: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.150307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 11:50:42.150351: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-28 11:50:42.152504: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-28 11:50:42.154902: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-28 11:50:42.155426: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-28 11:50:42.157576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-28 11:50:42.158808: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-28 11:50:42.163515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-28 11:50:42.163622: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.170033: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.173795: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.178082: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.187495: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.194851: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.202891: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.208997: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:42.213464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1686] Adding visible gpu devices: 0, 1, 2, 3
2020-04-28 11:50:42.214783: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-04-28 11:50:42.222662: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2200000000 Hz
2020-04-28 11:50:42.223394: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d3a7c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-28 11:50:42.223423: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-28 11:50:43.541123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.558363: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.569507: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.585723: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.587840: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5669ea0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-28 11:50:43.587865: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-04-28 11:50:43.587871: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2020-04-28 11:50:43.587876: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2020-04-28 11:50:43.587880: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2020-04-28 11:50:43.593961: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.595914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 11:50:43.595983: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.597987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 11:50:43.598069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.600110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 11:50:43.600215: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.602185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1544] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-04-28 11:50:43.602229: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-28 11:50:43.602250: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-28 11:50:43.602265: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-28 11:50:43.602275: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-28 11:50:43.602287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-28 11:50:43.602296: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-28 11:50:43.602305: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-28 11:50:43.602363: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.604392: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.606481: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.608546: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.610742: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.612800: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.615044: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.616887: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:43.618745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1686] Adding visible gpu devices: 0, 1, 2, 3
2020-04-28 11:50:43.618810: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-28 11:50:45.761140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1085] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-28 11:50:45.761202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1091] 0 1 2 3
2020-04-28 11:50:45.761210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] 0: N Y N N
2020-04-28 11:50:45.761216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] 1: Y N N N
2020-04-28 11:50:45.761220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] 2: N N N Y
2020-04-28 11:50:45.761225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] 3: N N Y N
2020-04-28 11:50:45.761562: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:45.762656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:45.763675: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:45.764693: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:45.765747: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:45.766736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1230] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13969 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2020-04-28 11:50:45.767518: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:45.768518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1230] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 13969 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5)
2020-04-28 11:50:45.769142: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:45.770076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1230] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 13969 MB memory) -> physical GPU (device: 2, name: Tesla T4, pci bus id: 0000:00:06.0, compute capability: 7.5)
2020-04-28 11:50:45.770664: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-28 11:50:45.771659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1230] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 13969 MB memory) -> physical GPU (device: 3, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5)
2020-04-28 11:50:47.091531: I tensorflow/core/profiler/lib/profiler_session.cc:154] Profiler session started.
2020-04-28 11:50:47.091683: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1372] Profiler found 4 GPUs
2020-04-28 11:50:47.093991: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
2020-04-28 11:50:47.194680: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1422] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-04-28 11:50:47.283735: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.302659: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.321059: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.341426: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.342187: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.344125: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.345834: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.347581: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.404901: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.424905: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.444686: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.463115: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.463881: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.465529: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.467173: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:47.468860: W tensorflow/core/kernels/data/captured_function.cc:458] Disabling multi-device execution for a function that uses the experimental_ints_on_device attribute.
2020-04-28 11:50:51.405795: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-28 11:50:52.668100: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
- Output generated by the example in the Notebook
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
2.2.0-dev20200427
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
Number of devices: 4
Epoch 1/12
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 6 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
- The MirroredStrategy Keras example I applied (https://www.tensorflow.org/tutorials/distribute/keras)
import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

import os

print(tf.__version__)

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']

strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# You can also do info.splits.total_num_examples to get the total
# number of examples in the dataset.
num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

def scale(image, label):
  image = tf.cast(image, tf.float32)
  image /= 255
  return image, label

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
  ])

  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

# Define the checkpoint directory to store the checkpoints
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Function for decaying the learning rate.
# You can define any decay function you need.
def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif epoch >= 3 and epoch < 7:
    return 1e-4
  else:
    return 1e-5

# Callback for printing the LR at the end of each epoch.
class PrintLR(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    print('\nLearning rate for epoch {} is {}'.format(epoch + 1,
                                                      model.optimizer.lr.numpy()))

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]

model.fit(train_dataset, epochs=12, callbacks=callbacks)
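- One untested idea for narrowing this down (not from the tutorial, just a sketch): since the log stops right after the NCCL batch_all_reduce messages, the strategy could be constructed with a non-NCCL cross-device op to check whether the hang is in the NCCL all-reduce itself.

# Sketch only: replace MirroredStrategy's default NCCL all-reduce with a
# non-NCCL implementation to see whether the hang goes away.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
# or, even simpler, reduce everything on a single device:
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.ReductionToOneDevice())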
- GPU memory is occupied on all GPUs, but the GPU-Util / Compute M. reading is 0% on every GPU.
mkc_choi@hbseo-g1:~/tf-operator/examples/v1/multi$ nvidia-smi
Tue Apr 28 13:50:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 71C P0 32W / 70W | 14612MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:05.0 Off | 0 |
| N/A 41C P0 27W / 70W | 14612MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:00:06.0 Off | 0 |
| N/A 41C P0 26W / 70W | 14612MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:00:07.0 Off | 0 |
| N/A 41C P0 27W / 70W | 14612MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 24210 C /usr/bin/python3 14601MiB |
| 1 24210 C /usr/bin/python3 14601MiB |
| 2 24210 C /usr/bin/python3 14601MiB |
| 3 24210 C /usr/bin/python3 14601MiB |
+-----------------------------------------------------------------------------+
- When I restrict CUDA_VISIBLE_DEVICES to a single device as below, the example runs, but only on that one GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
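An equivalent way to limit the run to a single GPU is the tf.config API instead of the environment variable (untested sketch; it has to run before anything initializes the GPUs):

import tensorflow as tf

# Sketch: make only the first physical GPU visible to TensorFlow.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[0], 'GPU')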
- I also tried setting tf.data.experimental.AutoShardPolicy.OFF, as described in an existing issue here, but the result is the same. The code is below.
import os
import tensorflow_datasets as tfds
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# strategy = tf.distribute.MirroredStrategy()  # NCCL vs RING
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

BUFFER_SIZE = 10000
BATCH_SIZE = 64

def make_datasets_unbatched():
  # Scaling MNIST data from (0, 255] to (0., 1.]
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

  datasets, info = tfds.load(name='mnist',
                             with_info=True,
                             as_supervised=True)
  return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)

def build_and_compile_cnn_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
  ])
  model.compile(
      loss=tf.keras.losses.sparse_categorical_crossentropy,
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
      metrics=['accuracy'])
  return model

GLOBAL_BATCH_SIZE = 64 * 2

with strategy.scope():
  train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE).repeat()
  options = tf.data.Options()
  options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
  train_datasets = train_datasets.with_options(options)
  multi_worker_model = build_and_compile_cnn_model()
  multi_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)
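Note that the snippet above ends up setting AutoShardPolicy.DATA; for completeness, a sketch of turning auto-sharding off entirely, via the same tf.data.Options mechanism, would be:

# Sketch: explicitly disable auto-sharding instead of using DATA sharding.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
train_datasets = train_datasets.with_options(options)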
Describe the expected behavior
- The example trains on multiple GPUs instead of hanging.
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 18 (7 by maintainers)
I was experiencing a similar issue with the MirroredStrategy example, and it turned out to be the TensorBoard callback. If I commented out that callback, the example seemed to work as expected.
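In terms of the tutorial code above, that workaround would look roughly like this (a sketch based on the comment, not verified here; it assumes checkpoint_prefix, decay, PrintLR, and train_dataset from the earlier snippet):

callbacks = [
    # tf.keras.callbacks.TensorBoard(log_dir='./logs'),  # commented out per the workaround above
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]

model.fit(train_dataset, epochs=12, callbacks=callbacks)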