tensorflow: MirroredStrategy() crashes with NVLinked GPUs

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Progress Linux 5+ (engywuck-backports) (Linux Debian Buster)
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.1.0-rc2-17-ge5bf8de 2.1.0
  • Python version: 3.7.3
  • CUDA/cuDNN version: 10.1/7.0
  • GPU model and memory: 2x Asus GeForce RTX 2080 Ti, Compute Capability 7.5, with NVLink

Describe the current behavior

Training a ResNet with NVLink enabled crashes with the following error:

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal:  unhandled cuda error
         [[node Adam/NcclAllReduce (defined at workspace/gpu_tests/test_gpus.py:60) ]]
  (1) Internal:  unhandled cuda error
         [[node Adam/NcclAllReduce (defined at workspace/gpu_tests/test_gpus.py:60) ]]
         [[GroupCrossDeviceControlEdges_0/Adam/Adam/update_1_1/Const/_39]]
0 successful operations.
1 derived errors ignored. [Op:__inference_distributed_function_36247]

Function call stack:
distributed_function -> distributed_function

With cross_device_ops=tf.distribute.ReductionToOneDevice() training does not crash, but performance is suboptimal because NCCL is not used. NCCL itself seems to work, however; see the nccl-tests all_reduce_perf log below.

Describe the expected behavior

Training should not crash.

Code to reproduce the issue

# -*- coding: utf-8 -*-

import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50
import tensorflow as tf
import tensorflow_datasets as tfds

LENGTH_DATASET = 17509
NUM_CLASSES = 9
IMG_SHAPE = (256, 256, 3)
BATCH_SIZE = 32


def mymap_func(features):
    return features["image"], features["label"]


AUTOTUNE = tf.data.experimental.AUTOTUNE

# create input pipeline
dataset = tfds.load(name="deep_weeds", split="train")
dataset = dataset.map(mymap_func,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.cache()
dataset = dataset.shuffle(buffer_size=LENGTH_DATASET, seed=42,
                          reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=BATCH_SIZE, drop_remainder=True).repeat()
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)


# create model
img_width, img_height = 270, 270

shape, classes = (img_width, img_height, 1), 3

# strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice())
strategy = tf.distribute.MirroredStrategy()
print("Number of devices in strategy: {}".format(strategy.num_replicas_in_sync))

with strategy.scope():

    model = ResNet50(include_top=True,
                     weights=None,
                     input_tensor=None,
                     input_shape=IMG_SHAPE,
                     pooling=None,
                     classes=NUM_CLASSES)

    model.compile(optimizer=tf.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy',
                  metrics=["accuracy"])

    train_steps = np.ceil(LENGTH_DATASET / BATCH_SIZE)
    history = model.fit(
        x=dataset,
        epochs=10,
        verbose=1,
        steps_per_epoch=train_steps,
        use_multiprocessing=False,
        workers=8)

Other info / logs

Full TensorFlow dump:

2020-02-06 13:50:44.982897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-02-06 13:50:44.984479: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-02-06 13:50:46.159056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-06 13:50:46.251661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:3b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-02-06 13:50:46.252336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties: 
pciBusID: 0000:af:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-02-06 13:50:46.252374: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-06 13:50:46.252413: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-06 13:50:46.254193: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-06 13:50:46.254548: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-06 13:50:46.256609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-06 13:50:46.257880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-06 13:50:46.257929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-06 13:50:46.260454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-02-06 13:50:46.260872: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-02-06 13:50:46.305692: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2020-02-06 13:50:46.313917: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x43d2220 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-06 13:50:46.313956: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-06 13:50:46.929224: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4360be0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-06 13:50:46.929289: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-02-06 13:50:46.929335: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-02-06 13:50:46.931578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:3b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-02-06 13:50:46.933238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties: 
pciBusID: 0000:af:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-02-06 13:50:46.933319: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-06 13:50:46.933354: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-06 13:50:46.933404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-06 13:50:46.933441: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-06 13:50:46.933477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-06 13:50:46.933514: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-06 13:50:46.933544: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-06 13:50:46.939900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-02-06 13:50:46.939975: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-06 13:50:47.657348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-06 13:50:47.657397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 1 
2020-02-06 13:50:47.657405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N Y 
2020-02-06 13:50:47.657411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1:   Y N 
2020-02-06 13:50:47.659222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10235 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:3b:00.0, compute capability: 7.5)
2020-02-06 13:50:47.660401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10235 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:af:00.0, compute capability: 7.5)
Number of devices in strategy: 2
Train for 548.0 steps
Epoch 1/10
2020-02-06 13:51:06.516702: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-06 13:51:08.552933: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-06 13:51:09.714280: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-02-06 13:51:11.686255: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: unhandled cuda error
2020-02-06 13:51:11.686300: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
         [[{{node Adam/NcclAllReduce}}]]
         [[GroupCrossDeviceControlEdges_0/Adam/Adam/update_1_1/Const/_39]]
2020-02-06 13:51:11.686335: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
         [[{{node Adam/NcclAllReduce}}]]
         [[Identity_2/_60]]
2020-02-06 13:51:11.686381: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
         [[{{node Adam/NcclAllReduce}}]]
2020-02-06 13:51:11.686678: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: unhandled cuda error
  1/548 [..............................] - ETA: 2:54:10Traceback (most recent call last):
  File "workspace/gpu_tests/test_gpus.py", line 60, in <module>
    workers=8)
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 632, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal:  unhandled cuda error
         [[node Adam/NcclAllReduce (defined at workspace/gpu_tests/test_gpus.py:60) ]]
  (1) Internal:  unhandled cuda error
         [[node Adam/NcclAllReduce (defined at workspace/gpu_tests/test_gpus.py:60) ]]
         [[GroupCrossDeviceControlEdges_0/Adam/Adam/update_1_1/Const/_39]]
0 successful operations.
1 derived errors ignored. [Op:__inference_distributed_function_36247]

Function call stack:
distributed_function -> distributed_function

2020-02-06 13:51:12.044366: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-06 13:51:12.045417: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

NVIDIA NCCL Test Dump:

workspace/nccl-tests/build/all_reduce_perf -b 8 -e 128M -g 2   
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   2374 on     tf-run device  0 [0x3b] GeForce RTX 2080 Ti
#   Rank  1 Pid   2374 on     tf-run device  1 [0xaf] GeForce RTX 2080 Ti
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2   float     sum    14.31    0.00    0.00  0e+00    13.72    0.00    0.00  0e+00
     1048584        262146   float     sum    69.45   15.10   15.10  0e+00    69.72   15.04   15.04  0e+00
     2097160        524290   float     sum    111.0   18.89   18.89  0e+00    108.2   19.39   19.39  0e+00
     3145736        786434   float     sum    151.2   20.80   20.80  0e+00    149.1   21.10   21.10  0e+00
     4194312       1048578   float     sum    192.3   21.81   21.81  0e+00    191.3   21.93   21.93  0e+00
     5242888       1310722   float     sum    233.3   22.48   22.48  0e+00    231.2   22.68   22.68  0e+00
     6291464       1572866   float     sum    273.2   23.03   23.03  0e+00    271.2   23.20   23.20  0e+00
     7340040       1835010   float     sum    312.8   23.46   23.46  0e+00    310.1   23.67   23.67  0e+00
     8388616       2097154   float     sum    333.5   25.16   25.16  0e+00    327.8   25.59   25.59  0e+00
     9437192       2359298   float     sum    379.4   24.88   24.88  0e+00    377.8   24.98   24.98  0e+00
    10485768       2621442   float     sum    417.6   25.11   25.11  0e+00    416.1   25.20   25.20  0e+00
    11534344       2883586   float     sum    439.1   26.27   26.27  0e+00    437.6   26.36   26.36  0e+00
    12582920       3145730   float     sum    492.7   25.54   25.54  0e+00    491.0   25.63   25.63  0e+00
    13631496       3407874   float     sum    490.5   27.79   27.79  0e+00    480.3   28.38   28.38  0e+00
    14680072       3670018   float     sum    495.9   29.60   29.60  0e+00    491.4   29.87   29.87  0e+00
    15728648       3932162   float     sum    526.7   29.86   29.86  0e+00    525.1   29.95   29.95  0e+00
    16777224       4194306   float     sum    549.9   30.51   30.51  0e+00    546.8   30.68   30.68  0e+00
    17825800       4456450   float     sum    579.1   30.78   30.78  0e+00    578.4   30.82   30.82  0e+00
    18874376       4718594   float     sum    631.1   29.90   29.90  0e+00    629.7   29.98   29.98  0e+00
    19922952       4980738   float     sum    664.7   29.97   29.97  0e+00    661.5   30.12   30.12  0e+00
    20971528       5242882   float     sum    699.4   29.98   29.98  0e+00    699.0   30.00   30.00  0e+00
    22020104       5505026   float     sum    757.4   29.07   29.07  0e+00    754.8   29.17   29.17  0e+00
    23068680       5767170   float     sum    718.0   32.13   32.13  0e+00    717.8   32.14   32.14  0e+00
    24117256       6029314   float     sum    777.1   31.04   31.04  0e+00    775.5   31.10   31.10  0e+00
    25165832       6291458   float     sum    807.1   31.18   31.18  0e+00    805.2   31.26   31.26  0e+00
    26214408       6553602   float     sum    838.5   31.26   31.26  0e+00    836.8   31.33   31.33  0e+00
    27262984       6815746   float     sum    871.7   31.27   31.27  0e+00    870.8   31.31   31.31  0e+00
    28311560       7077890   float     sum    934.3   30.30   30.30  0e+00    931.6   30.39   30.39  0e+00
    29360136       7340034   float     sum    934.6   31.41   31.41  0e+00    934.2   31.43   31.43  0e+00
    30408712       7602178   float     sum    968.5   31.40   31.40  0e+00    965.5   31.50   31.50  0e+00
    31457288       7864322   float     sum   1035.0   30.39   30.39  0e+00   1032.3   30.47   30.47  0e+00
    32505864       8126466   float     sum   1102.1   29.50   29.50  0e+00   1099.9   29.55   29.55  0e+00
    33554440       8388610   float     sum    963.5   34.83   34.83  0e+00    960.3   34.94   34.94  0e+00
    34603016       8650754   float     sum    989.6   34.97   34.97  0e+00    987.8   35.03   35.03  0e+00
    35651592       8912898   float     sum   1055.1   33.79   33.79  0e+00   1054.7   33.80   33.80  0e+00
    36700168       9175042   float     sum   1163.0   31.56   31.56  0e+00   1158.4   31.68   31.68  0e+00
    37748744       9437186   float     sum   1155.9   32.66   32.66  0e+00   1152.5   32.76   32.76  0e+00
    38797320       9699330   float     sum   1185.6   32.72   32.72  0e+00   1183.4   32.78   32.78  0e+00
    39845896       9961474   float     sum   1261.6   31.58   31.58  0e+00   1259.5   31.64   31.64  0e+00
    40894472      10223618   float     sum   1206.2   33.90   33.90  0e+00   1204.0   33.97   33.97  0e+00
    41943048      10485762   float     sum   1235.5   33.95   33.95  0e+00   1233.4   34.01   34.01  0e+00
    42991624      10747906   float     sum   1310.8   32.80   32.80  0e+00   1307.8   32.87   32.87  0e+00
    44040200      11010050   float     sum   1343.2   32.79   32.79  0e+00   1339.9   32.87   32.87  0e+00
    45088776      11272194   float     sum   1376.5   32.76   32.76  0e+00   1373.3   32.83   32.83  0e+00
    46137352      11534338   float     sum   1406.1   32.81   32.81  0e+00   1403.5   32.87   32.87  0e+00
    47185928      11796482   float     sum   1386.1   34.04   34.04  0e+00   1382.8   34.12   34.12  0e+00
    48234504      12058626   float     sum   1418.1   34.01   34.01  0e+00   1415.0   34.09   34.09  0e+00
    49283080      12320770   float     sum   1498.5   32.89   32.89  0e+00   1494.7   32.97   32.97  0e+00
    50331656      12582914   float     sum   1482.1   33.96   33.96  0e+00   1478.4   34.05   34.05  0e+00
    51380232      12845058   float     sum   1507.4   34.08   34.08  0e+00   1505.2   34.14   34.14  0e+00
    52428808      13107202   float     sum   1536.0   34.13   34.13  0e+00   1534.1   34.18   34.18  0e+00
    53477384      13369346   float     sum   1568.3   34.10   34.10  0e+00   1563.7   34.20   34.20  0e+00
    54525960      13631490   float     sum   1601.1   34.05   34.05  0e+00   1596.8   34.15   34.15  0e+00
    55574536      13893634   float     sum   1691.0   32.87   32.87  0e+00   1687.4   32.93   32.93  0e+00
    56623112      14155778   float     sum   1721.4   32.89   32.89  0e+00   1717.8   32.96   32.96  0e+00
    57671688      14417922   float     sum   1751.2   32.93   32.93  0e+00   1747.9   32.99   32.99  0e+00
    58720264      14680066   float     sum   1716.3   34.21   34.21  0e+00   1714.4   34.25   34.25  0e+00
    59768840      14942210   float     sum   1748.4   34.19   34.19  0e+00   1744.2   34.27   34.27  0e+00
    60817416      15204354   float     sum   1709.5   35.58   35.58  0e+00   1707.4   35.62   35.62  0e+00
    61865992      15466498   float     sum   1803.8   34.30   34.30  0e+00   1799.7   34.38   34.38  0e+00
    62914568      15728642   float     sum   1968.9   31.95   31.95  0e+00   1966.7   31.99   31.99  0e+00
    63963144      15990786   float     sum   2141.1   29.87   29.87  0e+00   2133.0   29.99   29.99  0e+00
    65011720      16252930   float     sum   2173.3   29.91   29.91  0e+00   2168.4   29.98   29.98  0e+00
    66060296      16515074   float     sum   2206.5   29.94   29.94  0e+00   2200.0   30.03   30.03  0e+00
    67108872      16777218   float     sum   1526.3   43.97   43.97  0e+00   1526.4   43.96   43.96  0e+00
    68157448      17039362   float     sum   1547.9   44.03   44.03  0e+00   1549.3   43.99   43.99  0e+00
    69206024      17301506   float     sum   1570.9   44.05   44.05  0e+00   1573.0   44.00   44.00  0e+00
    70254600      17563650   float     sum   1593.1   44.10   44.10  0e+00   1595.0   44.05   44.05  0e+00
    71303176      17825794   float     sum   1773.7   40.20   40.20  0e+00   1770.2   40.28   40.28  0e+00
    72351752      18087938   float     sum   1954.7   37.01   37.01  0e+00   1948.8   37.13   37.13  0e+00
    73400328      18350082   float     sum   2058.5   35.66   35.66  0e+00   2058.1   35.66   35.66  0e+00
    74448904      18612226   float     sum   2005.1   37.13   37.13  0e+00   2003.9   37.15   37.15  0e+00
    75497480      18874370   float     sum   1948.4   38.75   38.75  0e+00   1950.2   38.71   38.71  0e+00
    76546056      19136514   float     sum   1976.9   38.72   38.72  0e+00   1973.3   38.79   38.79  0e+00
    77594632      19398658   float     sum   1999.2   38.81   38.81  0e+00   1999.1   38.81   38.81  0e+00
    78643208      19660802   float     sum   2024.9   38.84   38.84  0e+00   2023.3   38.87   38.87  0e+00
    79691784      19922946   float     sum   2140.6   37.23   37.23  0e+00   2139.8   37.24   37.24  0e+00
    80740360      20185090   float     sum   2167.3   37.25   37.25  0e+00   2166.1   37.28   37.28  0e+00
    81788936      20447234   float     sum   2197.1   37.23   37.23  0e+00   2195.4   37.25   37.25  0e+00
    82837512      20709378   float     sum   2224.3   37.24   37.24  0e+00   2223.9   37.25   37.25  0e+00
    83886088      20971522   float     sum   2075.4   40.42   40.42  0e+00   2076.2   40.40   40.40  0e+00
    84934664      21233666   float     sum   2193.7   38.72   38.72  0e+00   2191.7   38.75   38.75  0e+00
    85983240      21495810   float     sum   2304.1   37.32   37.32  0e+00   2304.0   37.32   37.32  0e+00
    87031816      21757954   float     sum   2336.0   37.26   37.26  0e+00   2332.1   37.32   37.32  0e+00
    88080392      22020098   float     sum   2264.8   38.89   38.89  0e+00   2264.1   38.90   38.90  0e+00
    89128968      22282242   float     sum   2289.7   38.93   38.93  0e+00   2285.5   39.00   39.00  0e+00
    90177544      22544386   float     sum   2322.3   38.83   38.83  0e+00   2319.9   38.87   38.87  0e+00
    91226120      22806530   float     sum   2349.6   38.83   38.83  0e+00   2346.1   38.88   38.88  0e+00
    92274696      23068674   float     sum   2373.6   38.87   38.87  0e+00   2369.7   38.94   38.94  0e+00
    93323272      23330818   float     sum   2499.7   37.33   37.33  0e+00   2498.5   37.35   37.35  0e+00
    94371848      23592962   float     sum   2326.3   40.57   40.57  0e+00   2324.5   40.60   40.60  0e+00
    95420424      23855106   float     sum   2455.4   38.86   38.86  0e+00   2452.3   38.91   38.91  0e+00
    96469000      24117250   float     sum   2478.8   38.92   38.92  0e+00   2478.5   38.92   38.92  0e+00
    97517576      24379394   float     sum   2398.4   40.66   40.66  0e+00   2402.4   40.59   40.59  0e+00
    98566152      24641538   float     sum   2635.1   37.41   37.41  0e+00   2630.0   37.48   37.48  0e+00
    99614728      24903682   float     sum   2769.1   35.97   35.97  0e+00   2766.5   36.01   36.01  0e+00
   100663304      25165826   float     sum   2253.1   44.68   44.68  0e+00   2253.7   44.67   44.67  0e+00
   101711880      25427970   float     sum   2276.9   44.67   44.67  0e+00   2274.8   44.71   44.71  0e+00
   102760456      25690114   float     sum   2411.1   42.62   42.62  0e+00   2410.8   42.62   42.62  0e+00
   103809032      25952258   float     sum   2546.9   40.76   40.76  0e+00   2548.4   40.73   40.73  0e+00
   104857608      26214402   float     sum   2569.5   40.81   40.81  0e+00   2573.8   40.74   40.74  0e+00
   105906184      26476546   float     sum   2486.3   42.60   42.60  0e+00   2483.1   42.65   42.65  0e+00
   106954760      26738690   float     sum   2624.7   40.75   40.75  0e+00   2625.0   40.75   40.75  0e+00
   108003336      27000834   float     sum   2649.5   40.76   40.76  0e+00   2647.9   40.79   40.79  0e+00
   109051912      27262978   float     sum   2553.7   42.70   42.70  0e+00   2553.3   42.71   42.71  0e+00
   110100488      27525122   float     sum   2691.2   40.91   40.91  0e+00   2687.1   40.97   40.97  0e+00
   111149064      27787266   float     sum   2837.8   39.17   39.17  0e+00   2836.9   39.18   39.18  0e+00
   112197640      28049410   float     sum   2506.7   44.76   44.76  0e+00   2508.9   44.72   44.72  0e+00
   113246216      28311554   float     sum   2655.0   42.65   42.65  0e+00   2654.6   42.66   42.66  0e+00
   114294792      28573698   float     sum   2676.8   42.70   42.70  0e+00   2675.6   42.72   42.72  0e+00
   115343368      28835842   float     sum   2697.5   42.76   42.76  0e+00   2689.4   42.89   42.89  0e+00
   116391944      29097986   float     sum   2842.8   40.94   40.94  0e+00   2846.4   40.89   40.89  0e+00
   117440520      29360130   float     sum   2621.1   44.81   44.81  0e+00   2618.9   44.84   44.84  0e+00
   118489096      29622274   float     sum   2777.0   42.67   42.67  0e+00   2774.3   42.71   42.71  0e+00
   119537672      29884418   float     sum   2795.5   42.76   42.76  0e+00   2796.7   42.74   42.74  0e+00
   120586248      30146562   float     sum   2946.1   40.93   40.93  0e+00   2945.9   40.93   40.93  0e+00
   121634824      30408706   float     sum   2712.3   44.85   44.85  0e+00   2714.7   44.81   44.81  0e+00
   122683400      30670850   float     sum   2861.5   42.87   42.87  0e+00   2865.6   42.81   42.81  0e+00
   123731976      30932994   float     sum   2749.9   45.00   45.00  0e+00   2752.6   44.95   44.95  0e+00
   124780552      31195138   float     sum   2778.3   44.91   44.91  0e+00   2779.9   44.89   44.89  0e+00
   125829128      31457282   float     sum   2797.9   44.97   44.97  0e+00   2796.1   45.00   45.00  0e+00
   126877704      31719426   float     sum   2822.8   44.95   44.95  0e+00   2823.7   44.93   44.93  0e+00
   127926280      31981570   float     sum   2838.3   45.07   45.07  0e+00   2845.2   44.96   44.96  0e+00
   128974856      32243714   float     sum   2862.4   45.06   45.06  0e+00   2864.9   45.02   45.02  0e+00
   130023432      32505858   float     sum   2887.1   45.04   45.04  0e+00   2891.1   44.97   44.97  0e+00
   131072008      32768002   float     sum   2907.2   45.08   45.08  0e+00   2913.2   44.99   44.99  0e+00
   132120584      33030146   float     sum   2931.8   45.07   45.07  0e+00   2937.5   44.98   44.98  0e+00
   133169160      33292290   float     sum   2955.3   45.06   45.06  0e+00   2959.4   45.00   45.00  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 35.5133 
#
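A note for reading the table above: algbw and busbw are identical in every row. As far as I understand the nccl-tests performance notes, algbw is simply size divided by time, and for all-reduce busbw is algbw scaled by 2*(n-1)/n, which equals 1 for n = 2 GPUs. The sketch below recomputes the last row under that assumption; the function name is made up for illustration and is not part of nccl-tests.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    Assumes the nccl-tests convention: algbw = size / time and
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks.
    """
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Last out-of-place row of the dump: 133169160 bytes in 2955.3 us on 2 GPUs.
algbw, busbw = allreduce_bandwidth(133169160, 2955.3, n_ranks=2)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
# prints: algbw=45.06 GB/s busbw=45.06 GB/s
```

The recomputed value matches the reported 45.06 GB/s, so the benchmark itself looks healthy, which fits the observation that NCCL works outside TensorFlow.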

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (3 by maintainers)

Most upvoted comments

@Yannik1337 I was facing the same issue, but the following worked for multiple GPUs: strategy = tf.distribute.MirroredStrategy(devices=gpu_devices_list, cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()).

In my case, setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true let the model train without crashing.
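If you prefer to set this from inside the script rather than in the shell, a minimal sketch follows. It assumes the variables have to be in place before TensorFlow creates its GPU devices, so the safe option is to set them before the import. The device-creation lines in the dump show TensorFlow reserving 10235 MB on each 10.76 GiB card, which would be consistent with little memory being left for NCCL's own buffers; that reading is my inference and is not confirmed in this thread.

```python
import os

# Ask TensorFlow to grow its GPU allocation on demand instead of
# reserving almost all device memory up front.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Optional: make NCCL log its initialization steps and warnings.
os.environ["NCCL_DEBUG"] = "INFO"

# Import TensorFlow only after the variables are set, for example:
# import tensorflow as tf
# strategy = tf.distribute.MirroredStrategy()
```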

The log with NCCL_DEBUG=INFO suggests the GPU is out of memory.

I ran it with NCCL_DEBUG=INFO, which gives the following extra information:

tf-run:2950:3375 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,ffff0000
tf-run:2950:3374 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
tf-run:2950:3374 [0] NCCL INFO Channel 00 :    0   1
tf-run:2950:3374 [0] NCCL INFO Channel 01 :    0   1
tf-run:2950:3374 [0] NCCL INFO Channel 02 :    0   1
tf-run:2950:3374 [0] NCCL INFO Channel 03 :    0   1
tf-run:2950:3374 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
tf-run:2950:3375 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
tf-run:2950:3374 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/direct pointer
tf-run:2950:3375 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/direct pointer

tf-run:2950:3374 [0] bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/alloc.h:40 NCCL WARN Cuda failure 'out of memory'
tf-run:2950:3374 [0] NCCL INFO external/nccl_archive/src/transport/p2p.cc:521 -> 1
tf-run:2950:3374 [0] NCCL INFO external/nccl_archive/src/init.cc:339 -> 1
tf-run:2950:3374 [0] NCCL INFO external/nccl_archive/src/init.cc:649 -> 1
tf-run:2950:3374 [0] NCCL INFO external/nccl_archive/src/init.cc:814 -> 1
tf-run:2950:3374 [0] NCCL INFO external/nccl_archive/src/init.cc:950 -> 1
tf-run:2950:3374 [0] NCCL INFO external/nccl_archive/src/misc/group.cc:69 -> 1 [Async thread]
2020-02-06 14:10:45.228124: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: unhandled cuda error
2020-02-06 14:10:45.228167: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
         [[{{node Adam/NcclAllReduce}}]]
         [[Identity_2/_60]]
2020-02-06 14:10:45.228204: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
         [[{{node Adam/NcclAllReduce}}]]
         [[GroupCrossDeviceControlEdges_0/Adam/Adam/update_1_1/Const/_39]]
2020-02-06 14:10:45.228228: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
         [[{{node Adam/NcclAllReduce}}]]
2020-02-06 14:10:45.228328: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: unhandled cuda error

tf-run:2950:3375 [1] bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/alloc.h:40 NCCL WARN Cuda failure 'out of memory'
tf-run:2950:3375 [1] NCCL INFO external/nccl_archive/src/transport/p2p.cc:521 -> 1
tf-run:2950:3375 [1] NCCL INFO external/nccl_archive/src/init.cc:339 -> 1
tf-run:2950:3375 [1] NCCL INFO external/nccl_archive/src/init.cc:649 -> 1
tf-run:2950:3375 [1] NCCL INFO external/nccl_archive/src/init.cc:814 -> 1
tf-run:2950:3375 [1] NCCL INFO external/nccl_archive/src/init.cc:950 -> 1
tf-run:2950:3375 [1] NCCL INFO external/nccl_archive/src/misc/group.cc:69 -> 1 [Async thread]
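The TensorFlow message "unhandled cuda error" hides the actual cause, which only shows up in the NCCL WARN lines. Below is a rough sketch for pulling those lines out of a long log. The regular expression is an assumption derived from the lines shown above (host:pid:tid [rank] source-location NCCL WARN message) and may need adjusting for other NCCL versions; the function name is made up for illustration.

```python
import re

# Assumed line shape, taken from the log above:
# host:pid:tid [rank] path/to/file.h:line NCCL WARN message
NCCL_WARN = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<rank>\d+)\] "
    r"(?P<location>\S+) NCCL WARN (?P<message>.*)$"
)


def nccl_warnings(log_text):
    """Return (rank, message) for every NCCL WARN line in log_text."""
    found = []
    for line in log_text.splitlines():
        match = NCCL_WARN.match(line.strip())
        if match:
            found.append((int(match.group("rank")), match.group("message")))
    return found


sample = """\
tf-run:2950:3374 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/direct pointer
tf-run:2950:3374 [0] bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/alloc.h:40 NCCL WARN Cuda failure 'out of memory'
tf-run:2950:3374 [0] NCCL INFO external/nccl_archive/src/transport/p2p.cc:521 -> 1
tf-run:2950:3375 [1] bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/alloc.h:40 NCCL WARN Cuda failure 'out of memory'
"""

for rank, message in nccl_warnings(sample):
    print("rank", rank, "->", message)
```

On the excerpt above this reports the same out-of-memory failure for rank 0 and rank 1, raised in alloc.h while the P2P transport was being set up.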