tensorflow: MirroredStrategy() crashes with NVLinked GPUs
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Progress Linux 5+ (engywuck-backports) (Linux Debian Buster)
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v2.1.0-rc2-17-ge5bf8de 2.1.0
- Python version: 3.7.3
- CUDA/cuDNN version: 10.1/7.0
- GPU model and memory: 2x Asus GeForce RTX 2080 Ti, Compute Capability 7.5, with NVLink
Describe the current behavior Training a ResNet with NVLink enabled crashes with the following error:
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: unhandled cuda error
[[node Adam/NcclAllReduce (defined at workspace/gpu_tests/test_gpus.py:60) ]]
(1) Internal: unhandled cuda error
[[node Adam/NcclAllReduce (defined at workspace/gpu_tests/test_gpus.py:60) ]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_1_1/Const/_39]]
0 successful operations.
1 derived errors ignored. [Op:__inference_distributed_function_36247]
Function call stack:
distributed_function -> distributed_function
When I use cross_device_ops=tf.distribute.ReductionToOneDevice() it doesn't crash, but performance is suboptimal since NCCL is not used. NCCL itself seems to work, however; see the NCCL/all_reduce_perf log below.
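The ReductionToOneDevice workaround can be sketched as below; this is a minimal illustration, and on a machine without multiple GPUs the strategy simply reports one replica:

```python
import tensorflow as tf

# Workaround: route the gradient reduction through a single device
# instead of NcclAllReduce. Slower than NCCL, but avoids the crash.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice())
print("Number of devices in strategy:", strategy.num_replicas_in_sync)
```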
Describe the expected behavior Training should not crash.
Code to reproduce the issue
# -*- coding: utf-8 -*-
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50
import tensorflow as tf
import tensorflow_datasets as tfds

LENGTH_DATASET = 17509
NUM_CLASSES = 9
IMG_SHAPE = (256, 256, 3)
BATCH_SIZE = 32


def mymap_func(features):
    return features["image"], features["label"]


AUTOTUNE = tf.data.experimental.AUTOTUNE

# create input pipeline
dataset = tfds.load(name="deep_weeds", split="train")
dataset = dataset.map(mymap_func,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.cache()
dataset = dataset.shuffle(buffer_size=LENGTH_DATASET, seed=42,
                          reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=BATCH_SIZE, drop_remainder=True).repeat()
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

# create model
img_width, img_height = 270, 270
shape, classes = (img_width, img_height, 1), 3
# strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice())
strategy = tf.distribute.MirroredStrategy()
print("Number of devices in strategy: {}".format(strategy.num_replicas_in_sync))
with strategy.scope():
    model = ResNet50(include_top=True,
                     weights=None,
                     input_tensor=None,
                     input_shape=IMG_SHAPE,
                     pooling=None,
                     classes=NUM_CLASSES)
    model.compile(optimizer=tf.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy',
                  metrics=["accuracy"])

train_steps = np.ceil(LENGTH_DATASET / BATCH_SIZE)
history = model.fit(
    x=dataset,
    epochs=10,
    verbose=1,
    steps_per_epoch=train_steps,
    use_multiprocessing=False,
    workers=8)
Other info / logs Full Tensorflow Dump:
2020-02-06 13:50:44.982897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-02-06 13:50:44.984479: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-02-06 13:50:46.159056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-06 13:50:46.251661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:3b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-02-06 13:50:46.252336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:af:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-02-06 13:50:46.252374: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-06 13:50:46.252413: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-06 13:50:46.254193: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-06 13:50:46.254548: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-06 13:50:46.256609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-06 13:50:46.257880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-06 13:50:46.257929: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-06 13:50:46.260454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-02-06 13:50:46.260872: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-02-06 13:50:46.305692: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2020-02-06 13:50:46.313917: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x43d2220 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-06 13:50:46.313956: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-02-06 13:50:46.929224: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4360be0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-06 13:50:46.929289: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-02-06 13:50:46.929335: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-02-06 13:50:46.931578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:3b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-02-06 13:50:46.933238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:af:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-02-06 13:50:46.933319: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-06 13:50:46.933354: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-06 13:50:46.933404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-06 13:50:46.933441: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-06 13:50:46.933477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-06 13:50:46.933514: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-06 13:50:46.933544: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-06 13:50:46.939900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
2020-02-06 13:50:46.939975: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-06 13:50:47.657348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-06 13:50:47.657397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0 1
2020-02-06 13:50:47.657405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N Y
2020-02-06 13:50:47.657411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 1: Y N
2020-02-06 13:50:47.659222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10235 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:3b:00.0, compute capability: 7.5)
2020-02-06 13:50:47.660401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10235 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:af:00.0, compute capability: 7.5)
Number of devices in strategy: 2
Train for 548.0 steps
Epoch 1/10
2020-02-06 13:51:06.516702: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-06 13:51:08.552933: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-06 13:51:09.714280: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-02-06 13:51:11.686255: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: unhandled cuda error
2020-02-06 13:51:11.686300: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
[[{{node Adam/NcclAllReduce}}]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_1_1/Const/_39]]
2020-02-06 13:51:11.686335: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
[[{{node Adam/NcclAllReduce}}]]
[[Identity_2/_60]]
2020-02-06 13:51:11.686381: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: unhandled cuda error
[[{{node Adam/NcclAllReduce}}]]
2020-02-06 13:51:11.686678: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: unhandled cuda error
1/548 [..............................] - ETA: 2:54:10Traceback (most recent call last):
File "workspace/gpu_tests/test_gpus.py", line 60, in <module>
workers=8)
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
total_epochs=epochs)
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
result = self._call(*args, **kwds)
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 632, in _call
return self._stateless_fn(*args, **kwds)
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
self.captured_inputs)
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/home/sam2/workspace/python_venvs/tf-run/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: unhandled cuda error
[[node Adam/NcclAllReduce (defined at workspace/gpu_tests/test_gpus.py:60) ]]
(1) Internal: unhandled cuda error
[[node Adam/NcclAllReduce (defined at workspace/gpu_tests/test_gpus.py:60) ]]
[[GroupCrossDeviceControlEdges_0/Adam/Adam/update_1_1/Const/_39]]
0 successful operations.
1 derived errors ignored. [Op:__inference_distributed_function_36247]
Function call stack:
distributed_function -> distributed_function
2020-02-06 13:51:12.044366: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-02-06 13:51:12.045417: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
NVIDIA NCCL Test Dump:
workspace/nccl-tests/build/all_reduce_perf -b 8 -e 128M -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 2374 on tf-run device 0 [0x3b] GeForce RTX 2080 Ti
# Rank 1 Pid 2374 on tf-run device 1 [0xaf] GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 14.31 0.00 0.00 0e+00 13.72 0.00 0.00 0e+00
1048584 262146 float sum 69.45 15.10 15.10 0e+00 69.72 15.04 15.04 0e+00
2097160 524290 float sum 111.0 18.89 18.89 0e+00 108.2 19.39 19.39 0e+00
3145736 786434 float sum 151.2 20.80 20.80 0e+00 149.1 21.10 21.10 0e+00
4194312 1048578 float sum 192.3 21.81 21.81 0e+00 191.3 21.93 21.93 0e+00
5242888 1310722 float sum 233.3 22.48 22.48 0e+00 231.2 22.68 22.68 0e+00
6291464 1572866 float sum 273.2 23.03 23.03 0e+00 271.2 23.20 23.20 0e+00
7340040 1835010 float sum 312.8 23.46 23.46 0e+00 310.1 23.67 23.67 0e+00
8388616 2097154 float sum 333.5 25.16 25.16 0e+00 327.8 25.59 25.59 0e+00
9437192 2359298 float sum 379.4 24.88 24.88 0e+00 377.8 24.98 24.98 0e+00
10485768 2621442 float sum 417.6 25.11 25.11 0e+00 416.1 25.20 25.20 0e+00
11534344 2883586 float sum 439.1 26.27 26.27 0e+00 437.6 26.36 26.36 0e+00
12582920 3145730 float sum 492.7 25.54 25.54 0e+00 491.0 25.63 25.63 0e+00
13631496 3407874 float sum 490.5 27.79 27.79 0e+00 480.3 28.38 28.38 0e+00
14680072 3670018 float sum 495.9 29.60 29.60 0e+00 491.4 29.87 29.87 0e+00
15728648 3932162 float sum 526.7 29.86 29.86 0e+00 525.1 29.95 29.95 0e+00
16777224 4194306 float sum 549.9 30.51 30.51 0e+00 546.8 30.68 30.68 0e+00
17825800 4456450 float sum 579.1 30.78 30.78 0e+00 578.4 30.82 30.82 0e+00
18874376 4718594 float sum 631.1 29.90 29.90 0e+00 629.7 29.98 29.98 0e+00
19922952 4980738 float sum 664.7 29.97 29.97 0e+00 661.5 30.12 30.12 0e+00
20971528 5242882 float sum 699.4 29.98 29.98 0e+00 699.0 30.00 30.00 0e+00
22020104 5505026 float sum 757.4 29.07 29.07 0e+00 754.8 29.17 29.17 0e+00
23068680 5767170 float sum 718.0 32.13 32.13 0e+00 717.8 32.14 32.14 0e+00
24117256 6029314 float sum 777.1 31.04 31.04 0e+00 775.5 31.10 31.10 0e+00
25165832 6291458 float sum 807.1 31.18 31.18 0e+00 805.2 31.26 31.26 0e+00
26214408 6553602 float sum 838.5 31.26 31.26 0e+00 836.8 31.33 31.33 0e+00
27262984 6815746 float sum 871.7 31.27 31.27 0e+00 870.8 31.31 31.31 0e+00
28311560 7077890 float sum 934.3 30.30 30.30 0e+00 931.6 30.39 30.39 0e+00
29360136 7340034 float sum 934.6 31.41 31.41 0e+00 934.2 31.43 31.43 0e+00
30408712 7602178 float sum 968.5 31.40 31.40 0e+00 965.5 31.50 31.50 0e+00
31457288 7864322 float sum 1035.0 30.39 30.39 0e+00 1032.3 30.47 30.47 0e+00
32505864 8126466 float sum 1102.1 29.50 29.50 0e+00 1099.9 29.55 29.55 0e+00
33554440 8388610 float sum 963.5 34.83 34.83 0e+00 960.3 34.94 34.94 0e+00
34603016 8650754 float sum 989.6 34.97 34.97 0e+00 987.8 35.03 35.03 0e+00
35651592 8912898 float sum 1055.1 33.79 33.79 0e+00 1054.7 33.80 33.80 0e+00
36700168 9175042 float sum 1163.0 31.56 31.56 0e+00 1158.4 31.68 31.68 0e+00
37748744 9437186 float sum 1155.9 32.66 32.66 0e+00 1152.5 32.76 32.76 0e+00
38797320 9699330 float sum 1185.6 32.72 32.72 0e+00 1183.4 32.78 32.78 0e+00
39845896 9961474 float sum 1261.6 31.58 31.58 0e+00 1259.5 31.64 31.64 0e+00
40894472 10223618 float sum 1206.2 33.90 33.90 0e+00 1204.0 33.97 33.97 0e+00
41943048 10485762 float sum 1235.5 33.95 33.95 0e+00 1233.4 34.01 34.01 0e+00
42991624 10747906 float sum 1310.8 32.80 32.80 0e+00 1307.8 32.87 32.87 0e+00
44040200 11010050 float sum 1343.2 32.79 32.79 0e+00 1339.9 32.87 32.87 0e+00
45088776 11272194 float sum 1376.5 32.76 32.76 0e+00 1373.3 32.83 32.83 0e+00
46137352 11534338 float sum 1406.1 32.81 32.81 0e+00 1403.5 32.87 32.87 0e+00
47185928 11796482 float sum 1386.1 34.04 34.04 0e+00 1382.8 34.12 34.12 0e+00
48234504 12058626 float sum 1418.1 34.01 34.01 0e+00 1415.0 34.09 34.09 0e+00
49283080 12320770 float sum 1498.5 32.89 32.89 0e+00 1494.7 32.97 32.97 0e+00
50331656 12582914 float sum 1482.1 33.96 33.96 0e+00 1478.4 34.05 34.05 0e+00
51380232 12845058 float sum 1507.4 34.08 34.08 0e+00 1505.2 34.14 34.14 0e+00
52428808 13107202 float sum 1536.0 34.13 34.13 0e+00 1534.1 34.18 34.18 0e+00
53477384 13369346 float sum 1568.3 34.10 34.10 0e+00 1563.7 34.20 34.20 0e+00
54525960 13631490 float sum 1601.1 34.05 34.05 0e+00 1596.8 34.15 34.15 0e+00
55574536 13893634 float sum 1691.0 32.87 32.87 0e+00 1687.4 32.93 32.93 0e+00
56623112 14155778 float sum 1721.4 32.89 32.89 0e+00 1717.8 32.96 32.96 0e+00
57671688 14417922 float sum 1751.2 32.93 32.93 0e+00 1747.9 32.99 32.99 0e+00
58720264 14680066 float sum 1716.3 34.21 34.21 0e+00 1714.4 34.25 34.25 0e+00
59768840 14942210 float sum 1748.4 34.19 34.19 0e+00 1744.2 34.27 34.27 0e+00
60817416 15204354 float sum 1709.5 35.58 35.58 0e+00 1707.4 35.62 35.62 0e+00
61865992 15466498 float sum 1803.8 34.30 34.30 0e+00 1799.7 34.38 34.38 0e+00
62914568 15728642 float sum 1968.9 31.95 31.95 0e+00 1966.7 31.99 31.99 0e+00
63963144 15990786 float sum 2141.1 29.87 29.87 0e+00 2133.0 29.99 29.99 0e+00
65011720 16252930 float sum 2173.3 29.91 29.91 0e+00 2168.4 29.98 29.98 0e+00
66060296 16515074 float sum 2206.5 29.94 29.94 0e+00 2200.0 30.03 30.03 0e+00
67108872 16777218 float sum 1526.3 43.97 43.97 0e+00 1526.4 43.96 43.96 0e+00
68157448 17039362 float sum 1547.9 44.03 44.03 0e+00 1549.3 43.99 43.99 0e+00
69206024 17301506 float sum 1570.9 44.05 44.05 0e+00 1573.0 44.00 44.00 0e+00
70254600 17563650 float sum 1593.1 44.10 44.10 0e+00 1595.0 44.05 44.05 0e+00
71303176 17825794 float sum 1773.7 40.20 40.20 0e+00 1770.2 40.28 40.28 0e+00
72351752 18087938 float sum 1954.7 37.01 37.01 0e+00 1948.8 37.13 37.13 0e+00
73400328 18350082 float sum 2058.5 35.66 35.66 0e+00 2058.1 35.66 35.66 0e+00
74448904 18612226 float sum 2005.1 37.13 37.13 0e+00 2003.9 37.15 37.15 0e+00
75497480 18874370 float sum 1948.4 38.75 38.75 0e+00 1950.2 38.71 38.71 0e+00
76546056 19136514 float sum 1976.9 38.72 38.72 0e+00 1973.3 38.79 38.79 0e+00
77594632 19398658 float sum 1999.2 38.81 38.81 0e+00 1999.1 38.81 38.81 0e+00
78643208 19660802 float sum 2024.9 38.84 38.84 0e+00 2023.3 38.87 38.87 0e+00
79691784 19922946 float sum 2140.6 37.23 37.23 0e+00 2139.8 37.24 37.24 0e+00
80740360 20185090 float sum 2167.3 37.25 37.25 0e+00 2166.1 37.28 37.28 0e+00
81788936 20447234 float sum 2197.1 37.23 37.23 0e+00 2195.4 37.25 37.25 0e+00
82837512 20709378 float sum 2224.3 37.24 37.24 0e+00 2223.9 37.25 37.25 0e+00
83886088 20971522 float sum 2075.4 40.42 40.42 0e+00 2076.2 40.40 40.40 0e+00
84934664 21233666 float sum 2193.7 38.72 38.72 0e+00 2191.7 38.75 38.75 0e+00
85983240 21495810 float sum 2304.1 37.32 37.32 0e+00 2304.0 37.32 37.32 0e+00
87031816 21757954 float sum 2336.0 37.26 37.26 0e+00 2332.1 37.32 37.32 0e+00
88080392 22020098 float sum 2264.8 38.89 38.89 0e+00 2264.1 38.90 38.90 0e+00
89128968 22282242 float sum 2289.7 38.93 38.93 0e+00 2285.5 39.00 39.00 0e+00
90177544 22544386 float sum 2322.3 38.83 38.83 0e+00 2319.9 38.87 38.87 0e+00
91226120 22806530 float sum 2349.6 38.83 38.83 0e+00 2346.1 38.88 38.88 0e+00
92274696 23068674 float sum 2373.6 38.87 38.87 0e+00 2369.7 38.94 38.94 0e+00
93323272 23330818 float sum 2499.7 37.33 37.33 0e+00 2498.5 37.35 37.35 0e+00
94371848 23592962 float sum 2326.3 40.57 40.57 0e+00 2324.5 40.60 40.60 0e+00
95420424 23855106 float sum 2455.4 38.86 38.86 0e+00 2452.3 38.91 38.91 0e+00
96469000 24117250 float sum 2478.8 38.92 38.92 0e+00 2478.5 38.92 38.92 0e+00
97517576 24379394 float sum 2398.4 40.66 40.66 0e+00 2402.4 40.59 40.59 0e+00
98566152 24641538 float sum 2635.1 37.41 37.41 0e+00 2630.0 37.48 37.48 0e+00
99614728 24903682 float sum 2769.1 35.97 35.97 0e+00 2766.5 36.01 36.01 0e+00
100663304 25165826 float sum 2253.1 44.68 44.68 0e+00 2253.7 44.67 44.67 0e+00
101711880 25427970 float sum 2276.9 44.67 44.67 0e+00 2274.8 44.71 44.71 0e+00
102760456 25690114 float sum 2411.1 42.62 42.62 0e+00 2410.8 42.62 42.62 0e+00
103809032 25952258 float sum 2546.9 40.76 40.76 0e+00 2548.4 40.73 40.73 0e+00
104857608 26214402 float sum 2569.5 40.81 40.81 0e+00 2573.8 40.74 40.74 0e+00
105906184 26476546 float sum 2486.3 42.60 42.60 0e+00 2483.1 42.65 42.65 0e+00
106954760 26738690 float sum 2624.7 40.75 40.75 0e+00 2625.0 40.75 40.75 0e+00
108003336 27000834 float sum 2649.5 40.76 40.76 0e+00 2647.9 40.79 40.79 0e+00
109051912 27262978 float sum 2553.7 42.70 42.70 0e+00 2553.3 42.71 42.71 0e+00
110100488 27525122 float sum 2691.2 40.91 40.91 0e+00 2687.1 40.97 40.97 0e+00
111149064 27787266 float sum 2837.8 39.17 39.17 0e+00 2836.9 39.18 39.18 0e+00
112197640 28049410 float sum 2506.7 44.76 44.76 0e+00 2508.9 44.72 44.72 0e+00
113246216 28311554 float sum 2655.0 42.65 42.65 0e+00 2654.6 42.66 42.66 0e+00
114294792 28573698 float sum 2676.8 42.70 42.70 0e+00 2675.6 42.72 42.72 0e+00
115343368 28835842 float sum 2697.5 42.76 42.76 0e+00 2689.4 42.89 42.89 0e+00
116391944 29097986 float sum 2842.8 40.94 40.94 0e+00 2846.4 40.89 40.89 0e+00
117440520 29360130 float sum 2621.1 44.81 44.81 0e+00 2618.9 44.84 44.84 0e+00
118489096 29622274 float sum 2777.0 42.67 42.67 0e+00 2774.3 42.71 42.71 0e+00
119537672 29884418 float sum 2795.5 42.76 42.76 0e+00 2796.7 42.74 42.74 0e+00
120586248 30146562 float sum 2946.1 40.93 40.93 0e+00 2945.9 40.93 40.93 0e+00
121634824 30408706 float sum 2712.3 44.85 44.85 0e+00 2714.7 44.81 44.81 0e+00
122683400 30670850 float sum 2861.5 42.87 42.87 0e+00 2865.6 42.81 42.81 0e+00
123731976 30932994 float sum 2749.9 45.00 45.00 0e+00 2752.6 44.95 44.95 0e+00
124780552 31195138 float sum 2778.3 44.91 44.91 0e+00 2779.9 44.89 44.89 0e+00
125829128 31457282 float sum 2797.9 44.97 44.97 0e+00 2796.1 45.00 45.00 0e+00
126877704 31719426 float sum 2822.8 44.95 44.95 0e+00 2823.7 44.93 44.93 0e+00
127926280 31981570 float sum 2838.3 45.07 45.07 0e+00 2845.2 44.96 44.96 0e+00
128974856 32243714 float sum 2862.4 45.06 45.06 0e+00 2864.9 45.02 45.02 0e+00
130023432 32505858 float sum 2887.1 45.04 45.04 0e+00 2891.1 44.97 44.97 0e+00
131072008 32768002 float sum 2907.2 45.08 45.08 0e+00 2913.2 44.99 44.99 0e+00
132120584 33030146 float sum 2931.8 45.07 45.07 0e+00 2937.5 44.98 44.98 0e+00
133169160 33292290 float sum 2955.3 45.06 45.06 0e+00 2959.4 45.00 45.00 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 35.5133
#
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 17 (3 by maintainers)
@Yannik1337 I was facing the same issue, but the following worked for multiple GPUs:
strategy = tf.distribute.MirroredStrategy(devices=gpu_devices_list, cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

In my case, setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true lets the model train without crashing.
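The allow-growth workaround mentioned above is just an environment variable; note that it must be set before TensorFlow initializes its GPU devices (i.e. before the first tensorflow import):

```python
import os

# Workaround mentioned above: let TensorFlow allocate GPU memory on
# demand instead of reserving nearly all of it up front, which appears
# to leave enough free memory for NCCL. Must be set before TensorFlow
# initializes its GPU devices.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
```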
The log with NCCL_DEBUG=INFO suggests it's out of memory.
I let it run with NCCL_DEBUG=INFO; it gives the following extra info: