tensorflow: TensorFlow 2.13 distributed training fails

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

2.13.0

Custom code

No

OS platform and distribution

Linux Ubuntu 20.04.3

Mobile device

Linux Ubuntu 20.04.3

Python version

3.8.10

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

CUDA 11.7, cuDNN 8.6

GPU model and memory

3x NVIDIA GeForce RTX 3090

Current behavior?

When running multiple distributed trainings one after another, one of them fails with a "Collective ops is aborted by: ..." error.

The reproducer attached to this issue produces the following error:

Collective ops is aborted by: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
The error could be from a previous operation. Restart your program to reset.
	 [[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_5585]

When run with TF 2.12, there is no such error.

The original code in which I encountered this problem results in

E                                           Collective ops is aborted by: Shape mismatch in the collective instance 100. Op at device /job:localhost/replica:0/task:0/device:GPU:1 expected shape [517169] but another member in the group expected shape [516734]. This is likely due to different input shapes at different members of the collective op.
E                                           The error could be from a previous operation. Restart your program to reset.
E                                           	 [[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_49105]

but I wasn’t able to reproduce this with a small code snippet.

Standalone code to reproduce the issue

import pytest
import tensorflow as tf
import tensorflow_datasets as tfds


@pytest.mark.parametrize("devices", [1, 3, 2])
def test_distributed_fit(devices):
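    # Note: the parametrize order above matters. With [1, 3, 2] the devices=2
    # case fails with the collective-ops error, while [1, 2, 3] reportedly
    # hangs instead (see the comments below).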
    datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
    mnist_train, mnist_test = datasets['train'], datasets['test']

    if devices == 1:
        strategy = tf.distribute.OneDeviceStrategy("/gpu:0")
    else:
        strategy = tf.distribute.MirroredStrategy([f"/gpu:{i}" for i in range(devices)])

    batch_size = 64 * strategy.num_replicas_in_sync
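    # mnist_test (rather than mnist_train) is used intentionally, to slightly
    # speed up reproduction (see the comments below).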
    train_dataset = mnist_test.cache().shuffle(10000).batch(batch_size)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10)
        ])

        model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      optimizer=tf.keras.optimizers.Adam(),
                      metrics=['accuracy'])

    model.fit(train_dataset, epochs=1)


if __name__ == '__main__':
    test_distributed_fit(1)
    test_distributed_fit(3)
    test_distributed_fit(2)

Relevant log output

/home/nsavel/venvs/nncf_tf_213/bin/python /home/nsavel/workspace/nncf_tf_213/reproducer.py 
2023-07-18 16:47:21.693862: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-18 16:47:21.722428: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:7630] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-07-18 16:47:21.722456: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-07-18 16:47:21.722481: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-07-18 16:47:21.728124: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-18 16:47:22.211027: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2023-07-18 16:47:24.321508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22292 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:17:00.0, compute capability: 8.6
2023-07-18 16:47:24.322042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22292 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:65:00.0, compute capability: 8.6
2023-07-18 16:47:24.322425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22292 MB memory:  -> device: 2, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:b3:00.0, compute capability: 8.6
2023-07-18 16:47:24.602273: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:25.946425: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcf358b4470 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-18 16:47:25.946450: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.946455: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.946458: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.950178: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-18 16:47:26.074588: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
2023-07-18 16:47:26.171621: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
157/157 [==============================] - 2s 5ms/step - loss: 25.9054 - accuracy: 0.6873
2023-07-18 16:47:27.474184: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:30.690312: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
2023-07-18 16:47:30.822607: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
53/53 [==============================] - 3s 7ms/step - loss: 43.9234 - accuracy: 0.5655
2023-07-18 16:47:31.372876: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:32.398894: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
2023-07-18 16:47:32.398950: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7416489994643074752
2023-07-18 16:47:32.399024: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 1224112818691547746
2023-07-18 16:47:32.399044: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10338356286700713842
2023-07-18 16:47:32.399063: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6809993284794892577
2023-07-18 16:47:32.399081: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 12460047264292639245
2023-07-18 16:47:32.399097: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8051515006773529005
Traceback (most recent call last):
  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
    test_distributed_fit(2)
  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
    model.fit(train_dataset, epochs=1)
  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node CollectiveReduceV2 defined at (most recent call last):
  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
    test_distributed_fit(2)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
    test_distributed_fit(2)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
    model.fit(train_dataset, epochs=1)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
    test_distributed_fit(2)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
    model.fit(train_dataset, epochs=1)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
    test_distributed_fit(2)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
    model.fit(train_dataset, epochs=1)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
    tmp_logs = self.train_function(iterator)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
    test_distributed_fit(2)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
    model.fit(train_dataset, epochs=1)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
    tmp_logs = self.train_function(iterator)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
    return step_function(self, iterator)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
    test_distributed_fit(2)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
    model.fit(train_dataset, epochs=1)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
    tmp_logs = self.train_function(iterator)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
    return step_function(self, iterator)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1359, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
    test_distributed_fit(2)

  File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
    model.fit(train_dataset, epochs=1)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
    return fn(*args, **kwargs)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
    tmp_logs = self.train_function(iterator)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
    return step_function(self, iterator)

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1359, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))

  File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/optimizers/utils.py", line 175, in _all_reduce_sum_fn
    return distribution.extended.batch_reduce_to(

Collective ops is aborted by: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
The error could be from a previous operation. Restart your program to reset.
	 [[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_5585]

Process finished with exit code 1

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 3
  • Comments: 15 (4 by maintainers)

Most upvoted comments

Upgrading the NVIDIA driver to version >= 545 should address the issue.

Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size 2, but that group has size 3 (group_key=1) means that when you run the function with 2 GPUs, a collective op from the previous function call with 3 GPUs might still be pending.

Generally it’s not a good idea to create multiple tf.distribute.Strategy instances in sequence in a production job, as they will share the same collective keys, which is very likely to cause arbitrary collisions between the all-reduces. For this case, try resetting the context at the beginning of each test case. Example: https://github.com/tensorflow/tensorflow/blob/2a7efd891d3b16ef82b462d76fd9e61d111bf901/tensorflow/python/distribute/mirrored_strategy_test.py#L355
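A minimal sketch of how that reset could be wired into the reproducer above (e.g. in the test module or a conftest.py), assuming the private helper tensorflow.python.eager.context._reset_context() behaves as it does in TensorFlow's own distribute tests; it is not a public API and may change between releases:

import pytest
from tensorflow.python.eager import context


@pytest.fixture(autouse=True)
def reset_tf_context():
    # Reset the eager context before each test case so that no collective
    # state (group/instance keys) survives from a strategy created with a
    # different number of devices. _reset_context() is a private helper used
    # in TensorFlow's own tests and may change between releases.
    context._reset_context()
    yield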

@SuryanarayanaY Thanks for reaching out! I used mnist_test intentionally to slightly speed up the reproduction.

I agree with your results. For me, when the order of devices is set to 1, 2, 3, the devices=2 case also hangs as you describe. For the order 1, 3, 2, the devices=2 case produces the error I attached to the ticket.

Since the machine you run the code on has 4 GPUs, I would expect that setting the order to something like 1, 4, 3, 2 would also lead to the error I attached.

In any case, I would assume that these two problems (hanging and throwing an error) are related and may have the same cause.