tensorflow: MirroredStrategy: Efficient allreduce is not supported for n IndexedSlices

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Centos 7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 2.4.4
  • Python version: 3.8
  • Bazel version (if compiling from source): No
  • GCC/Compiler version (if compiling from source): No
  • CUDA/cuDNN version: cuda11.0+cudnn8
  • GPU model and memory: V100+32GB

Describe the current behavior With an Embedding layer, MirroredStrategy reports a warning: MirroredStrategy: Efficient allreduce is not supported for n IndexedSlices. Training speed scales up with 2-4 GPUs, but as we add more GPU devices it stops scaling. In addition, GPU utilization becomes quite low and unstable as more GPUs are added (it jumps between 0% and 100%).

I have searched the Internet and found some workarounds, e.g. #41898. But even if I replace MirroredStrategy with MultiWorkerMirroredStrategy (see the sketch below), GPU utilization is still quite low and the training time is nearly the same as with MirroredStrategy.
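
For reference, this is roughly the strategy swap I tried (a minimal sketch; single worker with TF_CONFIG unset, and the explicit NCCL option is my assumption rather than an exact copy of my script):

import tensorflow as tf

# With TF_CONFIG unset, all local GPUs form a single worker; NCCL is requested
# explicitly for the collective ops.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)
with strategy.scope():
    model = build_model_()  # model definition from the reproducer below
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])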

Describe the expected behavior On a single machine with multiple GPUs, MirroredStrategy should scale up training speed for models whose gradients are IndexedSlices (e.g. models with an Embedding layer).

Standalone code to reproduce the issue The minimal reproducer below is borrowed from @ratovarius.

import sys
import tensorflow as tf
import numpy as np

def build_model_():
    input_a_size = 20
    input_b_size = 4
    num_classes = 2
    len_embedding = 256

    input_a = tf.keras.layers.Input(shape=(input_a_size,), name='input_a', dtype=np.uint8)
    input_b = tf.keras.layers.Input(shape=(input_b_size,), name='input_b', dtype=np.float32)

    x = tf.keras.layers.Embedding(len_embedding, 100)(input_a)
    x = tf.keras.layers.Conv1D(128, 4, activation='relu')(x)
    x = tf.keras.layers.MaxPooling1D(4)(x)
    x = tf.keras.layers.Flatten()(x)
    branch_a = tf.keras.layers.Dense(64, activation='relu')(x)

    x = tf.keras.layers.Dense(32, activation='relu')(input_b)
    branch_b = tf.keras.layers.Dense(32, activation='relu')(x)

    concat = tf.keras.layers.Concatenate()([branch_a, branch_b])

    x = tf.keras.layers.Dense(512, activation='relu')(concat)
    output = tf.keras.layers.Dense(num_classes, name='output', activation='softmax')(x)

    model = tf.keras.models.Model(inputs=[input_a, input_b], outputs=[output])

    return model

strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
with strategy.scope():
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

y_train = tf.keras.utils.to_categorical(True, 2)

# Feature shapes match the Input layers above: input_a is a vector of 20
# embedding indices, input_b a vector of 4 floats.
dataset = tf.data.Dataset.from_tensors(
    (
        {"input_a": [1] * 20,
         "input_b": [1.0] * 4},
        {"output": y_train},
    )
).repeat(1000000).batch(256)

history = model.fit(
    dataset,
    epochs=10,
    verbose=1,
)


About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 28 (8 by maintainers)

Most upvoted comments

@byronyi @jvishnuvardhan Hi, is there any progress? What should I do to help you solve this problem?

  1. Can you please confirm the accuracy of these statements? “The root cause seems to be the sparse gradient aggregation among all GPUs. Basically, every GPU collects all triggered indices from all other GPUs, then deduplicates the indices and adds together the gradients that share the same indices. When the batch size is large, this sparse gradient aggregation time increases linearly with the number of GPUs, which makes multi-GPU training not scalable at all.”
  2. Would you recommend the solutions proposed by lly-zero-one?
  3. Are there alternative solutions, other than those suggested by lly-zero-one, that we could take into consideration?

We understand that it is not expected to have a tuning guide for every model in every scenario. However, it seems reasonable to request documentation and commentary on performance issues when using core components. One concern with TensorFlow is that the documentation might be dispersed across various sources, making it challenging to find consolidated information. This particular issue serves as an example of this concern, where the use-case is fairly straightforward (an application-specific embedding), the functions used to create the components come from the baseline TensorFlow framework, and the distribution strategy is based on TensorFlow code. Despite this, there seems to be a lack of official guidance for over two years, leaving the community with limited information on why they may encounter problems.

I would like to share what we did to resolve the embedding table training scalability problem. The root cause seems to be the sparse gradient aggregation among all GPUs. Basically, every GPU collects all triggered indices from all other GPUs, then deduplicates the indices and adds together the gradients that share the same indices. When the batch size is large, this sparse gradient aggregation time increases linearly with the number of GPUs, which makes multi-GPU training not scalable at all.
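
For illustration only (this snippet is mine, not part of the comment above): the gradient of an embedding lookup arrives as an IndexedSlices rather than a dense Tensor, which is exactly what the cross-GPU aggregation has to merge.

import tensorflow as tf

table = tf.Variable(tf.random.normal([256, 100]))  # a small embedding table
ids = tf.constant([[1, 3, 3, 7]])                  # a batch of lookup indices

with tf.GradientTape() as tape:
    emb = tf.nn.embedding_lookup(table, ids)
    loss = tf.reduce_sum(emb)

grad = tape.gradient(loss, table)
print(type(grad).__name__)               # IndexedSlices, not Tensor
dense_grad = tf.convert_to_tensor(grad)  # densified to shape (256, 100)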

We tried two solutions, and both seem to work, depending on the table size.

  1. If the embedding table is not large, we can treat its gradient as dense and circulate it among all GPUs with all-reduce. To achieve this, we used Horovod: its optimizer wrapper has a “sparse_as_dense” flag, which can be set to true (see the sketch after this list).
  2. We can also shard the table by using NVIDIA HugeCTR and rely on collective communication (all-to-all, reduce-scatter, etc., depending on the sharding strategy) to distribute the embedding lookup and gradient update among all GPUs.
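
A minimal sketch of option 1, assuming Horovod with TensorFlow support is installed and the script is launched with horovodrun (one process per GPU); the learning-rate scaling and broadcast callback are conventional Horovod usage, not something prescribed in this issue:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each Horovod process to its own GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = build_model_()  # model and dataset as defined in the reproducer above

opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
# sparse_as_dense=True converts the Embedding layer's IndexedSlices gradients
# into dense tensors so they go through the regular all-reduce path.
opt = hvd.DistributedOptimizer(opt, sparse_as_dense=True)

model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(dataset, epochs=10, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)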

Hopefully it helps.

If you are using a single host, switch to MirroredStrategy. Are you using NVLink/NVSwitch or PCIe for your bus topology? It’d be helpful to set NCCL_DEBUG=INFO and paste the NCCL log here.
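
For example, a minimal way to capture that log with the reproducer above (my sketch; the variables must be set before NCCL is initialized, or exported in the shell before launching):

import os
os.environ["NCCL_DEBUG"] = "INFO"               # print NCCL version, rings, transports
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # optional: limit to topology info

import tensorflow as tf  # import after setting the variables

strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
# ...build, compile and fit as in the reproducer; during the first all-reduce
# NCCL logs to stderr which transport each ring uses (P2P/NVLink vs. PCIe/SHM).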

Btw, GPU utilization says little about the actual performance. Take a look at your GPU power consumption instead; if training is stuck on an NCCL deadlock, utilization can read 100% while power consumption stays relatively low.