tensorflow: MirroredStrategy: Efficient allreduce is not supported for n IndexedSlices
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 7
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
- TensorFlow installed from (source or binary): pip
- TensorFlow version (use command below): 2.4.4
- Python version: 3.8
- Bazel version (if compiling from source): No
- GCC/Compiler version (if compiling from source): No
- CUDA/cuDNN version: CUDA 11.0 + cuDNN 8
- GPU model and memory: V100, 32 GB
Describe the current behavior
With an Embedding layer, MirroredStrategy reports a warning: MirroredStrategy: Efficient allreduce is not supported for n IndexedSlices.
Training speed scales up with 2-4 GPUs, but as we add more GPU devices, training speed does not scale any further.
Besides, GPU utilization is quite low when we add more GPUs (it jumps between 0% and 100%).
I have searched the Internet and found some workarounds, e.g. #41898. But even if I replace MirroredStrategy with MultiWorkerMirroredStrategy, GPU utilization stays quite low and the training time is nearly the same as with MirroredStrategy.
Describe the expected behavior
On a single machine with multiple GPUs, MirroredStrategy should scale up training speed with IndexedSlices (Embedding layer).
Standalone code to reproduce the issue (borrowed from @ratovarius):
import numpy as np
import tensorflow as tf


def build_model_():
    input_a_size = 20
    input_b_size = 4
    num_classes = 2
    len_embedding = 256
    input_a = tf.keras.layers.Input(shape=(input_a_size,), name='input_a', dtype=np.uint8)
    input_b = tf.keras.layers.Input(shape=(input_b_size,), name='input_b', dtype=np.float32)
    # Branch A: the Embedding layer is what produces IndexedSlices gradients.
    x = tf.keras.layers.Embedding(len_embedding, 100)(input_a)
    x = tf.keras.layers.Conv1D(128, 4, activation='relu')(x)
    x = tf.keras.layers.MaxPooling1D(4)(x)
    x = tf.keras.layers.Flatten()(x)
    branch_a = tf.keras.layers.Dense(64, activation='relu')(x)
    # Branch B: dense features.
    x = tf.keras.layers.Dense(32, activation='relu')(input_b)
    branch_b = tf.keras.layers.Dense(32, activation='relu')(x)
    concat = tf.keras.layers.Concatenate()([
        branch_a,
        branch_b,
    ])
    x = tf.keras.layers.Dense(512, activation='relu')(concat)
    output = tf.keras.layers.Dense(num_classes, name='output', activation='softmax')(x)
    model = tf.keras.models.Model(inputs=[
        input_a,
        input_b,
    ], outputs=[output])
    return model


strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
with strategy.scope():
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

y_train = True
y_train = tf.keras.utils.to_categorical(y_train, 2)  # -> [0., 1.]

dataset = tf.data.Dataset.from_tensors(
    (
        {
            "input_a": [1] * 20,    # 20 token ids, matching Input(shape=(20,), dtype=uint8)
            "input_b": [1.] * 4,    # 4 dense features, matching Input(shape=(4,))
        },
        {"output": y_train},
    )
).repeat(1000000).batch(256)

history = model.fit(
    x=dataset,
    epochs=10,
    verbose=1,
)
Other info / logs
About this issue
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 28 (8 by maintainers)
@byronyi @jvishnuvardhan Hi, is there any progress? What should I do to help you solve this problem?
We understand that it is not expected to have a tuning guide for every model in every scenario. However, it seems reasonable to request documentation and commentary on performance issues when using core components. One concern with TensorFlow is that the documentation might be dispersed across various sources, making it challenging to find consolidated information. This particular issue serves as an example of this concern, where the use-case is fairly straightforward (an application-specific embedding), the functions used to create the components come from the baseline TensorFlow framework, and the distribution strategy is based on TensorFlow code. Despite this, there seems to be a lack of official guidance for over two years, leaving the community with limited information on why they may encounter problems.
I would like to share what we did to resolve the embedding table training scalability problem. The root cause seems to be the sparse gradient aggregation across all GPUs: every GPU collects all triggered indices from every other GPU, deduplicates the indices, and adds together the gradients with the same indices. When the batch size is large, this sparse gradient aggregation time increases linearly with the number of GPUs, which makes multi-GPU training not scalable at all.
We tried two solutions, and both seem to work, depending on the table size.
Hopefully this helps.
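For completeness, here is a minimal sketch of one mitigation that is often discussed for this warning (it is not necessarily either of the two solutions above): densify the embedding gradient before applying it, so the cross-replica aggregation takes the regular dense all-reduce path instead of the IndexedSlices fallback. It trades extra memory and bandwidth (a full embedding-table-sized gradient per step) for simpler aggregation. build_model_ and dataset refer to the repro earlier in this issue.

import tensorflow as tf

GLOBAL_BATCH_SIZE = 256  # matches the .batch(256) in the repro

strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
with strategy.scope():
    model = build_model_()
    optimizer = tf.keras.optimizers.Adam()
    # Reduction.NONE + compute_average_loss is the standard pattern for
    # custom training loops under a distribution strategy.
    loss_fn = tf.keras.losses.BinaryCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        features, labels = inputs
        with tf.GradientTape() as tape:
            preds = model(features, training=True)
            per_example_loss = loss_fn(labels['output'], preds)
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        # Densify the sparse embedding gradient (an IndexedSlices) so the
        # cross-replica aggregation uses the efficient dense all-reduce.
        grads = [tf.convert_to_tensor(g) if isinstance(g, tf.IndexedSlices) else g
                 for g in grads]
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica_loss = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)

dist_dataset = strategy.experimental_distribute_dataset(dataset)
for batch in dist_dataset:
    loss = train_step(batch)

Whether this actually helps depends on the table size: for a very large vocabulary, the dense gradient can be more expensive than the sparse aggregation it replaces.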
If you are using a single host, switch to MirroredStrategy. Are you using NVLink/NVSwitch or PCIe for your bus topology? It would be helpful to set NCCL_DEBUG=INFO and paste the NCCL log here. By the way, GPU utilization says little about the actual performance. Take a look at your GPU power consumption; if training is stuck on an NCCL deadlock, the utilization can reach 100% while the power consumption stays relatively low.
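For example, something along these lines at the very top of the training script (the NCCL_DEBUG_SUBSYS line is optional extra detail); running nvidia-smi topo -m also shows whether the GPUs are connected over NVLink or PCIe:

import os

# NCCL reads these at initialization time, so set them before TensorFlow
# creates its collective ops.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # optional: topology/graph details

import tensorflow as tf  # imported after the env vars on purpose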