tensorflow: Keras Model, Functional API, Multi-input, Efficient allreduce is not supported for n IndexedSlices


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.3.0-rc2-23-gb36436b087 2.3.0
  • Python version: Python 3.6.9
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: CUDA Version 10.1.243
  • GPU model and memory: (8x) Tesla K80 - 11441MiB - Driver Version: 410.72

Describe the current behavior The same issue occurs with TF v2.2.0. I am using the Keras functional API to train a model with more than one input. As a simplified example that reproduces the same problem:

import sys
import tensorflow as tf
import numpy as np

def build_model_():

    input_a_size = 20
    input_b_size = 4
    num_classes = 2
    len_embedding = 256

    input_a = tf.keras.layers.Input(shape=(input_a_size,), name='input_a', dtype=np.uint8)
    input_b = tf.keras.layers.Input(shape=(input_b_size,), name='input_b', dtype=np.float32)

    # Branch A: embedding + convolution over the integer input
    x = tf.keras.layers.Embedding(len_embedding, 100)(input_a)
    x = tf.keras.layers.Conv1D(128, 4, activation='relu')(x)
    x = tf.keras.layers.MaxPooling1D(4)(x)
    x = tf.keras.layers.Flatten()(x)
    branch_a = tf.keras.layers.Dense(64, activation='relu')(x)

    # Branch B: dense layers over the float input
    x = tf.keras.layers.Dense(32, activation='relu')(input_b)
    branch_b = tf.keras.layers.Dense(32, activation='relu')(x)

    concat = tf.keras.layers.Concatenate()([branch_a, branch_b])

    x = tf.keras.layers.Dense(512, activation='relu')(concat)
    output = tf.keras.layers.Dense(num_classes, name='output', activation='softmax')(x)

    model = tf.keras.models.Model(inputs=[input_a, input_b], outputs=[output])

    return model

strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
with strategy.scope():
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Single dummy label, one-hot encoded to match the 2-class softmax output
y_train = True
y_train = tf.keras.utils.to_categorical(y_train, 2)

dataset = tf.data.Dataset.from_tensors(
    (
        {"input_a": [[1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]],
         "input_b": [[1.], [1.], [1.], [1.]],},
        {"output": y_train},
    )
).repeat(1000000).batch(256)

history = model.fit(
    x=dataset,
    epochs=10,
    verbose=1,
)

When training starts I get this warning: WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices. When training a model with 3 inputs I get …not supported for 2 IndexedSlices. In general I get WARNING:tensorflow:Efficient allreduce is not supported for n-1 IndexedSlices, where n is the number of inputs to the network.

Performance does not scale across multiple GPUs: training is slower with 2 GPUs than with 1 GPU, and is slowest with 8 GPUs.
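
For context, the IndexedSlices in the warning appear to come from the sparse gradients produced by the tf.keras.layers.Embedding lookup, which would match the n-1 pattern above if all but one of the inputs go through an Embedding. Below is a minimal sketch to confirm which variables receive IndexedSlices gradients; it reuses build_model_ from above, and the dummy batch and loss are only illustrative:

import numpy as np
import tensorflow as tf

model = build_model_()
loss_fn = tf.keras.losses.BinaryCrossentropy()

# Illustrative dummy batch matching the two named inputs
batch_a = np.ones((8, 20), dtype=np.uint8)
batch_b = np.ones((8, 4), dtype=np.float32)
labels = np.tile([[0., 1.]], (8, 1)).astype(np.float32)

with tf.GradientTape() as tape:
    preds = model([batch_a, batch_b], training=True)
    loss = loss_fn(labels, preds)

grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    if isinstance(grad, tf.IndexedSlices):
        # Only the Embedding weights are expected here; these sparse gradients
        # are what MirroredStrategy's allreduce warns about.
        print(var.name, '-> IndexedSlices')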

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 7
  • Comments: 19 (5 by maintainers)

Most upvoted comments

Any update? Should we open a new issue to track a solution?

@nikitamaia Same issue here with tf-2.4. MultiWorkerMirroredStrategy is optimized for distributed training across multiple machines (a networked environment), while MirroredStrategy is optimized for multi-GPU training on a single machine. Is this right?

If that’s the case, I think the issue should not be closed, as the workaround for distributed training with a multi-input model is not optimal.

@nikitamaia Though MultiWorkerMirroredStrategy won’t throw the IndexedSlices warning, it takes a long time to start training, much longer than MirroredStrategy.

Is there any progress on this issue?

Glad to hear it! When you just have one machine, MirroredStrategy is better tested and definitely preferred. However, for this performance issue with IndexedSlices, MultiWorkerMirroredStrategy is a potential workaround. I can update this thread when there is a change to how MirroredStrategy handles IndexedSlices.
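
For anyone who wants to try that workaround, here is a minimal sketch for a single machine (reusing build_model_ from the original report; in TF 2.3/2.4 the strategy lives under tf.distribute.experimental, and no TF_CONFIG is needed for a single worker):

import tensorflow as tf

# Single-worker MultiWorkerMirroredStrategy mirrors across all visible GPUs
# on this machine, similar to MirroredStrategy but with a different allreduce path.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])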

@r-wheeler can you open a separate issue with some more information about your model and the GPUs you’re using? Sometimes when there’s a drop like that it can be a network issue. Did you try experimenting with MultiWorkerMirroredStrategy?

I increased the size of the model from the original issue a little to get more representative timings. Now I’m scaling the batch size according to the number of GPUs used.

BATCH_SIZE_PER_REPLICA = 1024
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

dataset = tf.data.Dataset.from_tensors(
    (
        {"input_a": [[1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]], 
         "input_b": [[1.], [1.], [1.], [1.]],}, 
        {"output": y_train},
    )
).repeat(1000000).batch(GLOBAL_BATCH_SIZE)
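
With the single example repeated 1,000,000 times, the steps per epoch in the timings below scale inversely with the global batch size, roughly 1,000,000 / (1024 × num_replicas), which matches the 977 / 489 / 245 / 123 step counts.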

Timing comparison:

  1. tf.distribute.MirroredStrategy
  • strategy = tf.distribute.MirroredStrategy(['/gpu:0'])
Epoch 1/10
977/977 [==============================] - 17s 17ms/step
Epoch 2/10
977/977 [==============================] - 16s 17ms/step
  • strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1']) # (2 GPUs)
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
Epoch 1/10
489/489 [==============================] - 12s 25ms/step
Epoch 2/10
489/489 [==============================] - 12s 25ms/step
  • strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1','/gpu:2', '/gpu:3']) # (4 GPUs)
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
Epoch 1/10
245/245 [==============================] - 11s 46ms/step
Epoch 2/10
245/245 [==============================] - 11s 46ms/step
  • strategy = tf.distribute.MirroredStrategy() # (8 GPUs)
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
Epoch 1/10
123/123 [==============================] - 12s 95ms/step
Epoch 2/10
123/123 [==============================] - 11s 89ms/step
  2. tf.distribute.experimental.MultiWorkerMirroredStrategy(). Configuring the number of GPUs with tf.config.set_visible_devices (a sketch of this configuration is shown after the timings below). The performance warning disappears.
  • 1 GPU
Epoch 1/10
977/977 [==============================] - 17s 17ms/step
Epoch 2/10
977/977 [==============================] - 16s 17ms/step
  • 2 GPUs
Epoch 1/10
489/489 [==============================] - 11s 23ms/step
Epoch 2/10
489/489 [==============================] - 11s 22ms/step
  • 4 GPUs
Epoch 1/10
245/245 [==============================] - 7s 30ms/step
Epoch 2/10
245/245 [==============================] - 7s 30ms/step
  • 8 GPUs
Epoch 1/10
123/123 [==============================] - 6s 49ms/step
Epoch 2/10
123/123 [==============================] - 5s 44ms/step

So I do get performance scaling across GPUs now, thanks @nikitamaia.
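
For completeness, a sketch of the GPU-count configuration mentioned above (the device count here is illustrative; tf.config.set_visible_devices has to run before the GPUs are initialized, i.e. before creating the strategy or running any GPU op):

import tensorflow as tf

# Make only the first N physical GPUs visible so the strategy mirrors across them.
num_gpus = 4  # illustrative; matches the 4-GPU timing above
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(gpus[:num_gpus], 'GPU')

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()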

No one cares it seems 😦