tensorflow: Keras Model, Functional API, Multi-input, Efficient allreduce is not supported for n IndexedSlices


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.3.0-rc2-23-gb36436b087 2.3.0
  • Python version: Python 3.6.9
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: CUDA Version 10.1.243
  • GPU model and memory: (8x) Tesla K80 - 11441MiB - Driver Version: 410.72

Describe the current behavior The same issue occurs with TF v2.2.0. I am using the Keras functional API to train a model with more than one input. As a simplified example that reproduces the same problem:

import sys
import tensorflow as tf
import numpy as np

def build_model_():

    input_a_size = 20
    input_b_size = 4
    num_classes = 2
    len_embedding = 256

    input_a = tf.keras.layers.Input(shape=(input_a_size,), name='input_a', dtype=np.uint8)
    input_b = tf.keras.layers.Input(shape=(input_b_size,), name='input_b', dtype=np.float32)

    # Branch A: embedding + convolution over the integer input
    x = tf.keras.layers.Embedding(len_embedding, 100)(input_a)
    x = tf.keras.layers.Conv1D(128, 4, activation='relu')(x)
    x = tf.keras.layers.MaxPooling1D(4)(x)
    x = tf.keras.layers.Flatten()(x)
    branch_a = tf.keras.layers.Dense(64, activation='relu')(x)

    # Branch B: dense layers over the float input
    x = tf.keras.layers.Dense(32, activation='relu')(input_b)
    branch_b = tf.keras.layers.Dense(32, activation='relu')(x)

    concat = tf.keras.layers.Concatenate()([branch_a, branch_b])

    x = tf.keras.layers.Dense(512, activation='relu')(concat)
    output = tf.keras.layers.Dense(num_classes, name='output', activation='softmax')(x)

    model = tf.keras.models.Model(inputs=[input_a, input_b], outputs=[output])

    return model

strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
with strategy.scope():
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Single dummy label, one-hot encoded to match the 2-class softmax output
y_train = True
y_train = tf.keras.utils.to_categorical(y_train, 2)

dataset = tf.data.Dataset.from_tensors(
    (
        {"input_a": [[1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]],
         "input_b": [[1.], [1.], [1.], [1.]],},
        {"output": y_train},
    )
).repeat(1000000).batch(256)

history = model.fit(
    x=dataset,
    epochs=10,
    verbose=1,
)

When training starts I get this warning: WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices. When training a model with 3 inputs I get …not supported for 2 IndexedSlices. In general I get WARNING:tensorflow:Efficient allreduce is not supported for n-1 IndexedSlices, where n is the number of inputs to the network.

Performance does not scale across multiple GPUs: training is slower with 2 GPUs than with 1 GPU, and is slowest with 8 GPUs.
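
For context, the IndexedSlices in the warning appear to come from the sparse gradients produced by the tf.keras.layers.Embedding lookup, which would match the n-1 pattern above if all but one of the inputs go through an Embedding. Below is a minimal sketch to confirm which variables receive IndexedSlices gradients; it reuses build_model_ from above, and the dummy batch and loss are only illustrative:

import numpy as np
import tensorflow as tf

model = build_model_()
loss_fn = tf.keras.losses.BinaryCrossentropy()

# Illustrative dummy batch matching the two named inputs
batch_a = np.ones((8, 20), dtype=np.uint8)
batch_b = np.ones((8, 4), dtype=np.float32)
labels = np.tile([[0., 1.]], (8, 1)).astype(np.float32)

with tf.GradientTape() as tape:
    preds = model([batch_a, batch_b], training=True)
    loss = loss_fn(labels, preds)

grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    if isinstance(grad, tf.IndexedSlices):
        # Only the Embedding weights are expected here; these sparse gradients
        # are what MirroredStrategy's allreduce warns about.
        print(var.name, '-> IndexedSlices')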

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 7
  • Comments: 19 (5 by maintainers)

Most upvoted comments

Any update? Should we open a new issue to track a solution?

@nikitamaia Same issue here with tf-2.4. MultiWorkerMirroredStrategy is optimized for distributed training across multiple machines (a networked environment), while MirroredStrategy is optimized for multi-GPU training on a single machine. Is this right?

If that’s the case, I think the issue should not be closed, as the workaround for distributed training with a multi-input model is not optimal.

@nikitamaia Though MultiWorkerMirroredStrategy won’t throw the IndexedSlices warning, it takes a long time to start training, much longer than MirroredStrategy.

Is there any progress on this issue?

Glad to hear it! When you just have one machine, MirroredStrategy is better tested and definitely preferred. However, for this performance issue with IndexedSlices, MultiWorkerMirroredStrategy is a potential workaround. I can update this thread when there is a change to how MirroredStrategy handles IndexedSlices.
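
For anyone who wants to try that workaround, here is a minimal sketch for a single machine (reusing build_model_ from the original report; in TF 2.3/2.4 the strategy lives under tf.distribute.experimental, and no TF_CONFIG is needed for a single worker):

import tensorflow as tf

# Single-worker MultiWorkerMirroredStrategy mirrors across all visible GPUs
# on this machine, similar to MirroredStrategy but with a different allreduce path.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])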

@r-wheeler can you open a separate issue with some more information about your model and the GPUs you’re using? Sometimes when there’s a drop like that it can be a network issue. Did you try experimenting with MultiWorkerMirroredStrategy?

I increased the size of the model from the original issue a little to get more representative timings. Now I’m scaling the batch size according to the number of GPUs used.

BATCH_SIZE_PER_REPLICA = 1024
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

dataset = tf.data.Dataset.from_tensors(
    (
        {"input_a": [[1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]], 
         "input_b": [[1.], [1.], [1.], [1.]],}, 
        {"output": y_train},
    )
).repeat(1000000).batch(GLOBAL_BATCH_SIZE)
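
With the single example repeated 1,000,000 times, the steps per epoch in the timings below scale inversely with the global batch size, roughly 1,000,000 / (1024 × num_replicas), which matches the 977 / 489 / 245 / 123 step counts.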

Timing comparison:

  1. tf.distribute.MirroredStrategy
  • strategy = tf.distribute.MirroredStrategy(['/gpu:0'])
Epoch 1/10
977/977 [==============================] - 17s 17ms/step
Epoch 2/10
977/977 [==============================] - 16s 17ms/step
  • strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1']) # (2 GPUs)
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
Epoch 1/10
489/489 [==============================] - 12s 25ms/step
Epoch 2/10
489/489 [==============================] - 12s 25ms/step
  • strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1','/gpu:2', '/gpu:3']) # (4 GPUs)
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
Epoch 1/10
245/245 [==============================] - 11s 46ms/step
Epoch 2/10
245/245 [==============================] - 11s 46ms/step
  • strategy = tf.distribute.MirroredStrategy() # (8 GPUs)
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
Epoch 1/10
123/123 [==============================] - 12s 95ms/step
Epoch 2/10
123/123 [==============================] - 11s 89ms/step
  2. tf.distribute.experimental.MultiWorkerMirroredStrategy(). Configuring the number of GPUs with tf.config.set_visible_devices (a sketch of this configuration is shown after the timings below). The performance warning disappears.
  • 1 GPU
Epoch 1/10
977/977 [==============================] - 17s 17ms/step
Epoch 2/10
977/977 [==============================] - 16s 17ms/step
  • 2 GPUs
Epoch 1/10
489/489 [==============================] - 11s 23ms/step
Epoch 2/10
489/489 [==============================] - 11s 22ms/step
  • 4 GPUs
Epoch 1/10
245/245 [==============================] - 7s 30ms/step
Epoch 2/10
245/245 [==============================] - 7s 30ms/step
  • 8 GPUs
Epoch 1/10
123/123 [==============================] - 6s 49ms/step
Epoch 2/10
123/123 [==============================] - 5s 44ms/step

So I do get performance scaling across GPUs now, thanks @nikitamaia.
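
For completeness, a sketch of the GPU-count configuration mentioned above (the device count here is illustrative; tf.config.set_visible_devices has to run before the GPUs are initialized, i.e. before creating the strategy or running any GPU op):

import tensorflow as tf

# Make only the first N physical GPUs visible so the strategy mirrors across them.
num_gpus = 4  # illustrative; matches the 4-GPU timing above
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(gpus[:num_gpus], 'GPU')

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()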

No one cares it seems 😦