tensorflow: Keras Model, Functional API, Multi-input, Efficient allreduce is not supported for n IndexedSlices
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
- TensorFlow installed from (source or binary): binary
- TensorFlow version: v2.3.0-rc2-23-gb36436b087 2.3.0
- Python version: Python 3.6.9
- Bazel version (if compiling from source): -
- GCC/Compiler version (if compiling from source): -
- CUDA/cuDNN version: CUDA Version 10.1.243
- GPU model and memory: (8x) Tesla K80 - 11441MiB - Driver Version: 410.72
Describe the current behavior
The same issue occurs with TF v2.2.0. I am using the Keras functional API to train a model with more than one input. A simplified example that reproduces the problem:
```python
import sys
import tensorflow as tf
import numpy as np


def build_model_():
    input_a_size = 20
    input_b_size = 4
    num_classes = 2
    len_embedding = 256

    input_a = tf.keras.layers.Input(shape=(input_a_size,), name='input_a', dtype=np.uint8)
    input_b = tf.keras.layers.Input(shape=(input_b_size,), name='input_b', dtype=np.float32)

    # Branch A: embedding + convolution over input_a
    x = tf.keras.layers.Embedding(len_embedding, 100)(input_a)
    x = tf.keras.layers.Conv1D(128, 4, activation='relu')(x)
    x = tf.keras.layers.MaxPooling1D(4)(x)
    x = tf.keras.layers.Flatten()(x)
    branch_a = tf.keras.layers.Dense(64, activation='relu')(x)

    # Branch B: dense layers over input_b
    x = tf.keras.layers.Dense(32, activation='relu')(input_b)
    branch_b = tf.keras.layers.Dense(32, activation='relu')(x)

    concat = tf.keras.layers.Concatenate()([
        branch_a,
        branch_b,
    ])
    x = tf.keras.layers.Dense(512, activation='relu')(concat)
    output = tf.keras.layers.Dense(num_classes, name='output', activation='softmax')(x)

    model = tf.keras.models.Model(inputs=[
        input_a,
        input_b,
    ], outputs=[output])
    return model


strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
with strategy.scope():
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

y_train = True
y_train = tf.keras.utils.to_categorical(y_train, 2)

# Dummy dataset: 20 features for input_a and 4 for input_b, matching the Input shapes above.
dataset = tf.data.Dataset.from_tensors(
    (
        {"input_a": [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         "input_b": [1., 1., 1., 1.]},
        {"output": y_train},
    )
).repeat(1000000).batch(256)

history = model.fit(
    x=dataset,
    epochs=10,
    verbose=1,
)
```
When training starts I get this warning:

WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices

When training a model with 3 inputs I get "…not supported for 2 IndexedSlices", so in general I am getting

WARNING:tensorflow:Efficient allreduce is not supported for n-1 IndexedSlices

where n is the number of inputs to the net.
Performance does not scale across multiple GPUs: training is slower with 2 GPUs than with 1 GPU, and slowest of all with 8 GPUs.
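The count in the warning presumably tracks the number of variables that receive sparse (IndexedSlices) gradients, i.e. the Embedding weights in each input branch, rather than the inputs themselves. A minimal sketch (not from the original report, run outside any distribution strategy) to check which gradients come back as IndexedSlices:

```python
import tensorflow as tf

# Build the reproducer's model outside any strategy scope and compute one batch of gradients.
model = build_model_()
loss_fn = tf.keras.losses.BinaryCrossentropy()

inputs = {
    "input_a": tf.ones((8, 20), dtype=tf.uint8),   # dummy token ids
    "input_b": tf.ones((8, 4), dtype=tf.float32),  # dummy dense features
}
labels = tf.one_hot(tf.ones((8,), dtype=tf.int32), depth=2)

with tf.GradientTape() as tape:
    preds = model(inputs, training=True)
    loss = loss_fn(labels, preds)

grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    if isinstance(grad, tf.IndexedSlices):
        # Expected to print only the Embedding weights.
        print("IndexedSlices gradient for:", var.name)
```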
About this issue
- State: closed
- Created 4 years ago
- Reactions: 7
- Comments: 19 (5 by maintainers)
Any update? Should we open a new issue to track a solution?
@nikitamaia Though MultiWorkerMirroredStrategy won't throw the IndexedSlices warning, it takes a long time to start training, much longer than MirroredStrategy. Is there any progress on this issue?
Glad to hear it! When you just have one machine, MirroredStrategy is better tested and definitely preferred. However, for this performance issue with IndexedSlices, MultiWorkerMirroredStrategy is a potential workaround. I can update this thread when there is a change to how MirroredStrategy handles IndexedSlices.

@r-wheeler can you open a separate issue with some more information about your model and the GPUs you're using? Sometimes when there's a drop like that it can be a network issue. Did you try experimenting with MultiWorkerMirroredStrategy?
I increased the size of the model in the original issue a little to get more representative timings, and I am now sizing the batch size according to the number of GPUs used.

Timing comparison:

tf.distribute.MirroredStrategy:
- strategy = tf.distribute.MirroredStrategy(['/gpu:0'])  # (1 GPU)
- strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])  # (2 GPUs)
- strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3'])  # (4 GPUs)
- strategy = tf.distribute.MirroredStrategy()  # (8 GPUs)

tf.distribute.experimental.MultiWorkerMirroredStrategy(), configuring the number of GPUs with tf.config.set_visible_devices: the performance warning disappears.

So I now have performance scaling with the number of GPUs, thanks @nikitamaia.
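For reference, a minimal sketch of that workaround on a single machine (assuming no TF_CONFIG is set, so the strategy treats the machine as a single worker; the GPU count here is illustrative):

```python
import tensorflow as tf

# Restrict the visible GPUs before any of them are initialized, then build the strategy.
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(gpus[:4], 'GPU')  # e.g. use the first 4 GPUs

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = build_model_()  # the reproducer's model from the issue above
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# model.fit(...) then proceeds exactly as in the reproducer.
```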
No one cares it seems 😦