tensorflow: [Bug] TF 2.2.0rc0 fails with AMP and Horovod 0.19.1 in Keras compile & fit

Following the recent changes in the TensorFlow Keras optimizer API and Horovod, we did some testing and found that the following configuration is now broken (a quick version check is sketched right after the list):

  • Tensorflow 2.2.0rc0
  • Horovod 0.19.1
  • AMP + Keras Model Compile & Fit
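
For completeness, a quick way to confirm the versions in the environment (a sketch; it assumes the standard tf.__version__ and top-level horovod.__version__ attributes):

import tensorflow as tf
import horovod

# Expected for the broken configuration: TF 2.2.0-rc0 and Horovod 0.19.1.
print('TensorFlow:', tf.__version__)
print('Horovod:   ', horovod.__version__)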

@sanjoy @pkanwar23 could we make sure to fix this one before TF 2.2.0 is officially published? It’s still an RC release for now 😃

If needed, you can use this Docker container, which contains the right set of dependencies and is based on the public TF 2.2.0rc0 container:

docker pull born2data/tensorflow:hvd-0.19.1_tf_2.2.0rc0

Code to reproduce (launch command, followed by the contents of main.py):

mpirun \
    -np 2 \
    -H localhost:2 \
    -bind-to none \
    -map-by slot \
    -x NCCL_DEBUG=VERSION \
    -x LD_LIBRARY_PATH \
    -x PATH \
    -mca pml ob1 -mca btl ^openib \
    --allow-run-as-root \
    python main.py

main.py:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(mnist_images, mnist_labels), _ = \
    tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
     tf.cast(mnist_labels, tf.int64))
)
dataset = dataset.repeat().shuffle(10000).batch(128)

# AMP: enable mixed precision ('mixed_float16') with a fixed loss scale of 128.
policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16', 128)
tf.keras.mixed_precision.experimental.set_policy(policy)

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Horovod: adjust learning rate based on number of GPUs.
opt = tf.optimizers.Adam(0.001 * hvd.size())

# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(opt)

# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt,
                    metrics=['accuracy'],
                    experimental_run_tf_function=False)

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
mnist_model.fit(
    dataset,
    steps_per_epoch=500 // hvd.size(),
    callbacks=callbacks,
    epochs=24,
    verbose=1 if hvd.rank() == 0 else 0
)

Error:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:503 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:473 train_step  **
        _minimize(tape, self.optimizer, loss, self.trainable_variables)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1739 _minimize
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:232 apply_gradients
        args=(grads_and_vars, name, all_reduce_sum_gradients))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2420 merge_call
        return self._merge_call(merge_fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2427 _merge_call
        return merge_fn(self._strategy, *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:256 _apply_gradients_cross_replica  **
        control_flow_ops.no_op)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/smart_cond.py:54 smart_cond
        return true_fn()
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:248 apply_fn
        all_reduce_sum_gradients))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:262 _apply_gradients
        name, all_reduce_sum_gradients)
    /usr/local/lib/python3.6/dist-packages/horovod/_keras/__init__.py:73 apply_gradients
        raise Exception('`apply_gradients()` was called without a call to '

    Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`.
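
For context, the check that raises this lives in Horovod's Keras optimizer wrapper (horovod/_keras/__init__.py in the traceback). A minimal sketch of the guard logic follows; this is not Horovod's actual internals, and the class name is illustrative:

# Sketch: the wrapper allreduces gradients inside its aggregation hooks and
# refuses to apply gradients that never went through them, since that would
# silently skip the allreduce across workers.
class AllreduceGuardSketch:
    def __init__(self):
        self._aggregated_gradients = False

    def _aggregate_gradients(self, grads_and_vars):
        # The real wrapper performs the Horovod allreduce here.
        self._aggregated_gradients = True
        return list(grads_and_vars)

    def apply_gradients(self, grads_and_vars, **kwargs):
        if not self._aggregated_gradients:
            raise Exception('`apply_gradients()` was called without a call to '
                            '`get_gradients()` or `_aggregate_gradients`.')
        # ... apply the already-allreduced gradients ...

From the traceback, apply_gradients is reached through Keras's LossScaleOptimizer (added by the AMP policy), which does not route gradient aggregation through the wrapped Horovod optimizer, so the guard trips; the comments below discuss fixes for exactly that.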

Please let me know how I can help.

CC: @nluehr @reedwm @tgaddair @cliffwoolley @omalleyt12 @houtoms

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 15 (13 by maintainers)

Most upvoted comments

Ah, we commented at the same time. Yeah, that would fix it, but it should be set to True only if the inner optimizer defined it.

I will fix.

This is not closed until bcfc1ba6798ced6889f579644c2c79515832d098 is cherry-picked into 2.2.

JFYI @reedwm: changing loss_scale_optimizer.py as follows solves the problem, but we are not sure if this is just a workaround (WAR).

class LossScaleOptimizer(optimizer_v2.OptimizerV2):
    ...
    _HAS_ALL_REDUCE_SUM_GRAD = True

    def _aggregate_gradients(self, grads_and_vars):
        return self._optimizer._aggregate_gradients(grads_and_vars)
    ...

Hmmm, this is probably because LossScaleOptimizer doesn’t define _HAS_ALL_REDUCE_SUM_GRAD. We should set _HAS_ALL_REDUCE_SUM_GRAD to True if the inner optimizer has set it to True.
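
A conditional version of the workaround above, as a minimal sketch (the property-based delegation is an assumption, not the actual TF patch):

class LossScaleOptimizer(optimizer_v2.OptimizerV2):
    ...
    @property
    def _HAS_ALL_REDUCE_SUM_GRAD(self):
        # Opt in only when the wrapped optimizer (e.g. hvd.DistributedOptimizer)
        # declares that it handles gradient aggregation/allreduce itself.
        return getattr(self._optimizer, '_HAS_ALL_REDUCE_SUM_GRAD', False)

    def _aggregate_gradients(self, grads_and_vars):
        # Delegate aggregation so the inner optimizer's allreduce actually runs.
        return self._optimizer._aggregate_gradients(grads_and_vars)
    ...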