tensorflow: [Bug] TF 2.2.0rc0 fails with AMP and Horovod 0.19.1 in Keras compile & fit

Following the recent changes in the TensorFlow Keras Optimizer API and in Horovod, we did some testing and found that the following configuration is now broken:

  • TensorFlow 2.2.0rc0
  • Horovod 0.19.1
  • AMP + Keras Model Compile & Fit

@sanjoy @pkanwar23 could we make sure this is fixed before TF 2.2.0 is officially published? It’s still an RC release for now 😃

If needed, you can use this Docker container, which contains the right set of dependencies and is based on the public TF 2.2.0rc0 container:

docker pull born2data/tensorflow:hvd-0.19.1_tf_2.2.0rc0

Code to reproduce:

mpirun \
    -np 2 \
    -H localhost:2 \
    -bind-to none \
    -map-by slot \
    -x PATH \
    -mca pml ob1 -mca btl ^openib \
    --allow-run-as-root \
    python main.py

main.py:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(mnist_images, mnist_labels), _ = \
    tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
             tf.cast(mnist_labels, tf.int64))
)
dataset = dataset.repeat().shuffle(10000).batch(128)

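# AMP: create the mixed-precision policy (fixed loss scale of 128) and make it global.
policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16', 128)
tf.keras.mixed_precision.experimental.set_policy(policy)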

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the feature maps so the Dense layers produce one prediction per image.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Horovod: adjust learning rate based on number of GPUs.
opt = tf.optimizers.Adam(0.001)

# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(opt)

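# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt,
                    metrics=['accuracy'],
                    experimental_run_tf_function=False)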

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
mnist_model.fit(
    dataset,
    steps_per_epoch=500 // hvd.size(),
    callbacks=callbacks,
    verbose=1 if hvd.rank() == 0 else 0
)

Resulting traceback (truncated):

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:503 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:473 train_step  **
        _minimize(tape, self.optimizer, loss, self.trainable_variables)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1739 _minimize
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:232 apply_gradients
        args=(grads_and_vars, name, all_reduce_sum_gradients))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2420 merge_call
        return self._merge_call(merge_fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2427 _merge_call
        return merge_fn(self._strategy, *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:256 _apply_gradients_cross_replica  **
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/smart_cond.py:54 smart_cond
        return true_fn()
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:248 apply_fn
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:262 _apply_gradients
        name, all_reduce_sum_gradients)
    /usr/local/lib/python3.6/dist-packages/horovod/_keras/__init__.py:73 apply_gradients
        raise Exception('`apply_gradients()` was called without a call to '

    Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`.

Please let me know how I can help

CC: @nluehr @reedwm @tgaddair @cliffwoolley @omalleyt12 @houtoms

Ah, we commented at the same time. Yeah, that would fix it, but _HAS_ALL_REDUCE_SUM_GRAD should be set to True only if the inner optimizer defines it.

I will fix.

This is not closed until bcfc1ba6798ced6889f579644c2c79515832d098 is cherry-picked into 2.2.

JFYI @reedwm: changing loss_scale_optimizer.py as follows solves the problem, but we are not sure whether this is just a workaround (WAR).

class LossScaleOptimizer(optimizer_v2.OptimizerV2):
    # Delegate gradient aggregation to the wrapped (inner) optimizer.
    def _aggregate_gradients(self, grads_and_vars):
        return self._optimizer._aggregate_gradients(grads_and_vars)

Hmmm, this is probably because LossScaleOptimizer doesn’t define _HAS_ALL_REDUCE_SUM_GRAD. We should set _HAS_ALL_REDUCE_SUM_GRAD to True if the inner optimizer has set it to True.
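
To make the failure mode discussed in this thread easier to follow, here is a framework-free sketch in plain Python. It is not the TensorFlow or Horovod code: GuardedOptimizer, BrokenWrapper, ForwardingWrapper and train_step are hypothetical stand-ins, and the real call order and signatures in TF 2.2 may differ. Only the names _HAS_ALL_REDUCE_SUM_GRAD, _aggregate_gradients and apply_gradients are taken from the discussion above.

```python
# Hypothetical, simplified stand-ins; none of these classes exist in TF or Horovod.

class GuardedOptimizer:
    """Inner optimizer that refuses to apply gradients it did not aggregate."""

    _HAS_ALL_REDUCE_SUM_GRAD = True

    def __init__(self):
        self._aggregated = False

    def _aggregate_gradients(self, grads_and_vars):
        # A distributed optimizer would all-reduce here; we only record the call.
        self._aggregated = True
        return [grad for grad, _ in grads_and_vars]

    def apply_gradients(self, grads_and_vars):
        if not self._aggregated:
            raise RuntimeError(
                "apply_gradients() was called without _aggregate_gradients()")
        return list(grads_and_vars)


class BrokenWrapper:
    """Wrapper that hides the inner optimizer's flag and aggregation hook."""

    def __init__(self, inner):
        self._optimizer = inner

    def apply_gradients(self, grads_and_vars):
        return self._optimizer.apply_gradients(grads_and_vars)


class ForwardingWrapper(BrokenWrapper):
    """Wrapper that reports the flag only if the inner optimizer defines it."""

    @property
    def _HAS_ALL_REDUCE_SUM_GRAD(self):
        return getattr(self._optimizer, "_HAS_ALL_REDUCE_SUM_GRAD", False)

    def _aggregate_gradients(self, grads_and_vars):
        # Delegate aggregation to the inner optimizer.
        return self._optimizer._aggregate_gradients(grads_and_vars)


def train_step(optimizer, grads, variables):
    """Aggregate gradients if the optimizer advertises the hook, then apply."""
    if getattr(optimizer, "_HAS_ALL_REDUCE_SUM_GRAD", False):
        grads = optimizer._aggregate_gradients(list(zip(grads, variables)))
    return optimizer.apply_gradients(zip(grads, variables))
```

In this sketch, train_step(BrokenWrapper(GuardedOptimizer()), ...) raises because the wrapper never exposes the flag, so aggregation is skipped, while the same call with ForwardingWrapper succeeds. Wrapping an inner optimizer that does not define the flag leaves it False, which matches the "only if the inner optimizer defines it" condition.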