tensorflow: [Bug] TF 2.2.0rc0 fails with AMP and Horovod 0.19.1 in Keras compile & fit
With the recent changes in the TensorFlow Keras optimizer API and in Horovod, we did some testing and found that the following configuration is now broken:
- TensorFlow 2.2.0rc0
- Horovod 0.19.1
- AMP + Keras Model Compile & Fit
@sanjoy @pkanwar23 could we make sure to fix this one before TF 2.2.0 is officially published? It’s still an RC release for now 😃
If needed, you can use this Docker container, which contains the right set of dependencies and is based on the public TF 2.2.0rc0 container:
docker pull born2data/tensorflow:hvd-0.19.1_tf_2.2.0rc0
Code to reproduce:
mpirun \
-np 2 \
-H localhost:2 \
-bind-to none \
-map-by slot \
-x NCCL_DEBUG=VERSION \
-x LD_LIBRARY_PATH \
-x PATH \
-mca pml ob1 -mca btl ^openib \
--allow-run-as-root \
python main.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(mnist_images, mnist_labels), _ = \
    tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
     tf.cast(mnist_labels, tf.int64))
)
dataset = dataset.repeat().shuffle(10000).batch(128)

policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16', 128)
tf.keras.mixed_precision.experimental.set_policy(policy)

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Horovod: adjust learning rate based on number of GPUs.
opt = tf.optimizers.Adam(0.001)

# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(opt)

# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt,
                    metrics=['accuracy'],
                    experimental_run_tf_function=False)

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
mnist_model.fit(
    dataset,
    steps_per_epoch=500 // hvd.size(),
    callbacks=callbacks,
    epochs=24,
    verbose=1 if hvd.rank() == 0 else 0
)
Error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:503 train_function *
    outputs = self.distribute_strategy.run(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run **
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
    return fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:473 train_step **
    _minimize(tape, self.optimizer, loss, self.trainable_variables)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1739 _minimize
    optimizer.apply_gradients(zip(gradients, trainable_variables))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:232 apply_gradients
    args=(grads_and_vars, name, all_reduce_sum_gradients))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2420 merge_call
    return self._merge_call(merge_fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2427 _merge_call
    return merge_fn(self._strategy, *args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:256 _apply_gradients_cross_replica **
    control_flow_ops.no_op)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/smart_cond.py:54 smart_cond
    return true_fn()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:248 apply_fn
    all_reduce_sum_gradients))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
    return fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:262 _apply_gradients
    name, all_reduce_sum_gradients)
/usr/local/lib/python3.6/dist-packages/horovod/_keras/__init__.py:73 apply_gradients
    raise Exception('`apply_gradients()` was called without a call to '

Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`.
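For context, the chain the traceback walks can be pictured with the snippet below. This is only an illustration of the wrapping that compile() performs automatically when the mixed_float16 policy is active (names follow TF 2.2's experimental mixed precision API), not part of the repro script:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
inner = hvd.DistributedOptimizer(tf.optimizers.Adam(0.001))
# Under the mixed_float16 policy, compile() wraps the optimizer roughly like:
wrapped = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    inner, loss_scale=128)
# Keras' _minimize() then aggregates gradients on `wrapped`, which in
# 2.2.0rc0 does not forward to `inner`, so Horovod's allreduce hook never
# runs and its apply_gradients() raises the exception shown above.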
Please let me know how I can help.
CC: @nluehr @reedwm @tgaddair @cliffwoolley @omalleyt12 @houtoms
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 15 (13 by maintainers)
Commits related to this issue
- Set _HAS_AGGREGATE_GRAD in LossScaleOptimizer. This will fix https://github.com/tensorflow/tensorflow/issues/37765 once cherrypicked into 2.2. PiperOrigin-RevId: 302949295 Change-Id: I9f4370f6c3cb49... — committed to crccw/tensorflow by reedwm 4 years ago
- Delegate _aggregate_gradients in LossScaleOptimizer. This is needed for https://github.com/tensorflow/tensorflow/issues/37765. PiperOrigin-RevId: 303449065 Change-Id: I9f0f9e3a3857818e5164c0181406be... — committed to tensorflow/tensorflow by reedwm 4 years ago
- Delegate _aggregate_gradients in LossScaleOptimizer. This is needed for https://github.com/tensorflow/tensorflow/issues/37765. PiperOrigin-RevId: 303449065 Change-Id: I9f0f9e3a3857818e5164c0181406be... — committed to reedwm/tensorflow by reedwm 4 years ago
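For readers following along, a minimal sketch of what "delegate _aggregate_gradients" means here (an assumed shape, not the literal patch; only the method name _aggregate_gradients is taken from TF 2.2's OptimizerV2 internals):

# Sketch only: an optimizer wrapper forwarding gradient aggregation to the
# optimizer it wraps, so a Horovod DistributedOptimizer nested inside it
# still gets to run its allreduce.
class DelegatingWrapperSketch:
    def __init__(self, inner_optimizer):
        self._optimizer = inner_optimizer

    def _aggregate_gradients(self, grads_and_vars):
        # Forward to the wrapped optimizer instead of doing the default
        # cross-replica sum, so the inner optimizer's hook is not skipped.
        return self._optimizer._aggregate_gradients(grads_and_vars)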
Ah we commented at the same time. Yeah that would fix it but it should be set to True only if the inner optimizer defined it.
I will fix.
This is not closed until bcfc1ba6798ced6889f579644c2c79515832d098 is cherrypicked into 2.2
JFYI @reedwm: changing loss_scale_optimizer.py as follows can solve the problem, but we are not sure whether this is just a WAR (workaround).
Hmmm this is probably because LossScaleOptimizer doesn’t define _HAS_ALL_REDUCE_SUM_GRAD. We should set _HAS_ALL_REDUCE_SUM_GRAD to True if the inner optimizer has set it to true.
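A minimal sketch of the conditional flag propagation discussed in the last two comments, assuming the attribute name _HAS_ALL_REDUCE_SUM_GRAD used above (the commit messages refer to it as _HAS_AGGREGATE_GRAD):

# Sketch only: advertise the aggregation capability on the wrapper only if
# the inner optimizer defined it, defaulting to False otherwise.
class FlagPropagatingWrapperSketch:
    def __init__(self, inner_optimizer):
        self._optimizer = inner_optimizer
        self._HAS_ALL_REDUCE_SUM_GRAD = getattr(
            inner_optimizer, '_HAS_ALL_REDUCE_SUM_GRAD', False)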