tensorflow: [Bug] TF 2.2.0rc0 fails with AMP and Horovod 0.19.1 in Keras compile & fit
With the recent changes in the TensorFlow Keras optimizer API and in Horovod, we did some testing and found that the following configuration is now broken:
- TensorFlow 2.2.0rc0
- Horovod 0.19.1
- AMP + Keras Model Compile & Fit
@sanjoy @pkanwar23 could we make sure to fix this one before TF 2.2.0 is officially published? It’s still an RC release for now 😃
If needed, you can use this Docker container, which contains the right set of dependencies and is based on the public TF 2.2.0rc0 container:
docker pull born2data/tensorflow:hvd-0.19.1_tf_2.2.0rc0
Code to reproduce:
mpirun \
-np 2 \
-H localhost:2 \
-bind-to none \
-map-by slot \
-x NCCL_DEBUG=VERSION \
-x LD_LIBRARY_PATH \
-x PATH \
-mca pml ob1 -mca btl ^openib \
--allow-run-as-root \
python main.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(mnist_images, mnist_labels), _ = \
    tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
     tf.cast(mnist_labels, tf.int64))
)
dataset = dataset.repeat().shuffle(10000).batch(128)

policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16', 128)
tf.keras.mixed_precision.experimental.set_policy(policy)

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Horovod: adjust learning rate based on number of GPUs.
opt = tf.optimizers.Adam(0.001)

# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(opt)

# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt,
                    metrics=['accuracy'],
                    experimental_run_tf_function=False)

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
mnist_model.fit(
    dataset,
    steps_per_epoch=500 // hvd.size(),
    callbacks=callbacks,
    epochs=24,
    verbose=1 if hvd.rank() == 0 else 0
)
Error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:503 train_function *
    outputs = self.distribute_strategy.run(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run **
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
    return fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:473 train_step **
    _minimize(tape, self.optimizer, loss, self.trainable_variables)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1739 _minimize
    optimizer.apply_gradients(zip(gradients, trainable_variables))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:232 apply_gradients
    args=(grads_and_vars, name, all_reduce_sum_gradients))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2420 merge_call
    return self._merge_call(merge_fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2427 _merge_call
    return merge_fn(self._strategy, *args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:256 _apply_gradients_cross_replica **
    control_flow_ops.no_op)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/smart_cond.py:54 smart_cond
    return true_fn()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:248 apply_fn
    all_reduce_sum_gradients))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
    return fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/mixed_precision/experimental/loss_scale_optimizer.py:262 _apply_gradients
    name, all_reduce_sum_gradients)
/usr/local/lib/python3.6/dist-packages/horovod/_keras/__init__.py:73 apply_gradients
    raise Exception('`apply_gradients()` was called without a call to '

Exception: `apply_gradients()` was called without a call to `get_gradients()` or `_aggregate_gradients`. If you're using TensorFlow 2.0, please specify `experimental_run_tf_function=False` in `compile()`.
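For context, the chain the traceback walks can be pictured with the snippet below. This is only an illustration of the wrapping that compile() performs automatically when the mixed_float16 policy is active (names follow TF 2.2's experimental mixed precision API), not part of the repro script:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
inner = hvd.DistributedOptimizer(tf.optimizers.Adam(0.001))
# Under the mixed_float16 policy, compile() wraps the optimizer roughly like:
wrapped = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    inner, loss_scale=128)
# Keras' _minimize() then aggregates gradients on `wrapped`, which in
# 2.2.0rc0 does not forward to `inner`, so Horovod's allreduce hook never
# runs and its apply_gradients() raises the exception shown above.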
Please let me know how I can help.
CC: @nluehr @reedwm @tgaddair @cliffwoolley @omalleyt12 @houtoms
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 15 (13 by maintainers)
Commits related to this issue
- Set _HAS_AGGREGATE_GRAD in LossScaleOptimizer. This will fix https://github.com/tensorflow/tensorflow/issues/37765 once cherrypicked into 2.2. PiperOrigin-RevId: 302949295 Change-Id: I9f4370f6c3cb49... — committed to crccw/tensorflow by reedwm 4 years ago
- Delegate _aggregate_gradients in LossScaleOptimizer. This is needed for https://github.com/tensorflow/tensorflow/issues/37765. PiperOrigin-RevId: 303449065 Change-Id: I9f0f9e3a3857818e5164c0181406be... — committed to tensorflow/tensorflow by reedwm 4 years ago
- Delegate _aggregate_gradients in LossScaleOptimizer. This is needed for https://github.com/tensorflow/tensorflow/issues/37765. PiperOrigin-RevId: 303449065 Change-Id: I9f0f9e3a3857818e5164c0181406be... — committed to reedwm/tensorflow by reedwm 4 years ago
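For readers following along, a minimal sketch of what "delegate _aggregate_gradients" means here (an assumed shape, not the literal patch; only the method name _aggregate_gradients is taken from TF 2.2's OptimizerV2 internals):

# Sketch only: an optimizer wrapper forwarding gradient aggregation to the
# optimizer it wraps, so a Horovod DistributedOptimizer nested inside it
# still gets to run its allreduce.
class DelegatingWrapperSketch:
    def __init__(self, inner_optimizer):
        self._optimizer = inner_optimizer

    def _aggregate_gradients(self, grads_and_vars):
        # Forward to the wrapped optimizer instead of doing the default
        # cross-replica sum, so the inner optimizer's hook is not skipped.
        return self._optimizer._aggregate_gradients(grads_and_vars)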
Ah we commented at the same time. Yeah that would fix it but it should be set to True only if the inner optimizer defined it.
I will fix.
This is not closed until bcfc1ba6798ced6889f579644c2c79515832d098 is cherrypicked into 2.2
JFYI @reedwm: changing loss_scale_optimizer.py as follows can solve the problem, but we are not sure whether this is just a WAR (workaround).
Hmmm this is probably because LossScaleOptimizer doesn’t define _HAS_ALL_REDUCE_SUM_GRAD. We should set _HAS_ALL_REDUCE_SUM_GRAD to True if the inner optimizer has set it to true.
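A minimal sketch of the conditional flag propagation discussed in the last two comments, assuming the attribute name _HAS_ALL_REDUCE_SUM_GRAD used above (the commit messages refer to it as _HAS_AGGREGATE_GRAD):

# Sketch only: advertise the aggregation capability on the wrapper only if
# the inner optimizer defined it, defaulting to False otherwise.
class FlagPropagatingWrapperSketch:
    def __init__(self, inner_optimizer):
        self._optimizer = inner_optimizer
        self._HAS_ALL_REDUCE_SUM_GRAD = getattr(
            inner_optimizer, '_HAS_ALL_REDUCE_SUM_GRAD', False)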