addons: Missing argument in apply_gradients() in AdamW optimizer

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow version and how it was installed (source or binary): tf-nightly 2.2.0-dev20200309 (pip install)
  • TensorFlow-Addons version and how it was installed (source or binary): 0.8.3 (pip install)
  • Python version: 3.6
  • Is GPU used? (yes/no): yes

Describe the bug

When running model.compile() with the AdamW optimizer, a TypeError is raised: apply_gradients() got an unexpected keyword argument 'all_reduce_sum_gradients'

This can be fixed by adding the missing argument to apply_gradients() in tensorflow_addons/optimizers/weight_decay_optimizers.py.
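The failure mode can be shown without TensorFlow at all: a wrapper that hard-codes the apply_gradients() signature breaks as soon as Keras passes a new keyword, while a wrapper that forwards **kwargs survives. The classes below are illustrative stand-ins, not the actual tensorflow_addons code.

```python
class BaseOptimizer:
    # Stand-in for the newer Keras base optimizer, which accepts the
    # new keyword argument.
    def apply_gradients(self, grads_and_vars, name=None,
                        all_reduce_sum_gradients=True):
        return list(grads_and_vars)


class BrittleWrapper(BaseOptimizer):
    # Hard-coded signature: raises TypeError when the caller passes a
    # keyword this override does not know about.
    def apply_gradients(self, grads_and_vars, name=None):
        return super().apply_gradients(grads_and_vars, name=name)


class ForwardingWrapper(BaseOptimizer):
    # Forwarding **kwargs lets new keyword arguments pass through
    # untouched, so the wrapper survives upstream API additions.
    def apply_gradients(self, grads_and_vars, name=None, **kwargs):
        return super().apply_gradients(grads_and_vars, name=name, **kwargs)


grads = [(0.1, "w")]
try:
    BrittleWrapper().apply_gradients(grads, all_reduce_sum_gradients=False)
except TypeError as e:
    print("brittle:", e)  # reproduces the reported TypeError
print("forwarding:", ForwardingWrapper().apply_gradients(
    grads, all_reduce_sum_gradients=False))
```

This is why accepting and forwarding **kwargs (rather than pinning the exact signature) is the more robust shape for a wrapper whose base class is still evolving.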

Code to reproduce the issue

https://colab.research.google.com/drive/1A6X8yYii5M8BDqwAFvoFglrTIqWvwQLm

Other info / logs

TypeError: in user code:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:503 train_function  *
    outputs = self.distribute_strategy.experimental_run_v2(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:920 experimental_run_v2  **
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2254 call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2615 _call_for_each_replica
    return fn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:473 train_step  **
    _minimize(tape, self.optimizer, loss, self.trainable_variables)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1737 _minimize
    all_reduce_sum_gradients=False)

TypeError: apply_gradients() got an unexpected keyword argument 'all_reduce_sum_gradients'

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 22 (17 by maintainers)

Most upvoted comments

Your code works with tensorflow==2.1.0 and tfa-nightly.

We currently target the stable version of TensorFlow (2.1.0) to, as you can see, preserve our developers' sanity, so we don't expect everything to work with tf-nightly.

But what you are reporting is concerning. Either:

  • the bug comes from TensorFlow and they're not being backward compatible there, or
  • we use some undocumented/private API in AdamW and should remove it.

In any case, let's keep this issue open.

I solved it. Just upgrade to the new version, tfa 0.10.0: pip install tensorflow_addons==0.10.0

Sure, feel free to ping me if I happen to miss the announcement 😃

I’m getting a similar error:

    TypeError: tf__apply_gradients() got an unexpected keyword argument 'experimental_aggregate_gradients'

This is from attempting a custom model

from tensorflow.python.keras.mixed_precision.experimental import loss_scale_optimizer as lso
from tensorflow.python.distribute import parameter_server_strategy

def _minimize(strategy, tape, optimizer, loss, trainable_variables):
    with tape:
        if isinstance(optimizer, lso.LossScaleOptimizer):
            loss = optimizer.get_scaled_loss(loss)

    gradients = tape.gradient(loss, trainable_variables)
    # ClipIfNotNone / ClipIfNotNone2 are user-defined gradient-clipping
    # helpers (not shown here).
    gradients = [ClipIfNotNone(grad) for grad in gradients]
    gradients = [ClipIfNotNone2(grad) for grad in gradients]
    # Whether to aggregate gradients outside of optimizer. This requires support
    # of the optimizer and doesn't work with ParameterServerStrategy and
    # CentralStorageStrategy.
    aggregate_grads_outside_optimizer = (
        optimizer._HAS_AGGREGATE_GRAD and  # pylint: disable=protected-access
        not isinstance(strategy.extended,
                        parameter_server_strategy.ParameterServerStrategyExtended))

    if aggregate_grads_outside_optimizer:
        # We aggregate gradients before unscaling them, in case a subclass of
        # LossScaleOptimizer all-reduces in fp16. All-reducing in fp16 can only be
        # done on scaled gradients, not unscaled gradients, for numeric stability.
        gradients = optimizer._aggregate_gradients(zip(gradients,  # pylint: disable=protected-access
                                                    trainable_variables))
    if isinstance(optimizer, lso.LossScaleOptimizer):
        gradients = optimizer.get_unscaled_gradients(gradients)
    gradients = optimizer._clip_gradients(gradients)  # pylint: disable=protected-access
    if trainable_variables:
        if aggregate_grads_outside_optimizer:
            optimizer.apply_gradients(
                zip(gradients, trainable_variables),
                experimental_aggregate_gradients=False)
        else:
            optimizer.apply_gradients(zip(gradients, trainable_variables))

class CustomModel(tf.keras.Model):
    def train_step(self, data):
        # Unpack the data. Its structure depends on your model and
        # on what you pass to `fit()`.
        x = data
        y = tf.constant([1.0], dtype=tf.float32)

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)  # Forward pass
            # Compute the loss value
            # (the loss function is configured in `compile()`)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        
        _minimize(self.distribute_strategy, tape, self.optimizer, loss,
                self.trainable_variables)

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

This is with the AdamW optimizer as well.
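Until the wrapper is fixed upstream, one generic way to sidestep this class of error in custom training code is to filter out keyword arguments the target method does not accept before calling it. The helper below is an illustrative sketch, not part of TensorFlow's API; `old_apply_gradients` is a hypothetical stand-in for an optimizer method that predates the new keyword.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Drop keyword arguments that `func` does not accept, so newer
    callers (which pass e.g. experimental_aggregate_gradients) don't
    crash an older signature with a TypeError."""
    params = inspect.signature(func).parameters
    accepts_var_kw = any(p.kind is inspect.Parameter.VAR_KEYWORD
                         for p in params.values())
    if not accepts_var_kw:
        # Keep only the keywords the function explicitly declares.
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **kwargs)


def old_apply_gradients(grads_and_vars, name=None):
    # Hypothetical stand-in for a wrapped apply_gradients that does not
    # know about the new keyword.
    return list(grads_and_vars)


# The unsupported keyword is silently dropped instead of raising:
result = call_with_supported_kwargs(
    old_apply_gradients, [(0.1, "w")],
    experimental_aggregate_gradients=False)
```

This keeps a custom train_step working across TF versions, at the cost of silently ignoring the aggregation keyword, so it is a stopgap rather than a real fix.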