tensorflow: Keras models train correctly with or without the tf.function decorator, but custom models do not

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below):
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: `python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"`
  2. TF 2.0: `python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"`

Describe the current behavior

Describe the expected behavior

Standalone code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 22 (11 by maintainers)

Most upvoted comments

Please do not do softmax and then cross-entropy; use softmax_cross_entropy_with_logits instead.
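The reason for this advice can be shown numerically: applying softmax and then taking the log as two separate steps can overflow/underflow for large logits, while a fused log-softmax (which is what softmax_cross_entropy_with_logits computes internally) stays finite. A minimal numpy sketch; the logits and label below are made up, deliberately extreme values for illustration:

```python
import numpy as np

logits = np.array([1000.0, 0.0, -1000.0])  # made-up, deliberately extreme logits
label = 2                                  # made-up true class index

# Naive route: softmax first, then cross-entropy.
with np.errstate(over='ignore', invalid='ignore'):
    probs = np.exp(logits) / np.exp(logits).sum()  # exp(1000) overflows to inf
    naive_loss = -np.log(probs[label])             # -log(0) -> inf

# Fused route: log_softmax(x) = x - max(x) - log(sum(exp(x - max(x))))
shifted = logits - logits.max()
log_probs = shifted - np.log(np.exp(shifted).sum())
stable_loss = -log_probs[label]

print(np.isfinite(naive_loss), np.isfinite(stable_loss))  # → False True
```

In Keras terms, the same fix is passing `from_logits=True` to the cross-entropy loss instead of wrapping the model output in an explicit softmax.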

On Mon, Mar 16, 2020 at 2:47 PM Milad Toutounchian notifications@github.com wrote:

I have changed the code based on @alextp (https://github.com/alextp), doing softmax first followed by cross-entropy: keras_vs_custom_TF2.zip (https://github.com/tensorflow/tensorflow/files/4340550/keras_vs_custom_TF2.zip)

Now, the custom model's loss is very large compared to the Keras model's, and the custom model's accuracy is not as good as the Keras model's.


  • Alex

Hi, I was able to replicate the issue, thanks for clarifying. I have two distinct answers, and I will start with the obvious but unsatisfactory one, before moving to what appears like an actual issue.

  1. Your custom Model does not abide by the current API. If you replace it with a custom keras Model subclass, such as the one implemented below, then it trains perfectly, with or without tf.function decorating your custom training step.
```python
class ModelKeras(tf.keras.Model):

    def __init__(self):
        super().__init__()
        kwargs = {'kernel_initializer': 'normal', 'bias_initializer': 'normal'}
        self.layer_1 = tf.keras.layers.Dense(512, 'relu', **kwargs)
        self.layer_2 = tf.keras.layers.Dense(512, 'relu', **kwargs)
        self.out_layer = tf.keras.layers.Dense(10, **kwargs)

    @property
    def trainable_vars(self):  # merely to leave the rest of the code unchanged
        return self.trainable_variables

    def call(self, inputs):
        output = self.layer_1(inputs)
        output = self.layer_2(output)
        return self.out_layer(output)
```
  2. Gradient computation differs when tf.function decorates propagate. I do not get why, but here is the test I ran:
```python
# Define a function to compute gradients of a network's weights w.r.t. a given batch.
def compute_gradients(model, x_batch, y_batch):
    with tf.GradientTape() as tape:
        pred = tf.nn.softmax(model(x_batch))
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_batch, pred)
    return tape.gradient(loss, model.trainable_vars)

# Make a tf.function-decorated copy of the previous.
decorated_gradients = tf.function(compute_gradients)

# Gather a training batch.
x_batch, y_batch = next(iter(mnist_dataset()))
# Instantiate two models and build them.
model_custom = Model()  # weights are built at instantiation
model_keras = ModelKeras()
_ = model_keras(x_batch)  # build weights through sample processing
# Set the second model's weights equal to those of the first one.
weights = [
    w.numpy() for pair in zip(model_custom.trainable_vars[:3], model_custom.trainable_vars[3:])
    for w in pair
]
model_keras.set_weights(weights)

# Compute gradients for both models without tf.function.
# Save for ordering, the results are the same for both, as should be.
compute_gradients(model_custom, x_batch, y_batch)
compute_gradients(model_keras, x_batch, y_batch)

# Compute gradients for both models with tf.function.
# Save for ordering, the results are the same for both, as should be.
# However, they differ from the outputs of the non-decorated function, which is weird.
decorated_gradients(model_custom, x_batch, y_batch)
decorated_gradients(model_keras, x_batch, y_batch)
```

So, for some reason, it appears that tf.function decoration changes the way gradients are computed, which might be the cause of the model’s lack of convergence. As a matter of fact, when not decorated, the computed gradients tend to be very sparse, i.e. there are a lot of zero values resulting in most weights not being updated during the training step. I do not know why this is the case; it would seem that part of the computation is not properly tracked?
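To make the "save for ordering, the results are the same" observation concrete, one can compare the gradient lists numerically instead of by eye. A small helper along these lines; this is a hypothetical addition, not part of the original script, and it assumes the gradients have already been converted to numpy arrays (e.g. via `g.numpy()`):

```python
import numpy as np

def gradients_match(grads_a, grads_b, atol=1e-6):
    """Check that two lists of gradient arrays agree elementwise.

    Returns (all_match, sparsity_a, sparsity_b), where sparsity is the
    fraction of exactly-zero entries -- useful for spotting the very
    sparse gradients described above.
    """
    if len(grads_a) != len(grads_b):
        return False, None, None
    match = all(np.allclose(a, b, atol=atol) for a, b in zip(grads_a, grads_b))
    sparsity = lambda gs: float(np.mean([np.mean(g == 0) for g in gs]))
    return match, sparsity(grads_a), sparsity(grads_b)
```

Calling it as `gradients_match([g.numpy() for g in eager_grads], [g.numpy() for g in decorated_grads])` would flag the eager/decorated discrepancy directly, and the sparsity values would show whether the undecorated gradients are indeed mostly zero.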

Now, the keras Model also trains better because, in spite of my forcing the use of random normal weight initializers, its initial weights (when not forcefully replaced as in the previous test) are smaller than those generated in the custom model. This seems to result in smoother initial predictions and may explain why it is easier to train.
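The initializer point can be checked in isolation: the Keras 'normal' initializer draws weights with a small standard deviation (0.05 by default for RandomNormal), whereas tf.random.normal defaults to a standard deviation of 1.0, so the custom model's pre-activations start out roughly 20x larger. A numpy sketch of the effect; the layer sizes mirror the model above, and the stddev values are the documented defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 784))  # a fake MNIST-sized batch

w_keras = rng.normal(scale=0.05, size=(784, 512))  # keras 'normal' initializer default
w_custom = rng.normal(scale=1.0, size=(784, 512))  # tf.random.normal default

# The spread of the pre-activations grows linearly with the weight stddev,
# so the custom model starts from far more extreme logits.
print(np.std(x @ w_keras), np.std(x @ w_custom))
```

Larger initial logits push the softmax toward saturated, near-one-hot outputs, which is consistent with the harder-to-train behaviour observed for the custom model.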

@Saduf2019 Decorating with @tf.function may indeed benefit execution runtime; however, it should not have any effect on the accuracy reached, unless there is either a tensorflow bug or some error-inducing side effect within @miladtoutounchian's code.

@miladtoutounchian I have run the code shared by you on tf 2.1 with and without @tf.function and did not face any issues; please find the gist for the same. The same code runs without any issues on nightly as well. In case you are still facing the issue, please share a gist where the error is seen, along with error logs if any.