tensorflow: Restoring Keras model fails inside a distribution strategy scope

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • TensorFlow installed from (source or binary): binary (using pip)
  • TensorFlow version (use command below): both v1.14.0-rc1-22-gaf24dc9 1.14.0 and v2.0.0-beta0-17-g8e423e3 2.0.0-beta1
  • Python version: 3.7.3
  • CUDA/cuDNN version: CUDA 10.1.168-4, cuDNN 7.6.1.34-1
  • GPU model and memory: NVIDIA Quadro P2000, 4GB

Describe the current behavior Inside a distribution strategy scope, restoring a Keras model that has been trained (even for a single step) with tf.keras.models.load_model raises the exception shown below, apparently while handling the optimizer.

(Looks a bit similar to #28599 if you squint, but many details differ.)

Describe the expected behavior Restoring the model should succeed.

Code to reproduce the issue

import numpy as np, tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
path = "/tmp/model.hdf5"

with strategy.scope():
    # Construct model.
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer=tf.keras.optimizers.SGD(), loss=tf.keras.metrics.mse)
    # Do a fit so the optimizer weights are created. Removing this lets the restore succeed.
    model.fit(np.array([[1]]), np.array([[1]]))
    # Save and attempt to restore.
    tf.keras.models.save_model(model, path)
    tf.keras.models.load_model(path)

Other info / logs Traceback for TF 2.0 (TF 1.14 is the same except for line numbers):

  File ".../tensorflow/python/keras/saving/save.py", line 137, in load_model
    return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
  File ".../tensorflow/python/keras/saving/hdf5_format.py", line 187, in load_model_from_hdf5
    model._make_train_function()
  File ".../tensorflow/python/keras/engine/training.py", line 1974, in _make_train_function
    params=self._collected_trainable_weights, loss=self.total_loss)
  File ".../tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 491, in get_updates
    grads = self.get_gradients(loss, params)
  File ".../tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 391, in get_gradients
    grads = gradients.gradients(loss, params)
  File ".../tensorflow/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File ".../tensorflow/python/ops/gradients_util.py", line 543, in _GradientsHelper
    for x in xs
  File ".../tensorflow/python/ops/gradients_util.py", line 543, in <listcomp>
    for x in xs
  File ".../tensorflow/python/distribute/values.py", line 643, in handle
    raise ValueError("`handle` is not available outside the replica context"
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 20 (7 by maintainers)

Most upvoted comments

I think the optimizer weights issue is because sometimes optimizer weights are only created during the first step. So when you try to load the weights before training starts, the weights have not yet been created. One workaround can be to just do one step first and then load the weights. @k-w-w, do you know what could be a better alternative for loading optimizer weights?
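As a rough sketch of that workaround (illustrative only, not from the thread; saved_weights and saved_optimizer_weights are assumed to have been read from a model loaded outside the strategy scope, and strategy/np are as in the repro above):

with strategy.scope():
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer=tf.keras.optimizers.SGD(), loss="mse")
    # One throwaway step on dummy data so the optimizer weights get created.
    model.fit(np.zeros((1, 1)), np.zeros((1, 1)), verbose=0)
    # Now the weight counts match and the saved state can be copied in.
    model.set_weights(saved_weights)
    model.optimizer.set_weights(saved_optimizer_weights)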

Thanks, it seems to work with the 2.0 preview version.

For TF 1.14, I’ve found a workaround, which I’ll document here for anyone else coming across this:

  1. Save the model with tf.keras.models.save_model.
  2. Load a copy of the model outside the strategy scope using tf.keras.models.load_model. Retrieve its weights using model.get_weights() and model.optimizer.get_weights().
  3. Create and compile a new copy of the model inside the scope, in the same way as you did the first time.
  4. Now, in a callback, copy the loaded weights to the newly compiled model and its optimizer. The callback is key: the optimizer weights are not created before the call to model.fit, so the copying cannot be done in the normal flow of code, but the on_train_begin callback runs after the weights are created and before any training has happened.

An example of what the callback might look like:

class LoadWeightsCallback(tf.keras.callbacks.Callback):
    # Run on every worker, not just the chief.
    _chief_worker_only = False

    def __init__(self, weights, optimizer_weights):
        super().__init__()
        self.weights = weights
        self.optimizer_weights = optimizer_weights

    def on_train_begin(self, logs=None):
        # By this point fit has created the optimizer weights, so both
        # set_weights calls line up with the existing variables.
        self.model.set_weights(self.weights)
        self.model.optimizer.set_weights(self.optimizer_weights)
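
Put together, the workaround might look roughly like this (a sketch: path, x, and y stand for the saved file and the training data, and make_model for whatever builds and compiles the model):

# Step 2: load a copy outside the strategy scope and pull out its state.
loaded = tf.keras.models.load_model(path)
weights = loaded.get_weights()
optimizer_weights = loaded.optimizer.get_weights()

# Step 3: build and compile a fresh copy inside the scope.
with strategy.scope():
    model = make_model()
    # Step 4: the callback copies the state in once fit has created the
    # optimizer weights, before any actual training happens.
    model.fit(x, y, callbacks=[LoadWeightsCallback(weights, optimizer_weights)])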

For TF 2.0, I have tf.version.VERSION == '2.0.0-beta1' and tf.version.GIT_VERSION == 'v2.0.0-beta0-17-g8e423e3'. I installed via pip install tensorflow-gpu==2.0.0b1. That is the latest released version on both GitHub and PyPI, as far as I can tell.

If I understand TF correctly, I don’t need to store the graph, as I am restoring in Python with access to the same code that created the model originally. So perhaps save_model and load_model are overkill. However, I do need the state of the optimizer, which rules out using just model.{save,load}_weights. I also attempted to load one copy of the model outside the scope, construct another copy inside the scope, and then use model.{get,set}_weights and model.optimizer.{get,set}_weights, but it seems that the optimizer weights are not created until model.fit has been called with data, so optimizer.set_weights fails with a mismatch in the expected number of weights.

In TF 1.14, I also tried tf.contrib.saved_model.{save,load}_keras_model, which fails in a different way:

import numpy as np, tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
path = "/tmp/model"

with strategy.scope():
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer=tf.keras.optimizers.SGD(), loss=tf.keras.metrics.mse)
    model.fit(np.array([[1]]), np.array([[1]]))
    tf.contrib.saved_model.save_keras_model(model, path)

    model = tf.contrib.saved_model.load_keras_model(path)
    model.compile(optimizer=tf.keras.optimizers.SGD(), loss=tf.keras.metrics.mse)
    model.fit(np.array([[1]]), np.array([[1]]))

gives this exception inside the second model.fit:

  File ".../tensorflow/python/keras/engine/training.py", line 649, in fit
    validation_freq=validation_freq)
  File ".../tensorflow/python/keras/engine/training_distributed.py", line 143, in fit_distributed
    steps_name='steps_per_epoch')
  File ".../tensorflow/python/keras/engine/training_arrays.py", line 274, in model_iteration
    batch_outs = f(actual_inputs)
  File ".../tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File ".../tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable dense_2/kernel from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/dense_2/kernel/N10tensorflow3VarE does not exist.
         [[{{node dense_3/MatMul/ReadVariableOp}}]]

Hi - we don’t support saving with the HDF5 format. However, you can save and restore with the standard TF format - just remove the .hdf5 extension from the file path. See https://www.tensorflow.org/beta/tutorials/distribute/save_and_load for more information.
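
For reference, that suggestion applied to the original repro might look like this (a sketch; behavior may vary with the exact TF 2.0 build):

import numpy as np, tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
path = "/tmp/model"  # no .hdf5 extension, so the TF SavedModel format is used

with strategy.scope():
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer=tf.keras.optimizers.SGD(), loss="mse")
    model.fit(np.array([[1.0]]), np.array([[1.0]]))
    tf.keras.models.save_model(model, path)      # writes a SavedModel directory
    restored = tf.keras.models.load_model(path)  # restore inside the scope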