tensorflow: Restoring Keras model fails inside a distribution strategy scope
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
- TensorFlow installed from (source or binary): binary (using pip)
- TensorFlow version (use command below): both v1.14.0-rc1-22-gaf24dc9 1.14.0 and v2.0.0-beta0-17-g8e423e3 2.0.0-beta1
- Python version: 3.7.3
- CUDA/cuDNN version: CUDA 10.1.168-4, cuDNN 7.6.1.34-1
- GPU model and memory: NVIDIA Quadro P2000, 4GB
Describe the current behavior
Inside a distribution strategy scope, restoring a Keras model that has been trained (even for a single step) with tf.keras.models.load_model raises the exception shown below; the failure seems to occur while handling the optimizer in particular.
(Looks a bit similar to #28599 if you squint, but many details differ.)
Describe the expected behavior
Restoring the model should succeed.
Code to reproduce the issue
import numpy as np, tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
path = "/tmp/model.hdf5"

with strategy.scope():
    # Construct model.
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer=tf.keras.optimizers.SGD(), loss=tf.keras.metrics.mse)
    # Do a fit so the optimizer weights are created. Removing this lets the restore succeed.
    model.fit(np.array([[1]]), np.array([[1]]))
    # Save and attempt to restore.
    tf.keras.models.save_model(model, path)
    tf.keras.models.load_model(path)
Other info / logs
Traceback for TF 2.0 (the TF 1.14 traceback is the same except for line numbers):
File ".../tensorflow/python/keras/saving/save.py", line 137, in load_model
return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
File ".../tensorflow/python/keras/saving/hdf5_format.py", line 187, in load_model_from_hdf5
model._make_train_function()
File ".../tensorflow/python/keras/engine/training.py", line 1974, in _make_train_function
params=self._collected_trainable_weights, loss=self.total_loss)
File ".../tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 491, in get_updates
grads = self.get_gradients(loss, params)
File ".../tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 391, in get_gradients
grads = gradients.gradients(loss, params)
File ".../tensorflow/python/ops/gradients_impl.py", line 158, in gradients
unconnected_gradients)
File ".../tensorflow/python/ops/gradients_util.py", line 543, in _GradientsHelper
for x in xs
File ".../tensorflow/python/ops/gradients_util.py", line 543, in <listcomp>
for x in xs
File ".../tensorflow/python/distribute/values.py", line 643, in handle
raise ValueError("`handle` is not available outside the replica context"
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 20 (7 by maintainers)
I think the optimizer weights issue arises because the optimizer weights are sometimes only created during the first training step. So when you try to load the weights before training starts, they have not yet been created. One workaround is to run one step first and then load the weights. @k-w-w do you know what could be a better alternative for loading optimizer weights?
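A minimal sketch of that workaround, reusing the toy model from the reproduction above (the dummy step data and the saved_model_weights / saved_optimizer_weights variables, previously obtained via get_weights(), are placeholders of mine, not from the original comment):

import numpy as np, tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Rebuild the architecture from code and compile as before.
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer=tf.keras.optimizers.SGD(), loss="mse")
    # Run a single dummy step so the optimizer slot variables get created...
    model.fit(np.array([[1.0]]), np.array([[1.0]]), epochs=1, verbose=0)
    # ...then overwrite both model and optimizer state with the saved values.
    model.set_weights(saved_model_weights)
    model.optimizer.set_weights(saved_optimizer_weights)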
Thanks, it seems to work with the 2.0 preview version.
For TF 1.14, I’ve found a workaround, which I’ll document here for anyone else coming across this:
Save the model with tf.keras.models.save_model. Outside the strategy scope, restore it with tf.keras.models.load_model. Retrieve its weights using model.get_weights() and model.optimizer.get_weights(). Then construct the model again from code inside the scope and copy those weights over. The optimizer weights are only created during model.fit, so the copying cannot be done in the normal flow of code, but the on_train_begin callback runs after the weights are created and before any training has happened. An example of what the callback might look like:
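(A sketch of such a callback, not the original snippet; model_weights and optimizer_weights stand for the lists retrieved outside the scope with get_weights().)

import tensorflow as tf

class RestoreWeightsCallback(tf.keras.callbacks.Callback):
    """Copies previously saved weights once training setup has created
    the model and optimizer variables, but before any step has run."""

    def __init__(self, model_weights, optimizer_weights):
        super().__init__()
        self.model_weights = model_weights          # from model.get_weights()
        self.optimizer_weights = optimizer_weights  # from model.optimizer.get_weights()

    def on_train_begin(self, logs=None):
        self.model.set_weights(self.model_weights)
        self.model.optimizer.set_weights(self.optimizer_weights)

# Passed to the first fit() inside the strategy scope, e.g.:
# model.fit(x, y, callbacks=[RestoreWeightsCallback(model_weights, optimizer_weights)])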
For TF 2.0, I have tf.version.VERSION == '2.0.0-beta1' and tf.version.GIT_VERSION == 'v2.0.0-beta0-17-g8e423e3'. I installed via pip install tensorflow-gpu==2.0.0b1. That is the latest released version on both GitHub and PyPI, as far as I can tell.
If I understand TF correctly, I don't need to store the graph, as I am restoring in Python with access to the same code that created the model originally, so perhaps save_model and load_model are overkill. However, I do need the state of the optimizer, which rules out using just model.{save,load}_weights. I also attempted to load one copy of the model outside the scope, construct another copy inside the scope, and then use model.{get,set}_weights and model.optimizer.{get,set}_weights, but it seems that the optimizer weights are not created until model.fit has been called with data, so that runs into a mismatch in the expected number of weights.
In TF 1.14, I also tried tf.contrib.saved_model.{save,load}_keras_model, which fails in a different way: it gives this exception inside the second model.fit:

Hi - we don't support saving with the hdf5 format. However, you can save and restore with the standard TF format - just remove the hdf5 extension from the file path. See https://www.tensorflow.org/beta/tutorials/distribute/save_and_load for more information.
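Applied to the reproduction script above, that suggestion amounts to something like the following (a sketch; the directory name is arbitrary, and recent versions also accept an explicit save_format="tf" argument):

import numpy as np, tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
path = "/tmp/model"  # no .hdf5 extension, so the TF SavedModel format is used

with strategy.scope():
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer=tf.keras.optimizers.SGD(), loss="mse")
    model.fit(np.array([[1.0]]), np.array([[1.0]]))
    # Saving to a plain directory path writes a SavedModel, which the
    # maintainers say can be restored inside a distribution strategy scope.
    tf.keras.models.save_model(model, path)
    restored = tf.keras.models.load_model(path)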