tensorflow: Restore from checkpoint loads optimizer incorrectly

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.6.0-rc2-32-g919f693420e 2.6.0
  • Python version: 3.8
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 11.4
  • GPU model and memory: 1080TI 11Gb

Describe the current behavior

  1. After restoring a checkpoint, the optimizer weights differ from the optimizer weights before the checkpoint was saved.
  2. assert_consumed reports that the checkpoint file has unresolved optimizer slots (variables):
Unresolved object in checkpoint (root).optimizer.iter: attributes {
  name: "VARIABLE_VALUE"
  full_name: "Adam/iter"
  checkpoint_key: "optimizer/iter/.ATTRIBUTES/VARIABLE_VALUE"
}

Describe the expected behavior

  1. The optimizer weights should be the same before saving and after loading.
  2. assert_consumed should not report any unresolved objects.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution (if contributing): -

Standalone code to reproduce the issue

A reproducible code example is presented in Colab.
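The Colab itself is not reproduced here; as a stand-in, here is a minimal sketch of the pattern that triggers the behavior on TF 2.6 (variable names, sizes, and the checkpoint path are illustrative):

```python
import tensorflow as tf

# Run one training step so Adam creates its slot variables and step counter.
w = tf.Variable([1.0, 2.0])
opt = tf.keras.optimizers.Adam(learning_rate=0.1)
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w * w)
opt.apply_gradients([(tape.gradient(loss, w), w)])

path = tf.train.Checkpoint(w=w, optimizer=opt).save("/tmp/repro_ckpt")

# Restore into a fresh variable/optimizer pair. The new optimizer has not
# created its slot variables yet, so their restoration is deferred, and
# assert_consumed() complains about unresolved objects such as optimizer/iter.
w2 = tf.Variable([0.0, 0.0])
opt2 = tf.keras.optimizers.Adam(learning_rate=0.1)
status = tf.train.Checkpoint(w=w2, optimizer=opt2).restore(path)
try:
    status.assert_consumed()
except AssertionError as err:
    print("unresolved objects in checkpoint:", err)
```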

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 12
  • Comments: 19 (9 by maintainers)

Most upvoted comments

Thanks for sharing your use case and the issues you encountered @RomanSteinberg!

Regarding the original question, I found there’s already an open internal feature request for adding an option to disable delayed restoration while loading a checkpoint, which is roughly equivalent to adding a try_consume_everything_now() function, so I’ve added a reference to this conversation on that feature request.

A follow up on your use case: it sounds like you have the source code for a model, and also a Checkpoint for it. Also from the example you gave at the start, I assume you are using Keras. Loading a Checkpoint with the model’s source code, then calling tf.keras.models.save_model() with the model should export the source code and the weights from the Checkpoint as a SavedModel (that can be used for inference). Was this what you tried but found something tricky?

net = load(ckpt_dir, False)  # change load to return net instead of weights
tf.keras.models.save_model(net, '/tmp/saved_model')

inference_model = tf.keras.models.load_model('/tmp/saved_model')

@pcish thank you! These are quite clear answers. I agree with both points (1-2).

My high-level goals are typical: train a model (experiments), pick out the trained model, and deploy it to a production service. Unfortunately, I use TF irregularly and can't follow every trend it introduces; I mean that TF changes its recommended usage patterns from time to time. A year or two ago I tried to use TFX, but it was almost impossible to use at that moment. The problem that led me to this issue was to take a Checkpoint and do two things: use it for inference in a production service, and use it later to continue training.

a) I tried to load the Checkpoint and convert it into a SavedModel for inference. It was quite tricky, so I stopped and decided to use the Checkpoint directly in the production service. So I need to load it and start inference, but that led to ridiculous code like model.predict(np.zeros(shape)) after loading. I tried to use model.build, but as far as I can see it was removed from the guides and is apparently not recommended anymore.
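For what it's worth, the dummy-predict warm-up can usually be replaced by building the model explicitly, so its variables exist before the restore; a sketch (layer sizes and input shape are made up):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
# Create the variables without pushing any data through the model, so a
# subsequent tf.train.Checkpoint(...).restore(path) can resolve them
# immediately instead of deferring restoration.
model.build(input_shape=(None, 4))
print([w.shape for w in model.weights])  # kernels and biases now exist
```

Note that build only creates the variables; it does not run a forward pass, so no garbage input is needed.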

b) On the other side, I'm trying to completely load the checkpoint before continuing training, and there are some problems with loading the optimizer weights (as I showed in this issue). I noticed the incompleteness and started to use assert_consumed and …

PS: A few thoughts of my own. TF usage has become less user friendly. It was easy in v1.0-1.14, and the decision to bring Keras in was the last good step toward simpler usage.

@tilakrayal I read that issue thread and found two workarounds, but not a solution:

  1. Use the predict method to warm up the model before loading weights. I don't think it is OK to predict garbage before loading weights; I'm sure TensorFlow is not designed to work this way.
  2. Use tf.Variable to specify the Adam parameters when constructing the optimizer. There is no documentation saying the Adam optimizer should be constructed this way; moreover, the default construction behavior is to pass floats, not tf.Variable. Again, I'm sure TensorFlow is not designed to work this way.
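For reference, here is workaround 2 as I understand it from that thread; the values shown are just the Adam defaults, and nothing in the documentation says this form is required:

```python
import tensorflow as tf

# Wrapping the hyperparameters in tf.Variable makes them checkpointable
# state, which reportedly lets Checkpoint.restore() resolve them eagerly.
opt = tf.keras.optimizers.Adam(
    learning_rate=tf.Variable(1e-3),
    beta_1=tf.Variable(0.9),
    beta_2=tf.Variable(0.999),
    epsilon=tf.Variable(1e-7),
)
```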

So please tell me if I missed something in that issue, or if my code contradicts some TF design, pattern, or usage guide. If not, please acknowledge that this is a bug. I'm ready to contribute a fix to TF, but I need guidance on how it should be fixed so as not to break the design of any TF module.