tensorflow: Loading model with tf.keras.models.load_model not working on multi GPU
System information
- Custom code
- TensorFlow version 2.1.0
- Python version: 3.7
- GPU model: 4 V100 GPUs on Kubernetes Engine
Describe the current behavior: on multi GPU, loading the model from an .h5 file fails.
Describe the expected behavior: saving and reloading the model from an .h5 file using model.save and keras.models.load_model should work on both single and multi GPU.
Code to reproduce the issue
import contextlib
import os

import numpy as np
import tensorflow as tf
import tensorflow.keras as keras


def get_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(10, activation=tf.nn.softmax)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy')
    return model


def get_model_path():
    model_dir = '/tmp/m' + str(np.random.randint(0, 1000000))
    os.makedirs(model_dir)
    model_path = os.path.join(model_dir, 'model')
    return model_path + '.h5'


def attempt_save_and_reload(model_path, distributed_training=False):
    fashion_mnist = keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    train_images = train_images / 255.0
    test_images = test_images / 255.0
    with strategy.scope() if distributed_training else contextlib.nullcontext():
        model = get_model()
        model.fit(
            train_images,
            train_labels,
            epochs=1,
        )
        model.save(model_path)
        model = tf.keras.models.load_model(model_path)


if __name__ == '__main__':
    strategy = tf.distribute.MirroredStrategy()
    for distributed_training in [False, True]:
        print('distributed training: ', distributed_training)
        model_path = get_model_path()
        try:
            attempt_save_and_reload(model_path, distributed_training)
        except Exception as e:
            print('Exception raised: \n', e)
        print()
Other info / logs: I need to use .h5 files since saving the optimizer state does not work otherwise (see #33424). The logs I get are:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
distributed training: False
Train on 60000 samples
60000/60000 [==============================] - 3s 52us/sample - loss: 0.5991
distributed training: True
Train on 60000 samples
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
60000/60000 [==============================] - 9s 152us/sample - loss: 0.6016
Exception raised:
`handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 17 (6 by maintainers)
Loading and saving Keras models inside a distribution strategy scope in the TF SavedModel / Checkpoint format is supported and should work as expected. However, loading an .h5-format model inside a strategy scope is not yet supported.
However, as mentioned earlier in this thread (#33424), loading a SavedModel / Checkpoint doesn't restore the optimizer state. Until that issue is fixed, only the model weights will be carried over, and training won't resume exactly where you left off.
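As a stopgap, one pattern that sidesteps the unsupported path is to deserialize the .h5 file outside the strategy scope and then rebuild fresh (mirrored) variables inside the scope by copying the weights over. This is a sketch, not an official fix, and it shares the limitation described above: optimizer state is still lost.

```python
import tensorflow as tf

def reload_h5_under_strategy(h5_path, strategy):
    # Plain, non-distributed load: this works because we are outside the scope.
    restored = tf.keras.models.load_model(h5_path)
    with strategy.scope():
        # clone_model creates the same architecture with new variables,
        # which are mirrored because they are created inside the scope.
        model = tf.keras.models.clone_model(restored)
        # Copy the trained weights; optimizer state is not carried over.
        model.set_weights(restored.get_weights())
    return model
```

After this, the returned model can be compiled and trained inside the same scope, resuming from the saved weights (but with a fresh optimizer).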
@ckkuang sorry for the late reply. It was my fault. Basically I'm using https://github.com/qubvel/efficientnet and I have to import `load_model` from that package; otherwise, logically, I get:
ValueError: Unknown layer: FixedDropout
Davide
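For anyone hitting the same `Unknown layer` error without the package-specific import: the generic fix is to pass the custom layer class via `custom_objects`. The `FixedDropout` below is a minimal stand-in for the layer defined in the qubvel/efficientnet package (the real implementation differs); it only illustrates the loading pattern.

```python
import tensorflow as tf

# Stand-in for the custom layer; the real FixedDropout lives in the
# efficientnet package and has extra behavior.
class FixedDropout(tf.keras.layers.Dropout):
    pass

def load_with_custom_layers(h5_path):
    # custom_objects maps the layer name stored in the file to a Python
    # class, which load_model needs to deserialize non-builtin layers.
    return tf.keras.models.load_model(
        h5_path,
        custom_objects={'FixedDropout': FixedDropout},
    )
```

Without the `custom_objects` mapping (or an import that registers the class), deserialization fails with exactly the `ValueError: Unknown layer: FixedDropout` reported above.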