tensorflow: Loading model with tf.keras.models.load_model not working on multi GPU
System information
- Custom code
- TensorFlow version 2.1.0
- Python version: 3.7
- GPU model: 4 V100 GPUs on Kubernetes Engine
Describe the current behavior: on multi GPU, loading the model from an .h5 file fails.
Describe the expected behavior: saving and reloading the model from an .h5 file using model.save and keras.models.load_model should work on both single and multi GPU.
Code to reproduce the issue
import contextlib
import os

import numpy as np
import tensorflow as tf
import tensorflow.keras as keras


def get_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(10, activation=tf.nn.softmax)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy')
    return model


def get_model_path():
    model_dir = '/tmp/m' + str(np.random.randint(0, 1000000))
    os.makedirs(model_dir)
    model_path = os.path.join(model_dir, 'model')
    return model_path + '.h5'


def attempt_save_and_reload(model_path, distributed_training=False):
    fashion_mnist = keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    train_images = train_images / 255.0
    test_images = test_images / 255.0
    with strategy.scope() if distributed_training else contextlib.nullcontext():
        model = get_model()
        model.fit(
            train_images,
            train_labels,
            epochs=1,
        )
        model.save(model_path)
        model = tf.keras.models.load_model(model_path)


if __name__ == '__main__':
    strategy = tf.distribute.MirroredStrategy()
    for distributed_training in [False, True]:
        print('distributed training: ', distributed_training)
        model_path = get_model_path()
        try:
            attempt_save_and_reload(model_path, distributed_training)
        except Exception as e:
            print('Exception raised: \n', e)
        print()
Other info / logs: I need to use .h5 files since saving the optimizer state does not work otherwise (see #33424). The logs I get are:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
distributed training: False
Train on 60000 samples
60000/60000 [==============================] - 3s 52us/sample - loss: 0.5991
distributed training: True
Train on 60000 samples
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
60000/60000 [==============================] - 9s 152us/sample - loss: 0.6016
Exception raised:
`handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 17 (6 by maintainers)
Loading and saving Keras models inside a distribution strategy scope in the TF SavedModel / Checkpoint format is supported and should work as expected. However, loading an .h5-format model inside a strategy scope is not yet supported.
However, as mentioned earlier in this thread (#33424), loading a SavedModel / Checkpoint doesn't restore the optimizer state. Until that issue is fixed, only the model weights will be carried over, and training won't resume exactly where you left off.
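As a stopgap, one pattern that sidesteps the unsupported path is to deserialize the .h5 file outside the strategy scope and then rebuild fresh (mirrored) variables inside the scope by copying the weights over. This is a sketch, not an official fix, and it shares the limitation described above: optimizer state is still lost.

```python
import tensorflow as tf

def reload_h5_under_strategy(h5_path, strategy):
    # Plain, non-distributed load: this works because we are outside the scope.
    restored = tf.keras.models.load_model(h5_path)
    with strategy.scope():
        # clone_model creates the same architecture with new variables,
        # which are mirrored because they are created inside the scope.
        model = tf.keras.models.clone_model(restored)
        # Copy the trained weights; optimizer state is not carried over.
        model.set_weights(restored.get_weights())
    return model
```

After this, the returned model can be compiled and trained inside the same scope, resuming from the saved weights (but with a fresh optimizer).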
@ckkuang sorry for the late reply. It was my fault. Basically I'm using https://github.com/qubvel/efficientnet and I have to import `load_model` from that package; otherwise, logically, I get:
ValueError: Unknown layer: FixedDropout
Davide
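For anyone hitting the same `Unknown layer` error without the package-specific import: the generic fix is to pass the custom layer class via `custom_objects`. The `FixedDropout` below is a minimal stand-in for the layer defined in the qubvel/efficientnet package (the real implementation differs); it only illustrates the loading pattern.

```python
import tensorflow as tf

# Stand-in for the custom layer; the real FixedDropout lives in the
# efficientnet package and has extra behavior.
class FixedDropout(tf.keras.layers.Dropout):
    pass

def load_with_custom_layers(h5_path):
    # custom_objects maps the layer name stored in the file to a Python
    # class, which load_model needs to deserialize non-builtin layers.
    return tf.keras.models.load_model(
        h5_path,
        custom_objects={'FixedDropout': FixedDropout},
    )
```

Without the `custom_objects` mapping (or an import that registers the class), deserialization fails with exactly the `ValueError: Unknown layer: FixedDropout` reported above.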