keras: [BUG] Save Model with multi_gpu

Hey guys,

the new multi_gpu_model feature seems to have a bug: if you try to save the parallel model, you get an error like the one below. To reproduce, run the test multi_gpu_test_simple_model() below with parallel_model.save("logs/model.h5") added at the end.

def multi_gpu_test_simple_model():
    print('####### test simple model')
    num_samples = 1000
    input_dim = 10
    output_dim = 1
    hidden_dim = 10
    gpus = 4
    epochs = 2
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(hidden_dim,
                                 input_shape=(input_dim,)))
    model.add(keras.layers.Dense(output_dim))

    x = np.random.random((num_samples, input_dim))
    y = np.random.random((num_samples, output_dim))
    parallel_model = multi_gpu_model(model, gpus=gpus)

    parallel_model.compile(loss='mse', optimizer='rmsprop')
    parallel_model.fit(x, y, epochs=epochs)

    parallel_model.save("logs/model.h5")


multi_gpu_test_simple_model()

1000/1000 [==============================] - ETA: 0s - loss: 0.4537
Epoch 2/2
1000/1000 [==============================] - ETA: 0s - loss: 0.2939
Traceback (most recent call last):
  File "steps_kt/test.py", line 43, in <module>
    multi_gpu_test_simple_model()
  File "steps_kt/test.py", line 40, in multi_gpu_test_simple_model
    parallel_model.save("logs/model.h5")
  File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/engine/topology.py", line 2555, in save
  File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/models.py", line 107, in save_model
  File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/engine/topology.py", line 2396, in get_config
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 215, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle module objects

Please make sure that the boxes below are checked before you submit your issue. If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.

Thank you!

  • [x] Check that you are up-to-date with the master branch of Keras. You can update with: pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

  • [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 20
  • Comments: 59

Most upvoted comments

I just faced the same issue here. https://keras.io/utils/#multi_gpu_model clearly states that the parallel model can be used like a normal model, yet it cannot be saved. I can't even resume training, simply because I cannot save the model previously trained on multiple GPUs; and if I train on a single GPU instead, the rest of the GPUs I invested in sit idle. Please urge the developers to look into this bug ASAP.

Well, none of the answers above helps at all.

Take a look at the answer from fchollet in https://github.com/fchollet/keras/issues/8446.

He said: "For now we recommend saving the original (template) model instead of the parallel model. I.e. call save on the model you passed to multi_gpu_model, not the model returned by it. Both models share the same weights."

This is my example code. Please note that model is the template model and gpu_model is the multi_gpu_model; they are different.

# ------------- model ----------------------------
model = Sequential()
model.add(Conv2D(32, (5, 5), padding='same',
                 input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(Conv2D(32, (5, 5)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

# ------------------- pass the template model to gpu ---------------------
if ngpus > 1:
    gpu_model = multi_gpu_model(model, ngpus)
else:
    gpu_model = model  # fall back to the template model on a single GPU

gpu_model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

datagen.fit(x_train)

# Fit the model on the batches generated by datagen.flow().
gpu_model.fit_generator(datagen.flow(x_train, y_train,
                                 batch_size=batch_size),
                    steps_per_epoch=int(np.ceil(x_train.shape[0] / float(batch_size))),
                    epochs=nb_epoch,
                    validation_data=(x_test, y_test),
                    workers=2)

score = gpu_model.evaluate(x_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

# ------------ save the template model rather than the gpu_mode ----------------
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

# -------------- load the saved model --------------
from keras.models import model_from_json

# load json and create model
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")

# evaluate loaded model on test data
loaded_model.compile(loss='categorical_crossentropy',
                     optimizer='adadelta',
                     metrics=['accuracy'])
score = loaded_model.evaluate(x_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

I've found a workaround; see this StackOverflow answer for details. Here is the code for multi_gpu_model:

from keras.layers import Lambda, concatenate
from keras import Model

import tensorflow as tf

def multi_gpu_model(model, gpus):
  if isinstance(gpus, (list, tuple)):
    num_gpus = len(gpus)
    target_gpu_ids = gpus
  else:
    num_gpus = gpus
    target_gpu_ids = range(num_gpus)

  def get_slice(data, i, parts):
    shape = tf.shape(data)
    batch_size = shape[:1]
    input_shape = shape[1:]
    step = batch_size // parts
    if i == num_gpus - 1:
      size = batch_size - step * i
    else:
      size = step
    size = tf.concat([size, input_shape], axis=0)
    stride = tf.concat([step, input_shape * 0], axis=0)
    start = stride * i
    return tf.slice(data, start, size)

  all_outputs = []
  for i in range(len(model.outputs)):
    all_outputs.append([])

  # Place a copy of the model on each GPU,
  # each getting a slice of the inputs.
  for i, gpu_id in enumerate(target_gpu_ids):
    with tf.device('/gpu:%d' % gpu_id):
      with tf.name_scope('replica_%d' % gpu_id):
        inputs = []
        # Retrieve a slice of the input.
        for x in model.inputs:
          input_shape = tuple(x.get_shape().as_list())[1:]
          slice_i = Lambda(get_slice,
                           output_shape=input_shape,
                           arguments={'i': i,
                                      'parts': num_gpus})(x)
          inputs.append(slice_i)

        # Apply model on slice
        # (creating a model replica on the target device).
        outputs = model(inputs)
        if not isinstance(outputs, list):
          outputs = [outputs]

        # Save the outputs for merging back together later.
        for o in range(len(outputs)):
          all_outputs[o].append(outputs[o])

  # Merge outputs on CPU.
  with tf.device('/cpu:0'):
    merged = []
    for name, outputs in zip(model.output_names, all_outputs):
      merged.append(concatenate(outputs,
                                axis=0, name=name))
    return Model(model.inputs, merged)

Plus, while loading the model, pass in the tensorflow object, like this:

model = load_model('multi_gpu_model.h5', {'tf': tf})

When the bug is fixed in keras, you’ll only need to import the right multi_gpu_model.

So I guess this should work:

modelGPU.__setattr__('callback_model', modelCPU)
# now we can train as normal and the weight saving in our callbacks will be done by the CPU model
modelGPU.fit_generator(...)

I have a simplified callback that does model checkpointing for multi-GPU/single-GPU models.

from keras.models import Model
from keras.callbacks import ModelCheckpoint

class MultiGPUCheckpoint(ModelCheckpoint):
    
    def set_model(self, model):
        if isinstance(model.layers[-2], Model):
            self.model = model.layers[-2]
        else:
            self.model = model

Because this inherits from ModelCheckpoint, you can use it in place of the ModelCheckpoint callback during fit/fit_generator.

model.fit(X_train, y_train,
    callbacks=[
        MultiGPUCheckpoint('model.h5', save_best_only=True)
    ]
)

Of course if you have a custom model containing another model in the second to last layer, this method is not going to do what you want.

@Weixing-Zhang's answer did not work for me. I was trying to save the weights with a callback function, and while it did save the weights during training, when I loaded them I got an error saying, essentially, that I was trying to load 1 weight into 34. Did I do something wrong? Probably. But a warning to everyone: if you load your weights by name, you won't see the error I had (if you have it), and the previous answers will look like they are working (with disastrous predictions). Anyway, here is the code I use for the callback (a mix of various previous answers); it may be of use to someone:

from keras.callbacks import Callback

class CustomModelCheckpoint(Callback):

    def __init__(self, model_parallel, path):

        super().__init__()

        self.model = model_parallel
        self.path = path

        # Put your model here
        self.model_for_saving = SSD(num_classes=(NUM_CLASSES), weights='../data/weights/weights_300x300_old.hdf5')

    def on_epoch_end(self, epoch, logs=None):

        loss = logs['val_loss']
        self.model_for_saving.set_weights(self.model.get_weights())

        print("\nSaving model to : {}".format(self.path.format(epoch=epoch, val_loss=loss)))
        self.model_for_saving.save_weights(self.path.format(epoch=epoch, val_loss=loss), overwrite=True)

# Setting the callback functions
checkpointsString = "path/to/save/" + 'weights.{epoch:02d}-{val_loss:.2f}.hdf5'
callbacks = [CustomModelCheckpoint(model_parallel, checkpointsString), keras.callbacks.LearningRateScheduler(schedule)]

history = model.fit_generator(...,
                              callbacks=callbacks,
                              ...)

@D3lt4lph4 the problem is with this line in the keras code, as already discussed above:

def multi_gpu_model(model, gpus):
  ...
  import tensorflow as tf
  ...

This creates a closure for the get_slice lambda function, which captures the number of GPUs (that's OK) and the tensorflow module (not OK). Model saving tries to serialize all layers, including the ones that call get_slice, and fails precisely because tf is in the closure.

My solution is to move the import out of multi_gpu_model, so that tf becomes a global object (it is still needed for get_slice to work). This fixes saving, but when loading one has to provide tf explicitly. I'm sure that last part could be handled by Keras itself to make it look seamless.
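
To see why saving chokes on this, here is a minimal standalone sketch (pure Python, no Keras involved) that reproduces the same reductor(4) failure from the traceback above; math is just a stand-in for the tensorflow module captured in the closure:

import copy
import math  # stand-in for the tensorflow module captured in the closure

# get_config()/save() deep-copies the layer configs; any module object
# reachable from them triggers the same TypeError as in the traceback:
try:
    copy.deepcopy({'arguments': (math,)})
except TypeError as e:
    print(e)  # e.g. "can't pickle module objects" (wording varies by Python version)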

Hi. It seems I've found the solution for my case. Just compile the base model, then transfer the trained weights of the GPU model back to the base model; it can then be saved as usual and performs like the GPU model, voilà!

autoencoder.compile()  # since the GPU model is compiled, now only compile the base model
output = autoencoder.predict(img)  # the output will be a mess, since only the GPU model is trained, not the base model
output = parallel_autoencoder.predict(img)  # the output is a clear image from the well-trained GPU model
autoencoder.set_weights(parallel_autoencoder.get_weights())  # transfer the trained weights from the GPU model to the base model
output = autoencoder.predict(img)  # perform the prediction again; the result is similar to the GPU model's
autoencoder.save('CAE.h5')  # now the model can be saved with the weights transferred from the GPU model

The saved model can be loaded and modified as usual. Hope it helps.
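
For completeness, loading it back should then be a standard call (a small sketch; no custom objects should be needed, since what was saved is the plain base model):

from keras.models import load_model

autoencoder = load_model('CAE.h5')  # plain single-GPU model with the transferred weights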

I have the same problem when trying to call model.save('..') on a parallel_model. I also first discovered it while using the ModelCheckpoint callback to save results between epochs. However, with the option ModelCheckpoint(..., save_weights_only=True), which makes the callback use model.save_weights(), it seems to work.
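
For reference, a minimal sketch of that option, reusing the names from the repro script at the top of this issue (the filepath template is just an example):

from keras.callbacks import ModelCheckpoint

# save_weights_only=True makes the callback call save_weights() instead of
# save(), which avoids the get_config()/deepcopy path that raises the error
checkpoint = ModelCheckpoint('weights.{epoch:02d}.hdf5',
                             save_weights_only=True)

parallel_model.fit(x, y, epochs=epochs, callbacks=[checkpoint])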

Hi, I have tried some (or all) of the solutions above; chances are that I used them wrong, but they were unpleasant experiences. So, will there be any OFFICIAL support for saving model checkpoints with ease? To me, this is simply necessary, and it makes no sense to leave this issue unsolved for such a long time.

@yunkchen I guess you need to compile first:

import tensorflow as tf
from keras.metrics import binary_accuracy
from keras.optimizers import Adam
from keras.utils import multi_gpu_model

with tf.device("/cpu:0"):
    model = build_model()  # your own model-building function

gpu_model = multi_gpu_model(model, gpus=2)
gpu_model.compile(optimizer=Adam(lr=0.0001, clipnorm=0.5), loss=binary_accuracy)
gpu_model.__setattr__('callback_model', model)  # callbacks will save the template model

Okay, so I think I know why @Weixing-Zhang's solution wasn't working with the checkpoints.

I did a little digging in the Keras GitHub repo, and it seems that when the call to fit_generator is made, the callback's model attribute is set to the model making the call to fit_generator. So even if the correct model is set beforehand when creating the callback, it will be overwritten by the multi-GPU one.

So here is the modified version of the previous class:


from keras.callbacks import Callback

class CustomModelCheckpoint(Callback):

    def __init__(self, model, path):

        super().__init__()

        # self.model is the attribute that fit_generator will overwrite
        # via set_model(), so we must not rely on it:
        # self.model = model
        self.path = path

        # We keep the template (non-multi-GPU) model under another name
        self.model_for_saving = model

    def on_epoch_end(self, epoch, logs=None):

        loss = logs['val_loss']
        # Here we save the original one
        print("\nSaving model to : {}".format(self.path.format(epoch=epoch, val_loss=loss)))
        self.model_for_saving.save_weights(self.path.format(epoch=epoch, val_loss=loss), overwrite=True)

# Setting the callback functions
checkpointsString = args.checkpoints + 'weights.{epoch:02d}-{val_loss:.2f}.hdf5'
callbacks = [CustomModelCheckpoint(model, checkpointsString), keras.callbacks.LearningRateScheduler(schedule)]

gpu_model.compile(...)

# fit_generator will call the callback's set_model() and overwrite self.model,
# but since we do not use self.model for saving, all good
history = gpu_model.fit_generator(...) 

I guess this is a bug in multi_gpu (and not a nice one to fix, from what I can see of the Keras code).

It seems that this issue is not solved in Keras 2.1.2. Referring to ChristofHenkel's reply: the model can be saved during training, but the format of the saved model is not correct when it is loaded for inference.

@ParikhKadam I'm not on the Keras team and I didn't promise a fix. Based on the latest activity here it has been fixed, but unfortunately I don't have any additional information. Please ask the team.

@maxim5 You said that the issue would be solved, but I don't know whether it has been. Is it solved?

I am facing a similar issue with saving/loading a multi_gpu_model here.