keras: [BUG] Save Model with multi_gpu
Hey guys,
the new multi_gpu feature seems to have a bug. If you try to save the model, you get an error like the one below. To reproduce, just run the test multi_gpu_test_simple_model() with parallel_model.save("logs/model.h5") added at the end.
```python
import numpy as np
import keras
from keras.utils import multi_gpu_model
# on older Keras versions: from keras.utils.training_utils import multi_gpu_model


def multi_gpu_test_simple_model():
    print('####### test simple model')
    num_samples = 1000
    input_dim = 10
    output_dim = 1
    hidden_dim = 10
    gpus = 4
    epochs = 2

    model = keras.models.Sequential()
    model.add(keras.layers.Dense(hidden_dim, input_shape=(input_dim,)))
    model.add(keras.layers.Dense(output_dim))

    x = np.random.random((num_samples, input_dim))
    y = np.random.random((num_samples, output_dim))

    parallel_model = multi_gpu_model(model, gpus=gpus)
    parallel_model.compile(loss='mse', optimizer='rmsprop')
    parallel_model.fit(x, y, epochs=epochs)
    parallel_model.save("logs/model.h5")  # this call raises the error below


multi_gpu_test_simple_model()
```
```
Epoch 1/2
1000/1000 [==============================] - ETA: 0s - loss: 0.4537
Epoch 2/2
1000/1000 [==============================] - ETA: 0s - loss: 0.2939
Traceback (most recent call last):
  File "steps_kt/test.py", line 43, in <module>
    multi_gpu_test_simple_model()
  File "steps_kt/test.py", line 40, in multi_gpu_test_simple_model
    parallel_model.save("logs/model.h5")
  File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/engine/topology.py", line 2555, in save
  File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/models.py", line 107, in save_model
  File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/engine/topology.py", line 2396, in get_config
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 215, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/cluster/tools/python3/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle module objects
```
Please make sure that the boxes below are checked before you submit your issue. If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.
Thank you!
- [x] Check that you are up-to-date with the master branch of Keras. You can update with: `pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps`
- [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.
About this issue
- State: closed
- Created 7 years ago
- Reactions: 20
- Comments: 59
I just faced the same issue here. In https://keras.io/utils/#multi_gpu_model it is clearly stated that the model can be used like a normal model, yet it cannot be saved. Very funny. I can't even resume training, just because I cannot save the model previously trained with multiple GPUs. And if I train with a single GPU instead, the rest of my invested GPUs become useless. Please urge the developers to look into this bug ASAP.
Well, none of the answers above help at all.

Take a look at the answer from fchollet: https://github.com/fchollet/keras/issues/8446

He said: "For now we recommend saving the original (template) model instead of the parallel model. I.e. call save on the model you passed to multi_gpu_model, not the model returned by it. Both models share the same weights."

This is my example code (see the sketch below). Please note that `model` is the template model and `gpu_model` is the parallel model returned by `multi_gpu_model`; they are different.
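A minimal sketch of that pattern (shapes and names are illustrative, not the original snippet):

```python
import numpy as np
import keras
from keras.utils import multi_gpu_model

# model -> template model; gpu_model -> parallel model returned by multi_gpu_model
model = keras.models.Sequential([
    keras.layers.Dense(10, input_shape=(10,)),
    keras.layers.Dense(1),
])

gpu_model = multi_gpu_model(model, gpus=4)
gpu_model.compile(loss='mse', optimizer='rmsprop')
gpu_model.fit(np.random.random((1000, 10)), np.random.random((1000, 1)), epochs=2)

# Save the template model, not the parallel one; both share the same weights.
model.save('model.h5')
```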
I've found the workaround. See this StackOverflow answer for details; it gives a patched version of `multi_gpu_model`. Plus, while loading the model, pass in the tensorflow object, as sketched below. When the bug is fixed in keras, you'll only need to import the right `multi_gpu_model`. So I guess this should work:
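A hedged sketch of the loading step (assuming the patched `multi_gpu_model` keeps a module-level `tf` reference; the filename is illustrative):

```python
import tensorflow as tf
from keras.models import load_model

# The saved parallel model contains Lambda layers that reference `tf`,
# so the module must be supplied as a custom object at load time.
model = load_model('parallel_model.h5', custom_objects={'tf': tf})
```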
I have a simplified callback for doing model checkpointing for multiple/single GPU models. Because this inherits from `ModelCheckpoint`, you can use it in place of the `ModelCheckpoint` callback during `fit`/`fit_generator`. Of course, if you have a custom model containing another model in the second-to-last layer, this method is not going to do what you want.
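The original class was not preserved; a minimal sketch of the idea, assuming a hypothetical `AltModelCheckpoint` that swaps in the template model before saving:

```python
from keras.callbacks import ModelCheckpoint


class AltModelCheckpoint(ModelCheckpoint):
    """Checkpoint `alternate_model` (e.g. the template model) instead of
    the model being fit (e.g. the parallel model)."""

    def __init__(self, filepath, alternate_model, **kwargs):
        self.alternate_model = alternate_model
        super(AltModelCheckpoint, self).__init__(filepath, **kwargs)

    def on_epoch_end(self, epoch, logs=None):
        model_before = self.model
        self.model = self.alternate_model  # save the template model instead
        super(AltModelCheckpoint, self).on_epoch_end(epoch, logs)
        self.model = model_before          # restore whatever Keras had set


# Usage (illustrative): pass the template model, fit the parallel one.
# parallel_model.fit(x, y, callbacks=[AltModelCheckpoint('model.h5', model)])
```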
@Weixing-Zhang's answer did not work for me. I was trying to save the weights with a callback function, and while it did save the weights during training, when I loaded them I got an error saying, essentially, that I was trying to load 1 weight into 34. Did I do something wrong? Probably. But a warning to everyone: if you load your weights by name, you won't see this error (if you have it), and the previous answers will look like they are working (with disastrous predictions). Anyway, the callback I use is a mix of various previous answers; it may be of use to someone.
@D3lt4lph4 the problem is with this line in the keras code, as already discussed above: the `import tensorflow as tf` inside `multi_gpu_model`. This creates a closure for the `get_slice` lambda function, which includes the number of gpus (that's ok) and the tensorflow module (not ok). Model save tries to serialize all layers, including the ones that call `get_slice`, and it fails exactly because `tf` is in the closure.
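A standalone illustration (not from the thread) of why a captured module breaks saving: `get_config` deep-copies layer configs, and `copy.deepcopy` cannot handle module objects:

```python
import copy
import math  # any module object stands in for `tf` here

# A config that transitively holds a module object cannot be deep-copied,
# because modules cannot be pickled.
config = {'lambda_closure': {'captured_module': math}}
try:
    copy.deepcopy(config)
except TypeError as e:
    print(e)  # e.g. "can't pickle module objects"
```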
My solution is to move the import out of `multi_gpu_model`, so that `tf` becomes a global object, though it is still needed for `get_slice` to work. This fixes the problem of saving, but in loading one has to provide `tf` explicitly. I'm sure the last part can be done by keras itself to make it look seamless.

Hi. Seems like I've found the solution for my case. Just compile the base model, then transfer the trained weights of the GPU model back to the base model itself; then it can be saved as usual and performs like the GPU model, voilà!
The saved model can be loaded and modified as usual. Hope it helps.
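A minimal sketch of that weight-transfer trick (the model definition is illustrative):

```python
import numpy as np
import keras
from keras.utils import multi_gpu_model

base_model = keras.models.Sequential([
    keras.layers.Dense(10, input_shape=(10,)),
    keras.layers.Dense(1),
])
gpu_model = multi_gpu_model(base_model, gpus=4)
gpu_model.compile(loss='mse', optimizer='rmsprop')
gpu_model.fit(np.random.random((100, 10)), np.random.random((100, 1)), epochs=1)

# Compile the base model, copy the trained weights back, and save as usual.
base_model.compile(loss='mse', optimizer='rmsprop')
base_model.set_weights(gpu_model.get_weights())
base_model.save('model.h5')  # loads later with keras.models.load_model('model.h5')
```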
I have the same problem when trying to `model.save('..')` a parallel_model. I also first discovered it while using the `ModelCheckpoint` callback to save results between epochs. However, if I use the option `ModelCheckpoint(..., save_weights_only=True)`, so that it uses `model.save_weights()`, it seems to work.
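A sketch of that save_weights_only route (the setup lines are illustrative):

```python
import numpy as np
import keras
from keras.callbacks import ModelCheckpoint
from keras.utils import multi_gpu_model

model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(10,))])
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='mse', optimizer='rmsprop')

# save_weights_only=True skips serializing the model config, which is
# where the "can't pickle module objects" error comes from.
checkpoint = ModelCheckpoint('weights.{epoch:02d}.h5', save_weights_only=True)
parallel_model.fit(np.random.random((100, 10)), np.random.random((100, 1)),
                   epochs=2, callbacks=[checkpoint])

# Restore by rebuilding the same parallel model and loading into it; the
# template model shares the weights, so it is updated as well.
parallel_model.load_weights('weights.02.h5')
```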
Hi, I have tried some (or all) of the solutions above; chances are that I used them wrong, but these were unpleasant experiences. So, will there be any OFFICIAL support for saving model checkpoints with ease? To me, this is simply necessary, and it makes no sense to leave this issue unsolved for such a long time.
@yunkchen I guess you need to compile first
Okay, so I think I know why @Weixing-Zhang's solution wasn't working with the checkpoints.

I did a little digging in the keras GitHub, and it seems that when the call to `fit_generator` is made, the model in the callback is set to the model making the call to `fit_generator`. So even if the correct model is set beforehand when creating the callback, it will be overwritten by the multi-GPU one.

So here is the modified version of the previous class:
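A hedged reconstruction of such a class, assuming a hypothetical `MultiGPUCheckpoint` that overrides `set_model`:

```python
from keras.callbacks import ModelCheckpoint


class MultiGPUCheckpoint(ModelCheckpoint):
    """Always checkpoint the template model, even though fit()/fit_generator()
    re-binds the callback to the (parallel) model being trained."""

    def __init__(self, filepath, base_model, **kwargs):
        self.base_model = base_model
        super(MultiGPUCheckpoint, self).__init__(filepath, **kwargs)

    def set_model(self, model):
        # Keras calls this with the parallel model at the start of training;
        # pin the callback to the template model instead.
        super(MultiGPUCheckpoint, self).set_model(self.base_model)
```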
I guess this is a bug in multi_gpu (and, from what I see of the keras code, not a nice one to fix).
It seems this issue is not solved in keras 2.1.2. Referring to ChristofHenkel's reply, the model can be saved during training, but the format of the saved model is not correct when it is loaded for inference.
@ParikhKadam I'm not on the keras team and I didn't promise the fix. Based on the latest activity here it has been fixed, but unfortunately I don't have any additional information. Please ask the team.
@maxim5 You said the issue would be solved, but I don't know whether it has been. Is it solved?
I am facing a similar issue with saving/loading multi_gpu_model here.