tensorflow: Error when saving weights in h5 format for layer with nested layers
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.4
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.2.0
- Python version: 3.7
Describe the current behavior
Given a layer with nested layers, saving its weights in h5 format fails with RuntimeError: Unable to create link (name already exists).
Standalone code to reproduce the issue
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
class NestedLayers(tf.keras.layers.Layer):
    def __init__(self):
        super(NestedLayers, self).__init__()
        self.units = [layers.Conv2D(16, (3,3), name="conv_2d_0"),
                      layers.Conv2D(16, (3,3), name="conv_2d_1"),
                      layers.Conv2D(16, (3,3), name="conv_2d_2")]

    def build(self, input_shape):
        for i in range(0,2):
            unit_input_shape = list(input_shape)
            unit_input_shape[-1] = 1
            unit = self.units[i]
            unit.build(unit_input_shape)

    def call(self, inputs):
        split_inputs = tf.split(value=inputs,
                                num_or_size_splits=3,
                                axis=-1,
                                name="conv_grp_split")
        outputs = []
        for i in range(0,2):
            out = self.units[i](split_inputs[i])
            outputs.append(out)
        out = tf.keras.layers.concatenate(outputs, axis=-1, name="conv_grp_concat")
        return out
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
model = models.Sequential()
model.add(layers.InputLayer(input_shape=(32, 32, 3)))
model.add(NestedLayers())
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax'))
model.summary()
import datetime
import os  # needed for os.path.join below
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
check_pt = tf.keras.callbacks.ModelCheckpoint(
    os.path.join(log_dir, "model.ckpt.{epoch:04d}-{val_loss:.06f}.hdf5"),
    monitor='val_loss',
    verbose=1,
    save_best_only=False,
    save_weights_only=True,
    mode='max',
    period=1)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=2,
                    validation_data=(test_images, test_labels),
                    callbacks=[check_pt])
Fails with the following:
Epoch 00001: saving model to logs/fit/20200702-114230/model.ckpt.0001-1.949398.hdf5
Traceback (most recent call last):
  File "/nfs/site/home/tkrimer/work/mbe/dlo/src/internal_utils/train_cifar10_tf_2/nested_layers_save.py", line 78, in <module>
    callbacks = [tensorboard_callback, check_pt])
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 876, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 365, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 1177, in on_epoch_end
    self._save_model(epoch=epoch, logs=logs)
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 1223, in _save_model
    self.model.save_weights(filepath, overwrite=True)
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/tensorflow/python/keras/engine/network.py", line 1151, in save_weights
    hdf5_format.save_weights_to_hdf5_group(f, self.layers)
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 639, in save_weights_to_hdf5_group
    param_dset = g.create_dataset(name, val.shape, dtype=val.dtype)
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/h5py/_hl/group.py", line 139, in create_dataset
    self[name] = dset
  File "/localdrive/users/tkrimer/venv/tf_2.1/lib/python3.7/site-packages/h5py/_hl/group.py", line 373, in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)
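The traceback ends in h5py refusing to create a dataset whose name already exists in the layer's group, which suggests that two weights of the same layer share a name. A minimal sketch for spotting such collisions before saving, assuming the weight names are available as plain strings (the helper `find_duplicate_names` is hypothetical and not part of TensorFlow or h5py):

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in sorted order."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Illustrative values only (not captured from a real run): names as they
# might look when several sublayers are built without a name scope.
example = ["kernel:0", "bias:0", "kernel:0", "bias:0"]
print(find_duplicate_names(example))
```

On a built model, something like `find_duplicate_names(w.name for w in layer.weights)` for each `layer` in `model.layers` should list the clashing names before `save_weights` is called.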
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 28 (4 by maintainers)
Commits related to this issue
- Implemented workaround for TF issue https://github.com/tensorflow/tensorflow/issues/41021#issuecomment-786715361 — committed to fantauzzi/chest_x-ray by fantauzzi 3 years ago
Similar bug still exists in TensorFlow 2.4.0.
I found a workaround: forcing the internal layers' weights to have unique names by building each internal layer inside a name scope. Check this out:
Also, here is a gist
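A minimal sketch of the name-scope workaround described above, assuming TensorFlow 2.x with its bundled Keras. The apparent cause is that calling `unit.build()` directly skips the name scope that calling the layer would open, so every sublayer's weights end up with the same bare names. Both that explanation and the use of `tf.name_scope(unit.name)` are assumptions here, not the commenter's original snippet. Unlike the code in the report, this version builds and calls all three units.

```python
import tensorflow as tf
from tensorflow.keras import layers


class NestedLayers(tf.keras.layers.Layer):
    """Variant of the reported layer that builds each unit in its own scope."""

    def __init__(self):
        super().__init__()
        self.units = [layers.Conv2D(16, (3, 3), name=f"conv_2d_{i}")
                      for i in range(3)]

    def build(self, input_shape):
        unit_input_shape = list(input_shape)
        unit_input_shape[-1] = 1  # each unit sees one channel after the split
        for unit in self.units:
            # Assumption: opening a scope by hand gives the unit's weights
            # a unique prefix, since build() is invoked directly here.
            with tf.name_scope(unit.name):
                unit.build(unit_input_shape)

    def call(self, inputs):
        split_inputs = tf.split(inputs, num_or_size_splits=3, axis=-1)
        outputs = [unit(x) for unit, x in zip(self.units, split_inputs)]
        return tf.concat(outputs, axis=-1)


layer = NestedLayers()
_ = layer(tf.zeros((1, 32, 32, 3)))  # triggers build()
print([w.name for w in layer.weights])
```

If the printed names all differ, the h5 writer should no longer hit the duplicate-link error for this layer.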
The same error can be obtained this way:
fit() on the model
fit() on the second model, the one instantiated at point 3)
What happens is that after step 4), the Adam optimizer instance has in its weights attribute a list of 9 TF variables, instead of 5. Four of the variables have been duplicated, each keeping the same name as its original. When trying to save the model in .h5 format, the tf.keras.models.save_model() function finds two variables with the duplicated name Adam/dense/kernel/m:0, and stops with an error.
In this case, the error can be resolved by ensuring the same instance of the Adam optimizer is not used to compile more than one model.
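The fix described above can be sketched as follows, assuming TensorFlow 2.x. The helpers `make_model` and `make_optimizer` and the tiny model are hypothetical and only illustrate giving each model its own Adam instance:

```python
import tensorflow as tf
from tensorflow.keras import layers


def make_model():
    # Hypothetical tiny model, only here to have something to compile.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        layers.Dense(2, name="dense"),
    ])


def make_optimizer():
    # A new Adam instance per call, so optimizer state is never shared
    # between two compiled models.
    return tf.keras.optimizers.Adam()


model_a = make_model()
model_a.compile(optimizer=make_optimizer(), loss="mse")

model_b = make_model()
model_b.compile(optimizer=make_optimizer(), loss="mse")
```

Passing the string `'adam'` to `compile()`, as the original report does, should have the same effect, since each call then builds its own optimizer.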
I also had this same issue when creating a model from an existing one the following way:
Adding a name to the new Conv2D layer also solved it.
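A minimal sketch of this kind of fix, assuming the new model is built with the functional API on top of an existing model's output. The models, shapes, and the names `base_conv` and `extra_conv` are hypothetical:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical existing model (stands in for whatever model is being extended).
inputs = tf.keras.Input(shape=(32, 32, 3))
features = layers.Conv2D(8, (3, 3), name="base_conv")(inputs)
base_model = tf.keras.Model(inputs, features, name="base")

# New layer stacked on the existing model's output, given an explicit name
# so it cannot pick up an auto-generated name that is already in use.
new_output = layers.Conv2D(16, (3, 3), name="extra_conv")(base_model.output)
new_model = tf.keras.Model(base_model.input, new_output, name="extended")

print([layer.name for layer in new_model.layers])
```

Checking that the printed layer names are all distinct before saving is a cheap way to confirm the new layer does not clash with an existing one.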