keras: model.save() threw an OSError: Unable to create file (error message = 'resource temporarily unavailable')
Well, I built a Keras model, and since my dataset is sometimes too large to fit into memory, a MemoryError was thrown. So I searched around and figured out that I need to implement a generator class inheriting from keras.utils.Sequence, so that I can use model.fit_generator and model.predict_generator.
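For illustration only, a minimal sketch of such a Sequence subclass, assuming the data sits in two NumPy arrays x and y (the real class would load batches from disk instead):

# Illustrative only: a bare-bones keras.utils.Sequence that serves (x, y) batches.
import numpy as np
from keras.utils import Sequence

class BatchSequence(Sequence):
    def __init__(self, x, y, batch_size=32):
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Return one batch; only this slice needs to be in memory at a time.
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.x[lo:hi], self.y[lo:hi]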
And I have several callbacks in my fit_generator call, including ModelCheckpoint to save my model as a .hdf5 file, and I pass use_multiprocessing=True, workers=16 to fit_generator. Here is a snippet of this call:
# `monitor` is the list of callbacks, including the ModelCheckpoint instance.
self.model.fit_generator(generator=training_generator,
                         validation_data=validation_generator,
                         epochs=epoch,
                         use_multiprocessing=True,
                         workers=8,
                         callbacks=monitor,
                         verbose=2)
The error message is attached below:
Epoch 00005: val_loss did not improve
Epoch 6/15
- 135s - loss: 52.3622 - val_loss: 74.5698
Epoch 00006: val_loss improved from 74.99819 to 74.56982, saving model to models/002008.SZ/model.hdf5
Epoch 7/15
- 135s - loss: 52.3163 - val_loss: 74.2776
Epoch 00007: val_loss improved from 74.56982 to 74.27758, saving model to models/002008.SZ/model.hdf5
Traceback (most recent call last):
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1664, in <module>
main()
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/root/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/data/CNN_dlw_v0/CNN_dlw/main.py", line 100, in <module>
main()
File "/data/CNN_dlw_v0/CNN_dlw/main.py", line 72, in main
model.train(epoch=15)
File "/data/CNN_dlw_v0/CNN_dlw/Model.py", line 383, in train
verbose=2)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/engine/training.py", line 2280, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/callbacks.py", line 447, in on_epoch_end
self.model.save(filepath, overwrite=True)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/engine/topology.py", line 2576, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/models.py", line 106, in save_model
with h5py.File(filepath, mode='w') as f:
File "/data/SkyCompute/lib/python3.6/site-packages/h5py/_hl/files.py", line 271, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/data/SkyCompute/lib/python3.6/site-packages/h5py/_hl/files.py", line 107, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (Unable to lock file, errno = 11, error message = 'resource temporarily unavailable')
^C
Process finished with exit code 1
Notice that sometimes the model is saved successfully, but it seems to be probabilistic: the more save actions there are, the more likely this OSError becomes. I think it could be that, because I set use_multiprocessing=True and workers=16, one process is trying to save the file while another process is still accessing the same file? I'm not quite sure what exactly happens here. I think Keras should have some internal control that prevents this.
Edit: When I set use_multiprocessing=False, this OSError stops popping up; however, without multiprocessing the training procedure is much slower now. One not-so-elegant solution I can think of is to drop ModelCheckpoint and only save the model after training, so that I can still use multiprocessing during training, but that way I will not be able to save the model that corresponds to the smallest val_loss. :<
Edit2: I checked that my keras version is 2.1.4
About this issue
- State: closed
- Created 6 years ago
- Reactions: 29
- Comments: 75 (6 by maintainers)
Commits related to this issue
- Classification will not resume training if it fails between epochs. This is because of the issue https://github.com/keras-team/keras/issues/11101 — committed to mattyws/multiclassification by deleted user 5 years ago
Hi,
I have just solved this issue by uninstalling h5py 2.8.0 and reinstalling h5py 2.7.1. It now works with Keras 2.2.4 and tensorflow-gpu 1.12.0. Please cross-check it for other Keras and TensorFlow versions.
In my case, I encountered this issue when moving my Python scripts to another machine. And indeed, as I had blindly reinstalled all my Python libraries without paying attention to the individual versions, this error appeared when checkpointing my models:
OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
As it wasn't crashing on the previous machine, I began trying all the tips in this thread, but without any success, until I found that the only difference from my previous setup was the h5py library's version.
TIP: if it still crashes after downgrading to h5py 2.7.1, delete the previous *.hdf5 files that you wish to overwrite and it should work.
We're experiencing the same issue using the "model_checkpoint" callback; we also have use_multiprocessing=True. It happens after the second epoch.
Yeah, this is super frustrating. I've tried all of the above (threads, workers, unique names, saving just once…), and all of it fails on my first attempt to save weights (but only to a specific mount).
Having to implement a different checkpoint class to work around this means it is properly broken. Dear Keras devs, I know this one might be hard to reproduce (I think NAS mounts may be a commonality, a race condition perhaps), but please believe us: this is a real problem.
I hope you find a good solution.
I noticed that the issue disappears when I set:
If you want to keep the standard behaviour of ModelCheckpoint, use this patched version. It simply catches any error during saving and keeps retrying to save the model every 5 seconds.
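The patched file itself is not reproduced in this thread; a stripped-down sketch of the same idea (save-best-only on val_loss, retrying every 5 seconds when the save raises an OSError; the class name and max_retries limit are made up here) could look like this:

import time
import numpy as np
from keras.callbacks import Callback

class PatientCheckpoint(Callback):
    """Save the best model so far, retrying every 5 s if the .hdf5 file is locked."""

    def __init__(self, filepath, monitor='val_loss', max_retries=12):
        super(PatientCheckpoint, self).__init__()
        self.filepath = filepath
        self.monitor = monitor
        self.max_retries = max_retries
        self.best = np.inf

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None or current >= self.best:
            return
        self.best = current
        for _ in range(self.max_retries):
            try:
                self.model.save(self.filepath, overwrite=True)
                return
            except (OSError, IOError):
                time.sleep(5)  # file still locked by another worker, try again
        else:
            print('Giving up on saving %s for this epoch.' % self.filepath)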
I’m still seeing this problem in Keras 2.2.4 using fit_generator with a Sequence class to serve up the training and validation data instead of a generator. I’ve been getting around the problem by not using ModelCheckpoint at all. Instead I use the EarlyStopping callback with restore_best_weights=True - and then I save the best weights after the model finishes training. Hope this helps!
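A sketch of that setup, reusing the generators and hyperparameters from the original post (patience=5 and the output filename are arbitrary choices here):

from keras.callbacks import EarlyStopping

# Stop when val_loss stops improving and roll back to the best weights seen,
# then save exactly once after training, outside the multiprocessing workers.
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    epochs=15,
                    use_multiprocessing=True,
                    workers=8,
                    callbacks=[early_stop],
                    verbose=2)
model.save('model.hdf5')  # single save; no per-epoch .hdf5 writes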
Hello, I was experiencing this issue and the only way I could solve it was by adding the following code at the beginning of the file. I found this solution here: https://groups.google.com/g/h5py/c/0kgiMVGSTBE
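The snippet itself did not survive in this comment; judging from the linked h5py thread and the next comment, it presumably disables HDF5 file locking before h5py is loaded, roughly:

# Assumed reconstruction: turn off HDF5 file locking before h5py/keras are imported.
import os
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

import keras  # imported only after the environment variable is set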
Hi, I have the same problem. I tried to debug a bit but didn't manage to find where the problem is. For whatever reason, the .hdf5 file is not fully closed, or a lock held by the h5py lib is still active, when the callback of the next epoch tries to save the weights again. Apparently people do have problems when using h5py with file locks. I found a temporary solution for now and it seems to work; add this to your script (or environment):
export HDF5_USE_FILE_LOCKING=FALSE
I think @ale152's reply deserves more attention; it's quite a workaround. Nice work!
Hi, I solved this issue by using a generic filename; the example given in Keras formats the epoch number and val_loss into the filename (see the sketch after this comment). Make sure, though, that you have enough space on your hard disk in case the models are very big 😉.
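The exact filename pattern from the comment above is not preserved here; a sketch following the formatted-filepath style from the Keras ModelCheckpoint docs (the directory and format string are just examples):

from keras.callbacks import ModelCheckpoint

# Each improvement is written to its own file, so successive epochs never
# compete for the same .hdf5; the newest file is the best model so far.
checkpoint = ModelCheckpoint(
    filepath='models/weights.{epoch:02d}-{val_loss:.2f}.hdf5',
    monitor='val_loss',
    save_best_only=True,
    verbose=1)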
Same problem with keras 2.3.1, tensorflow-gpu 2.1.0 and h5py 2.8.0. Any new solution??
My problem was caused by having a ':' in my h5 file's name; removing the ':' made the problem disappear.
I have been struggling with this issue as well with multiprocessing. I have found that after 30-ish epochs with use_multiprocessing=True and ModelCheckpoint(filepath='weights.best.hdf5', save_best_only=True) on the same kernel in a Jupyter notebook, I get the OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable') error.
I've tried as few as 7 workers and as many as 31 workers. I have also tried deleting old *.hdf5 files. If I restart the Jupyter notebook kernel, it works for another 30-ish epochs, then errors out.
Keras 2.2.4 tensorflow-gpu 1.12.0 h5py 2.8.0
Same issue with Keras 2.2.4 and Tensorflow-gpu 1.11.0.
As a workaround you can use the formatting options for the filepath described in https://keras.io/callbacks/#modelcheckpoint in combination with save_best_only=True. This way you will end up with multiple saved models, but the last one will be the best one.
Same problem here: after the second epoch the model cannot be saved. I'm using Keras 2.2.2, TensorFlow 1.10.1 and h5py 2.8.0, with use_multiprocessing=True and fit_generator, and ModelCheckpoint as the only callback.
I am having this same issue using fit_generator with use_multiprocessing=True and callbacks=ModelCheckpoint. I’m using keras 2.2.2.
It seems that we need to create all the parent directories before using the ModelCheckpoint callback. It works with tensorflow-gpu 2.1.0 on my machine (Ubuntu 16.04).
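A minimal sketch of that, reusing the checkpoint path from the log at the top of this issue:

import os
from keras.callbacks import ModelCheckpoint

filepath = 'models/002008.SZ/model.hdf5'
# Create the parent directories up front so the checkpoint never targets a
# path that does not exist yet.
os.makedirs(os.path.dirname(filepath), exist_ok=True)
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', save_best_only=True)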
I have the same problem. Keras 2.2.5, h5py 2.9.0, tensorflow 1.12.0. Still no solution from the Keras team?
ModelCheckpointLight.txt
Hey, sure. I copied/pasted the ModelCheckpoint class and modified only the saving part.
Same here, when the CPU count is large: to be more specific, this appears when cpu_counts = 48 and never appears when cpu_counts = 12. Keras 2.2.4, tensorflow-gpu 1.12.0, h5py 2.8.0.
I'm facing the same issue with the libraries below:
But if ModelCheckpoint's save_best_only is False, it has never occurred, in my environment at least.
I have the same issue with Keras 2.2.4. I am using Ubuntu 18.10, installed Keras and TensorFlow with Anaconda, and am using multiprocessing to speed up training, but there is usually this same error after the second or third epoch of training.
Interestingly, my problem also gets solved by changing to Keras 2.2.0: https://github.com/keras-team/keras/issues/10948#issuecomment-423767210
However, I still hit the 'resource temporarily unavailable' issue a lot too. I usually resort to either only saving the last model, or saving to a new file each time.