keras: model.save() threw an OSError: Unable to create file (error message = 'resource temporarily unavailable')
Well, I built a Keras model, and since my dataset is sometimes too large to fit into memory, a MemoryError was thrown. So I searched around and figured out that I need to implement a generator class inheriting from keras.utils.Sequence, so that I can use model.fit_generator and model.predict_generator.
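For illustration only, a minimal sketch of such a Sequence subclass, assuming the data sits in two NumPy arrays x and y (the real class would load batches from disk instead):

# Illustrative only: a bare-bones keras.utils.Sequence that serves (x, y) batches.
import numpy as np
from keras.utils import Sequence

class BatchSequence(Sequence):
    def __init__(self, x, y, batch_size=32):
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Return one batch; only this slice needs to be in memory at a time.
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.x[lo:hi], self.y[lo:hi]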
And I have several callbacks in my fit_generator call, including ModelCheckpoint to save my model as a .hdf5 file, and I pass use_multiprocessing=True, workers=16 to fit_generator. Here is a snippet of this call:
# `monitor` is the list of callbacks, including the ModelCheckpoint instance.
self.model.fit_generator(generator=training_generator,
                         validation_data=validation_generator,
                         epochs=epoch,
                         use_multiprocessing=True,
                         workers=8,
                         callbacks=monitor,
                         verbose=2)
The error message is attached below:
Epoch 00005: val_loss did not improve
Epoch 6/15
- 135s - loss: 52.3622 - val_loss: 74.5698
Epoch 00006: val_loss improved from 74.99819 to 74.56982, saving model to models/002008.SZ/model.hdf5
Epoch 7/15
- 135s - loss: 52.3163 - val_loss: 74.2776
Epoch 00007: val_loss improved from 74.56982 to 74.27758, saving model to models/002008.SZ/model.hdf5
Traceback (most recent call last):
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1664, in <module>
main()
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/root/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/data/CNN_dlw_v0/CNN_dlw/main.py", line 100, in <module>
main()
File "/data/CNN_dlw_v0/CNN_dlw/main.py", line 72, in main
model.train(epoch=15)
File "/data/CNN_dlw_v0/CNN_dlw/Model.py", line 383, in train
verbose=2)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/engine/training.py", line 2280, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/callbacks.py", line 447, in on_epoch_end
self.model.save(filepath, overwrite=True)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/engine/topology.py", line 2576, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/models.py", line 106, in save_model
with h5py.File(filepath, mode='w') as f:
File "/data/SkyCompute/lib/python3.6/site-packages/h5py/_hl/files.py", line 271, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/data/SkyCompute/lib/python3.6/site-packages/h5py/_hl/files.py", line 107, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (Unable to lock file, errno = 11, error message = 'resource temporarily unavailable')
^C
Process finished with exit code 1
Notice that sometimes the model is saved successfully, but it seems to be probabilistic: the more save actions there are, the more likely this OSError becomes. I think it could be that, because I set use_multiprocessing=True and workers=16, one process is trying to save the file while another process is still accessing the same file? I'm not quite sure what exactly happens here. I think Keras should have some internal control that prevents this.
Edit: When I set use_multiprocessing=False, this OSError stops popping up; however, without multiprocessing the training procedure is much slower now. One not-so-elegant solution I can think of is to drop ModelCheckpoint and only save the model after training, so that I can still use multiprocessing during training, but that way I will not be able to save the model that corresponds to the smallest val_loss. :<
Edit2: I checked that my keras version is 2.1.4
About this issue
- State: closed
- Created 6 years ago
- Reactions: 29
- Comments: 75 (6 by maintainers)
Commits related to this issue
- Classification will not resume training if it fails between epochs. This is because of the issue https://github.com/keras-team/keras/issues/11101 — committed to mattyws/multiclassification by deleted user 5 years ago
Hi,
I have just solved this issue by uninstalling h5py 2.8.0 and reinstalling h5py 2.7.1. It now works with Keras 2.2.4 and tensorflow-gpu 1.12.0. Please cross-check it for other Keras and TensorFlow versions.
In my case, I encountered this issue when moving my Python scripts to another machine. And indeed, as I had blindly reinstalled all my Python libraries without paying attention to the individual versions, this error appeared when checkpointing my models:
OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
As it wasn't crashing on the previous machine, I began trying all the tips in this thread, but without any success, until I found that the only difference from my previous setup was the h5py library's version.
TIP: if it still crashes after downgrading to h5py 2.7.1, delete the previous *.hdf5 files that you wish to overwrite and it should work.
We're experiencing the same issue using the "model_checkpoint" callback; we also have use_multiprocessing=True. It happens after the second epoch.
Yeah, this is super frustrating. I've tried all of the above (threads, workers, unique names, saving just once…), and all of it fails on my first attempt to save weights (but only to a specific mount).
Having to implement a different checkpoint class to work around this means it is properly broken. Dear Keras devs, I know this one might be hard to reproduce (I think NAS mounts may be a commonality, a race condition perhaps), but please believe us: this is a real problem.
I hope you find a good solution.
I noticed that the issue disappears when I set:
If you want to keep the standard behaviour of ModelCheckpoint, use this patched version. It simply catches any error during saving and keeps retrying to save the model every 5 seconds.
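The patched file itself is not reproduced in this thread; a stripped-down sketch of the same idea (save-best-only on val_loss, retrying every 5 seconds when the save raises an OSError; the class name and max_retries limit are made up here) could look like this:

import time
import numpy as np
from keras.callbacks import Callback

class PatientCheckpoint(Callback):
    """Save the best model so far, retrying every 5 s if the .hdf5 file is locked."""

    def __init__(self, filepath, monitor='val_loss', max_retries=12):
        super(PatientCheckpoint, self).__init__()
        self.filepath = filepath
        self.monitor = monitor
        self.max_retries = max_retries
        self.best = np.inf

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None or current >= self.best:
            return
        self.best = current
        for _ in range(self.max_retries):
            try:
                self.model.save(self.filepath, overwrite=True)
                return
            except (OSError, IOError):
                time.sleep(5)  # file still locked by another worker, try again
        else:
            print('Giving up on saving %s for this epoch.' % self.filepath)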
I’m still seeing this problem in Keras 2.2.4 using fit_generator with a Sequence class to serve up the training and validation data instead of a generator. I’ve been getting around the problem by not using ModelCheckpoint at all. Instead I use the EarlyStopping callback with restore_best_weights=True - and then I save the best weights after the model finishes training. Hope this helps!
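A sketch of that setup, reusing the generators and hyperparameters from the original post (patience=5 and the output filename are arbitrary choices here):

from keras.callbacks import EarlyStopping

# Stop when val_loss stops improving and roll back to the best weights seen,
# then save exactly once after training, outside the multiprocessing workers.
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    epochs=15,
                    use_multiprocessing=True,
                    workers=8,
                    callbacks=[early_stop],
                    verbose=2)
model.save('model.hdf5')  # single save; no per-epoch .hdf5 writes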
Hello, I was experiencing this issue and the only way I could solve it was by adding the following code at the beginning of the file. I found this solution here: https://groups.google.com/g/h5py/c/0kgiMVGSTBE
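The snippet itself did not survive in this comment; judging from the linked h5py thread and the next comment, it presumably disables HDF5 file locking before h5py is loaded, roughly:

# Assumed reconstruction: turn off HDF5 file locking before h5py/keras are imported.
import os
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

import keras  # imported only after the environment variable is set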
Hi, I have the same problem. I tried to debug a bit but didn't manage to find where the problem is. For whatever reason, the .hdf5 file is not fully closed, or a lock held by the h5py lib is still active, when the callback of the next epoch tries to save the weights again. Apparently people do have problems when using h5py with file locks. I found a temporary solution for now and it seems to work; add this to your script (or environment):
export HDF5_USE_FILE_LOCKING=FALSE
I think @ale152's reply deserves more attention; it's quite a workaround. Nice work!
Hi, I solved this issue by using a generic filename; the example given in Keras formats the epoch number and val_loss into the filename (see the sketch after this comment). Make sure, though, that you have enough space on your hard disk in case the models are very big 😉.
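The exact filename pattern from the comment above is not preserved here; a sketch following the formatted-filepath style from the Keras ModelCheckpoint docs (the directory and format string are just examples):

from keras.callbacks import ModelCheckpoint

# Each improvement is written to its own file, so successive epochs never
# compete for the same .hdf5; the newest file is the best model so far.
checkpoint = ModelCheckpoint(
    filepath='models/weights.{epoch:02d}-{val_loss:.2f}.hdf5',
    monitor='val_loss',
    save_best_only=True,
    verbose=1)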
Same problem with keras 2.3.1, tensorflow-gpu 2.1.0 and h5py 2.8.0. Any new solution??
My problem was caused by having a ':' in my h5 file's name; removing the ':' made the problem disappear.
I have been struggling with this issue as well with multiprocessing. I have found that after 30-ish epochs with use_multiprocessing=True and ModelCheckpoint(filepath='weights.best.hdf5', save_best_only=True) on the same kernel in a Jupyter notebook, I get the OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable') error.
I've tried as few as 7 workers and as many as 31 workers. I have also tried deleting old *.hdf5 files. If I restart the Jupyter notebook kernel, it works for another 30-ish epochs, then errors out.
Keras 2.2.4 tensorflow-gpu 1.12.0 h5py 2.8.0
Same issue with Keras 2.2.4 and Tensorflow-gpu 1.11.0.
As a workaround you can use the formatting options for the filepath described in https://keras.io/callbacks/#modelcheckpoint in combination with save_best_only=True. This way you will end up with multiple saved models, but the last one will be the best one.
Same problem here: after the second epoch the model cannot be saved. I'm using Keras 2.2.2, TensorFlow 1.10.1 and h5py 2.8.0, with use_multiprocessing=True and fit_generator, and ModelCheckpoint as the only callback.
I am having this same issue using fit_generator with use_multiprocessing=True and callbacks=ModelCheckpoint. I’m using keras 2.2.2.
It seems that we need to create all the parent directories before using the ModelCheckpoint callback. It works with tensorflow-gpu 2.1.0 on my machine (Ubuntu 16.04).
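A minimal sketch of that, reusing the checkpoint path from the log at the top of this issue:

import os
from keras.callbacks import ModelCheckpoint

filepath = 'models/002008.SZ/model.hdf5'
# Create the parent directories up front so the checkpoint never targets a
# path that does not exist yet.
os.makedirs(os.path.dirname(filepath), exist_ok=True)
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', save_best_only=True)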
I have the same problem. Keras 2.2.5, h5py 2.9.0, tensorflow 1.12.0. Still no solution from the Keras team?
ModelCheckpointLight.txt
Hey, sure. I copied/pasted the ModelCheckpoint class and modified only the saving part.
Same here, when the CPU count is large: to be more specific, this appears when cpu_counts = 48 and never appears when cpu_counts = 12. Keras 2.2.4, tensorflow-gpu 1.12.0, h5py 2.8.0.
I'm facing the same issue with the libraries below:
But if ModelCheckpoint's save_best_only is False, it has never occurred, in my environment at least.
I have the same issue with Keras 2.2.4. I am using Ubuntu 18.10, installed Keras and TensorFlow with Anaconda, and am using multiprocessing to speed up training, but there is usually this same error after the second or third epoch of training.
Interestingly, my problem also gets solved by changing to Keras 2.2.0: https://github.com/keras-team/keras/issues/10948#issuecomment-423767210
However, I still hit the 'resource temporarily unavailable' issue a lot too. I usually resort to either only saving the last model, or saving to a new file each time.