tensorflow: Windows - tensorflow.python.framework.errors_impl.UnknownError: Failed to rename:
System information
OS Name: Microsoft Windows 10 Enterprise
OS Version: 10.0.17763 N/A Build 17763
TensorFlow installed using ‘conda’.
tensorflow v2.2.0-rc4-8-g2b96f3662b 2.2.0
Python 3.6.10 |Anaconda, Inc.| (default, Jan 7 2020, 15:18:16) [MSC v.1916 64 bit (AMD64)] on win32
Describe the current behavior
Saving checkpoint files from tensorflow is failing on Windows 10.
Traceback (most recent call last):
  File "C:\Users\<redacted>\Miniconda3\envs\<redacted>\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\<redacted>\Miniconda3\envs\<redacted>\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Git\<redacted>\tests\integration\validate_train_model.py", line 216, in <module>
    main()
  File "C:\Git\<redacted>\tests\integration\validate_train_model.py", line 176, in main
    fig_save_freq = fig_save_freq)
  File "c:\git\<redacted>\src\pointnet\model.py", line 640, in fit
    self.save_best_model()
  File "c:\git\<redacted>\src\pointnet\model.py", line 493, in save_best_model
    check_interval = False)
  File "C:\Users\<redacted>\Miniconda3\envs\<redacted>\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 823, in save
    self._record_state()
  File "C:\Users\<redacted>\Miniconda3\envs\<redacted>\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 728, in _record_state
    save_relative_paths=True)
  File "C:\Users\<redacted>\Miniconda3\envs\<redacted>\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 248, in update_checkpoint_state_internal
    text_format.MessageToString(ckpt))
  File "C:\Users\<redacted>\Miniconda3\envs\<redacted>\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 532, in atomic_write_string_to_file
    rename(temp_pathname, filename, overwrite)
  File "C:\Users\<redacted>\Miniconda3\envs\<redacted>\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 491, in rename
    rename_v2(oldname, newname, overwrite)
  File "C:\Users\<redacted>\Miniconda3\envs\<redacted>\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 508, in rename_v2
    compat.as_bytes(src), compat.as_bytes(dst), overwrite)
tensorflow.python.framework.errors_impl.UnknownError: Failed to rename: tests\files\checkpoints\0000_00_00_00_00_00\checkpoint.tmpc6ee5d6bc5a445c884bba8c3acadf01f to: tests\files\checkpoints\0000_00_00_00_00_00\checkpoint : Access is denied.
; Input/output error
Problem traced to: tensorflow.python.lib.io.file_io, line 532, function atomic_write_string_to_file
From debugging, tensorflow attempts to create, then overwrite a file while saving a checkpoint. For some reason, the ‘overwrite’ parameter, although set to True, does nothing. This causes the rename to fail (since the file seems to get created earlier in the checkpoint save process).
We tried deleting the ‘checkpoint’ file before the ‘save’, but the checkpoint file that it’s trying to overwrite appears to be created as a part of the ‘save’ call.
I was able to get checkpoint saving working again by modifying atomic_write_string_to_file as follows. My change checks for existence of the rename target and deletes it using os.remove if overwrite is True, rather than relying on the tensorflow custom machinery that doesn’t seem to be working:
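# Patched copy of the function in tensorflow/python/lib/io/file_io.py.
# It relies on names already available in that module (os, uuid, errors, ...).
def atomic_write_string_to_file(filename, contents, overwrite=True):
    if not has_atomic_move(filename):
        write_string_to_file(filename, contents)
    else:
        temp_pathname = filename + ".tmp" + uuid.uuid4().hex
        write_string_to_file(temp_pathname, contents)
        try:
            # Workaround: delete the rename target explicitly instead of
            # relying on the 'overwrite' flag. Note that the write is no
            # longer atomic between the remove and the rename.
            if overwrite and os.path.exists(filename):
                os.remove(filename)
            rename(temp_pathname, filename, overwrite)
        except errors.OpError:
            delete_file(temp_pathname)
            raise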
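For comparison, here is a minimal sketch of the same write-then-rename pattern in plain Python, outside TensorFlow. It uses the standard library's `os.replace`, which overwrites an existing destination file on Windows as well as on POSIX. The helper name `atomic_write_text` is hypothetical and not part of TensorFlow. This only illustrates the intended overwrite behavior; it can still fail with a permission error if another process holds the destination file open.

```python
import os
import tempfile
import uuid


def atomic_write_text(filename, contents):
    """Write contents to a temp file next to filename, then swap it into place.

    Hypothetical helper for illustration only; not a TensorFlow API.
    """
    temp_pathname = filename + ".tmp" + uuid.uuid4().hex
    with open(temp_pathname, "w", encoding="utf-8") as f:
        f.write(contents)
    try:
        # os.replace overwrites an existing destination file on all platforms.
        os.replace(temp_pathname, filename)
    except OSError:
        # The swap failed (for example, the destination is locked):
        # remove the temp file so it does not accumulate, then re-raise.
        if os.path.exists(temp_pathname):
            os.remove(temp_pathname)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "checkpoint")
        atomic_write_text(target, "first")
        atomic_write_text(target, "second")  # overwrites the existing file
        with open(target, encoding="utf-8") as f:
            print(f.read())
```

Running this on the affected machine, with the target placed in the real checkpoint directory, may help tell apart a problem in TensorFlow's rename from a lock held by another process.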
The stack trace we got suggested that this is the same issue as someone was reporting for tensorflow.models: https://github.com/tensorflow/models/issues/4177
Describe the expected behavior
We should be able to successfully save a checkpoint on Windows 10.
About this issue
- State: open
- Created 4 years ago
- Reactions: 9
- Comments: 27 (1 by maintainers)
I am getting this same error when trying to train a custom object detection model using TF2:
tensorflow.python.framework.errors_impl.UnknownError: Failed to rename: training\checkpoint.tmpa7a5285bf7fa4fc1861942fc88f3e099 to: training\checkpoint : Access is denied. ; Input/output error
Just rename the “checkpoint” directory to something else, for example “cp” (remember to update the paths in pipeline.config).
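One plausible reading of this suggestion (an assumption, not confirmed in the thread): the error message shows TensorFlow renaming a temp file to a file literally named `checkpoint` inside the model directory, so a folder with that same name at that location would block the rename. A minimal pre-flight check, using a hypothetical helper name, could look like this:

```python
import os


def checkpoint_name_collision(model_dir):
    """Return True if <model_dir>/checkpoint exists and is a directory.

    Hypothetical helper: a directory at this path would collide with the
    state file named 'checkpoint' that the error message shows being written.
    """
    return os.path.isdir(os.path.join(model_dir, "checkpoint"))


if __name__ == "__main__":
    # 'training' is the model directory from the error message above.
    if checkpoint_name_collision("training"):
        print("Rename the 'checkpoint' folder and update pipeline.config.")
```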
I am getting this error as well.
Stackoverflow thread: https://stackoverflow.com/questions/65461750/tensorflow-python-framework-errors-impl-unknownerror-failed-to-rename-access
Code:
Output
I am on Windows 10; TF 2.3; Python 3.7.9;
Here’s conda list
Can someone please help?
I had this issue, and moving my code and data outside of the Dropbox directory solved the problem. (This didn’t use to happen with Dropbox, but now it does.)
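If the cause is a sync or antivirus process holding the file for a moment, retrying the save after a short pause may get past it. The sketch below is a generic retry wrapper with exponential backoff; the name `retry_on_error` and its parameters are hypothetical, and it does not address the underlying rename behavior.

```python
import time


def retry_on_error(fn, retry_exceptions=(OSError,), attempts=5, initial_delay=0.1):
    """Call fn(), retrying with exponential backoff on the given exceptions.

    Hypothetical helper for illustration; the last failure is re-raised.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= 2  # wait twice as long before the next attempt
```

With TensorFlow, one might call it as `retry_on_error(manager.save, retry_exceptions=(tf.errors.UnknownError,))`, where `manager` is assumed to be a `tf.train.CheckpointManager`. This usage is untested here.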
Failed to rename: path\trial_5a095c02600a30dc086a9efe046b1272\checkpoints\epoch_0\checkpoint_temp/part-00000-of-00001.data-00000-of-00001 to: path\trial_5a095c02600a30dc086a9efe046b1272\checkpoints\epoch_0\checkpoint.data-00000-of-00001
Try to make this path shorter than 255 characters.
I was having the same access denied issue. I followed this advice and it solved the issue for me with a caveat. I couldn’t use /Temp for some reason (was getting permission denied). But when I used /ProgramData/PythonTraining/my_checkpoint all errors went away.
ps: note I had long file names enabled in registry as well prior to this and it did not help.
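The path-length suggestion above can be checked before training starts. This is a minimal sketch with hypothetical helper names; the suffixes are taken from the file names in the error message quoted above, and the 255-character limit is the figure given in the comment, not a verified threshold.

```python
import os

# Suffixes appended to the checkpoint prefix, as seen in the error message.
CHECKPOINT_SUFFIXES = (
    ".data-00000-of-00001",
    "_temp/part-00000-of-00001.data-00000-of-00001",
)


def longest_checkpoint_path(prefix, suffixes=CHECKPOINT_SUFFIXES):
    """Return the length of the longest absolute path derived from prefix."""
    abs_prefix = os.path.abspath(prefix)
    return max(len(abs_prefix + suffix) for suffix in suffixes)


def fits_path_limit(prefix, limit=255):
    """Return True if every derived path is shorter than limit characters."""
    return longest_checkpoint_path(prefix) < limit


if __name__ == "__main__":
    prefix = os.path.join("checkpoints", "epoch_0", "checkpoint")
    print(longest_checkpoint_path(prefix), fits_path_limit(prefix))
```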
I can confirm problem still exists for both tensorflow 2.3 and 2.4 in Windows. I have tried all recommended solutions, including modifying the atomic_write_string_to_file function as described on top, specifying different folders for checkpoint and save, shutting down antivirus and all cloud back up services etc. But still ran into “failed to rename error” repeatedly in normal tensorflow model training.
I guess it’s time for Linux? WSL2 is premature and doesn’t have multiple-GPU support yet. I feel that Windows is just not very loved.
Are you running another process during training, such as an evaluation job?
I am also getting this error on: Windows 10; TF 2.3; Python 3.7
tensorflow.python.framework.errors_impl.UnknownError: Failed to rename: path\trial_5a095c02600a30dc086a9efe046b1272\checkpoints\epoch_0\checkpoint_temp/part-00000-of-00001.data-00000-of-00001 to: path\trial_5a095c02600a30dc086a9efe046b1272\checkpoints\epoch_0\checkpoint.data-00000-of-00001 : Access is denied. ; Input/output error [Op:MergeV2Checkpoints]
Unfortunately, @dtmaidenmueller’s workaround did not work for me. I am also using Windows 10 Enterprise, with Python 3.8.5, TensorFlow 2.3.0 and Keras-Tuner 1.0.1. I am also saving the results for visualization in TensorBoard. TensorFlow was installed without conda. The error started to appear only after I increased the maximum number of trials for the tuner from 30 to 150 (and above) and moved the code that calls keras-tuner from a standalone script into a function, which is now called by another Python script.
Is there anyone who works on tensorflow.python.io who could comment on why, sometimes, the ‘overwrite’ flag for ‘rename’ from that subpackage does nothing? It would be cleaner to fix the underlying code than to proceed with my workaround.