tensorflow: "Could not append to the internal temporary file" when writing checkpoints to GCP during TPU training
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian GNU/Linux 9.11
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.2.0.dev20200119
- Python version: Python 3.7.6
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
Describe the current behavior
When training a model for 2+ days on a TPU pod and saving checkpoints to a Google Cloud Storage bucket, checkpoint writing fails with tensorflow.python.framework.errors_impl.InternalError: Could not append to the internal temporary file. [Op:ReadVariableOp]
This has happened a couple of times and is fixed by acquiring a new pod, so I suspect it is caused by running out of space (on the TPU host, it seems). If that is the case, is it possible to detect when this is about to happen and delete old checkpoints so that training does not crash?
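If pruning old checkpoints is the right mitigation, this is a minimal sketch of what I have in mind (the bucket path and the Keras model/optimizer are hypothetical stand-ins for the real tracked objects; `tf.train.CheckpointManager` deletes older checkpoints automatically, though I don't know whether that would actually avoid the internal-temporary-file error):

```python
import tensorflow as tf

# Hypothetical stand-ins for the objects my real script tracks in the checkpoint.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)

# Keep only the 5 most recent checkpoints; older ones are deleted on each save().
manager = tf.train.CheckpointManager(
    checkpoint,
    directory="gs://my-bucket/model-output",  # hypothetical bucket path
    max_to_keep=5)

# In the training loop, replace checkpoint.save(...) with:
save_path = manager.save()
print("Saved checkpoint to", save_path)
```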
Describe the expected behavior
Detect that space on the TPU host (?) is close to full and delete old checkpoints.
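For the detection half, this is a rough sketch of the kind of check I mean (it assumes the relevant disk is the local VM's filesystem, which may not be where the internal temporary file actually lives):

```python
import shutil

def warn_if_low_disk(path="/tmp", min_free_gb=5.0):
    """Print a warning if the filesystem holding `path` has little free space."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1e9
    if free_gb < min_free_gb:
        print(f"WARNING: only {free_gb:.1f} GB free on {path}; "
              "old checkpoints should be deleted before the next save.")
    return free_gb

# Call before each checkpoint.save(...) in the training loop.
warn_if_low_disk("/tmp")
```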
Standalone code to reproduce the issue
Standard model run on a TPU pod; the issue wouldn't be reproducible without running a model for more than 2 days.
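For reference, the checkpointing pattern that triggers the failure looks roughly like the sketch below (illustrative only; the TPU name, bucket path, and model are hypothetical, and `tf.distribute.experimental.TPUStrategy` matches the tf-nightly 2.2 build I'm on):

```python
import os
import tensorflow as tf

output_dir = "gs://my-bucket/model-output"  # hypothetical GCS bucket

# Connect to the TPU pod (TPU name is hypothetical).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-pod")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)

# ... long-running training loop (2+ days) ...

# Periodic checkpoint saves to the GCS bucket; this is the call that
# eventually fails with "Could not append to the internal temporary file".
checkpoint.save(os.path.join(output_dir, "ckpt"))
```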
Other info / logs
Full traceback:
File "/home/helen/.../estimator.py", line 233, in train_and_evaluate
checkpoint.save(os.path.join(output_dir, "ckpt"))
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 1927, in save
file_path = self.write("%s-%d" % (file_prefix, checkpoint_number))
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 1857, in write
output = self._saver.save(file_prefix=file_prefix)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 1187, in save
file_prefix=file_prefix_tensor, object_graph_tensor=object_graph_tensor)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 1135, in _save_cached_when_graph_building
save_op = saver.save(file_prefix)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/saving/functional_saver.py", line 250, in save
sharded_saves.append(saver.save(shard_prefix))
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/saving/functional_saver.py", line 70, in save
tensors.append(spec.tensor)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/saving/saveable_object.py", line 55, in tensor
return self._tensor() if callable(self._tensor) else self._tensor
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/saving/saveable_object_util.py", line 91, in f
x = v.read_value()
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 638, in read_value
value = self._read_variable_op()
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 616, in _read_variable_op
self._dtype)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/ops/gen_resource_variable_ops.py", line 479, in read_variable_op
_ops.raise_from_not_ok_status(e, name)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6625, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Could not append to the internal temporary file. [Op:ReadVariableOp]
About this issue
- State: closed
- Created 4 years ago
- Comments: 17 (10 by maintainers)
Yup, the local GCP VM (the one where we store the code, request the TPU, etc.).
For me, I tried requesting another TPU pod as previously recommended, but it didn't work. Is it possible that this error can be caused by multiple different issues?
Yes, this only happens on pods.