tensorflow: "Could not append to the internal temporary file" when writing checkpoints to GCP during TPU training
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian GNU/Linux 9.11
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.2.0.dev20200119
- Python version: Python 3.7.6
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
Describe the current behavior
When training a model for 2+ days on a TPU pod and saving checkpoints to a Google Cloud Storage bucket, checkpoint writing fails with tensorflow.python.framework.errors_impl.InternalError: Could not append to the internal temporary file. [Op:ReadVariableOp]
This has happened a couple of times and is fixed by acquiring a new pod, so I suspect it is caused by running out of space (on the TPU host, it seems). If that is the case, is it possible to detect when this is about to happen and delete old checkpoints so that training does not crash?
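If pruning old checkpoints is the right mitigation, this is a minimal sketch of what I have in mind (the bucket path and the Keras model/optimizer are hypothetical stand-ins for the real tracked objects; `tf.train.CheckpointManager` deletes older checkpoints automatically, though I don't know whether that would actually avoid the internal-temporary-file error):

```python
import tensorflow as tf

# Hypothetical stand-ins for the objects my real script tracks in the checkpoint.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)

# Keep only the 5 most recent checkpoints; older ones are deleted on each save().
manager = tf.train.CheckpointManager(
    checkpoint,
    directory="gs://my-bucket/model-output",  # hypothetical bucket path
    max_to_keep=5)

# In the training loop, replace checkpoint.save(...) with:
save_path = manager.save()
print("Saved checkpoint to", save_path)
```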
Describe the expected behavior
Detect that space on the TPU host (?) is close to full and delete old checkpoints.
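For the detection half, this is a rough sketch of the kind of check I mean (it assumes the relevant disk is the local VM's filesystem, which may not be where the internal temporary file actually lives):

```python
import shutil

def warn_if_low_disk(path="/tmp", min_free_gb=5.0):
    """Print a warning if the filesystem holding `path` has little free space."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1e9
    if free_gb < min_free_gb:
        print(f"WARNING: only {free_gb:.1f} GB free on {path}; "
              "old checkpoints should be deleted before the next save.")
    return free_gb

# Call before each checkpoint.save(...) in the training loop.
warn_if_low_disk("/tmp")
```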
Standalone code to reproduce the issue
Standard model run on a TPU pod; the issue wouldn't be reproducible without running a model for more than 2 days.
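For reference, the checkpointing pattern that triggers the failure looks roughly like the sketch below (illustrative only; the TPU name, bucket path, and model are hypothetical, and `tf.distribute.experimental.TPUStrategy` matches the tf-nightly 2.2 build I'm on):

```python
import os
import tensorflow as tf

output_dir = "gs://my-bucket/model-output"  # hypothetical GCS bucket

# Connect to the TPU pod (TPU name is hypothetical).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-pod")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)

# ... long-running training loop (2+ days) ...

# Periodic checkpoint saves to the GCS bucket; this is the call that
# eventually fails with "Could not append to the internal temporary file".
checkpoint.save(os.path.join(output_dir, "ckpt"))
```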
Other info / logs
Full traceback:
File "/home/helen/.../estimator.py", line 233, in train_and_evaluate
checkpoint.save(os.path.join(output_dir, "ckpt"))
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 1927, in save
file_path = self.write("%s-%d" % (file_prefix, checkpoint_number))
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 1857, in write
output = self._saver.save(file_prefix=file_prefix)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 1187, in save
file_prefix=file_prefix_tensor, object_graph_tensor=object_graph_tensor)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 1135, in _save_cached_when_graph_building
save_op = saver.save(file_prefix)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/saving/functional_saver.py", line 250, in save
sharded_saves.append(saver.save(shard_prefix))
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/saving/functional_saver.py", line 70, in save
tensors.append(spec.tensor)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/saving/saveable_object.py", line 55, in tensor
return self._tensor() if callable(self._tensor) else self._tensor
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/training/saving/saveable_object_util.py", line 91, in f
x = v.read_value()
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 638, in read_value
value = self._read_variable_op()
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 616, in _read_variable_op
self._dtype)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/ops/gen_resource_variable_ops.py", line 479, in read_variable_op
_ops.raise_from_not_ok_status(e, name)
File "/home/helen/anaconda3/envs/.../python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6625, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Could not append to the internal temporary file. [Op:ReadVariableOp]
About this issue
- State: closed
- Created 4 years ago
- Comments: 17 (10 by maintainers)
Yup, the local GCP VM (the one where we store the code, request the TPU, etc.).
For me, I tried requesting another TPU pod as previously recommended, but it didn't work. Is it possible that this error can be caused by multiple different issues?
Yes, this only happens on pods.