tensorflow: InvalidArgumentError for save/restore of variables (same version, same OS, same directory)

I get an InvalidArgumentError with no further detail when I save parts of my model and later try to restore them to continue training (I have to pause training because I need my laptop for class).

Initialization: saver = tf.train.Saver({"embeddings": embeddings, "weights": nce_weights, "biases": nce_biases})

Save: saver.save(sess, model_checkpoint_path)

Load: saver.restore(sess, model_checkpoint_path)

2018-04-21 22:45:00.143245: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: /Users/nroth/Documents/****/trained_model/****embeddings.ckpt.data-00000-of-00001; Invalid argument
Traceback (most recent call last):
  File "/Users/nroth/tf_python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/Users/nroth/tf_python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/nroth/tf_python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/Users/nroth/tf_python/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: /Users/nroth/Documents/****/trained_model/****embeddings.ckpt.data-00000-of-00001; Invalid argument
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

... <contains sensitive info> ...

InvalidArgumentError (see above for traceback): /Users/nroth/Documents/****/trained_model/****embeddings.ckpt.data-00000-of-00001; Invalid argument
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Clarification requested by tensorflowbutler:

Have I written custom code: Yes, I modified this code (https://github.com/PacktPublishing/TensorFlow-Machine-Learning-Cookbook/blob/master/Chapter 07/doc2vec.py) to work with TensorFlow 1.7 and to use the same embeddings variable for documents as for words with average instead of concatenation. I also updated the saved variables to include nce_weights and nce_biases so that training may be resumed.
OS Platform and Distribution: MacOS 10.13.4 (17E199)
TensorFlow installed from: pip on VirtualEnv, according to instructions (https://www.tensorflow.org/install/install_mac)
TensorFlow version: 1.7
Bazel version: NA
CUDA/cuDNN version: NA
GPU model and memory: NA
Exact command to reproduce:
saver = tf.train.Saver({"embeddings": embeddings, "weights": nce_weights, "biases": nce_biases})
saver.restore(sess, "…/trained_model/saved_stuff")

About this issue

State: closed
Created 6 years ago
Comments: 31 (12 by maintainers)

Most upvoted comments

I think the root cause may be in https://github.com/tensorflow/tensorflow/blob/r1.7/tensorflow/core/platform/hadoop/hadoop_file_system.cc, line 213:

tSize r = hdfs_->hdfsPread(fs_, file_, static_cast<tOffset>(offset), dst, static_cast<tSize>(n));

The last parameter, static_cast<tSize>(n), is the number of bytes requested, i.e., the tensor’s size. In https://github.com/tensorflow/tensorflow/blob/r1.7/third_party/hadoop/hdfs.h, line 75:

typedef int32_t tSize; /// size of data for read/write io ops

tSize is int32, but the tensor’s size is int64, so when the tensor is big enough, overflow occurs.
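
To illustrate the narrowing described above, here is a minimal Python sketch of what static_cast<tSize>(n) is assumed to do on typical two's-complement platforms when n does not fit in 32 bits. The helper name as_tsize and the example byte count are hypothetical and are not taken from the TensorFlow source.

```python
INT32_MAX = 2**31 - 1


def as_tsize(n):
    """Mimic narrowing a non-negative 64-bit byte count to a signed int32.

    Assumption: the cast keeps the low 32 bits and reinterprets them as a
    signed two's-complement value, which is what common compilers do.
    """
    low = n & 0xFFFFFFFF          # keep only the low 32 bits
    if low >= 2**31:              # high bit set: value is negative as int32
        return low - 2**32
    return low


# A hypothetical 3,000,000,000-byte tensor no longer fits in tSize.
requested = 3_000_000_000
print(requested > INT32_MAX)      # True
print(as_tsize(requested))        # a negative number of bytes
```

A read size that has wrapped to a negative or much smaller value would be consistent with the "Invalid argument" error, although this sketch does not prove that this code path is the one being hit.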

My solution was to use partitioned variables for large tensors, which solved the problem.

madaoLic on Jun 11, 2018
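
As a rough aid for the partitioning workaround above, the following Python sketch estimates how many row-wise shards a variable would need so that each shard stays within INT32_MAX bytes. The function names, the float32 item size, and the example shape are assumptions for illustration; the comment does not say which partitioning API or shard count was actually used.

```python
INT32_MAX = 2**31 - 1


def tensor_nbytes(shape, itemsize=4):
    """Total bytes for a dense tensor; itemsize=4 assumes float32."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


def min_row_shards(shape, itemsize=4, limit=INT32_MAX):
    """Smallest number of shards along axis 0 so each shard is <= limit bytes."""
    rows = shape[0]
    row_bytes = tensor_nbytes(shape[1:], itemsize)
    if row_bytes > limit:
        raise ValueError("a single row already exceeds the limit")
    rows_per_shard = limit // row_bytes
    # Integer ceiling division, with at least one shard.
    return max(1, -(-rows // rows_per_shard))


# Hypothetical embedding matrix: 10 million rows of 128 float32 values.
shape = (10_000_000, 128)
print(tensor_nbytes(shape))   # 5120000000
print(min_row_shards(shape))  # 3
```

In TensorFlow 1.x, one way to apply such a shard count is to pass a partitioner (for example tf.fixed_size_partitioner(num_shards)) to tf.get_variable when creating the variable.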

Sorry! It’s definitely fixed. I ran:

pip uninstall tensorflow
pip install tf-nightly

And now it works perfectly. Sorry about that!

Radilx on May 23, 2019

Finally got a chance to debug this on a Mac. Apparently the pread system call, despite taking an eight-byte size_t for its nbytes argument, returns EINVAL if “The sum of the iov_len values in the iov array overflowed a 32-bit integer.” Presumably pread is implemented in terms of readv, so they have the same limitation.

I have a change out for review which just limits reads to INT32_MAX on every platform. Seems to work if we do that. I checked that the checkpoints themselves were identical to what gets written on Linux, so existing checkpoints will start working once that change is in.

allenlavoie on May 8, 2019
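
The idea of capping each read at INT32_MAX can be sketched in Python as follows. This is not the actual TensorFlow change, only an illustration of the technique using os.pread (available on Unix-like systems); the function name pread_chunked and the max_chunk parameter are hypothetical.

```python
import os

INT32_MAX = 2**31 - 1


def pread_chunked(fd, nbytes, offset, max_chunk=INT32_MAX):
    """Read up to nbytes from fd starting at offset.

    No single os.pread call asks for more than max_chunk bytes, so a large
    request is split into several smaller reads.
    """
    parts = []
    remaining = nbytes
    while remaining > 0:
        chunk = os.pread(fd, min(remaining, max_chunk), offset)
        if not chunk:             # end of file reached before nbytes were read
            break
        parts.append(chunk)
        offset += len(chunk)      # short reads are handled by advancing by len(chunk)
        remaining -= len(chunk)
    return b"".join(parts)
```

Passing a small max_chunk makes the splitting easy to exercise on a small file without needing a multi-gigabyte checkpoint.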

The issue is still there

shkarupa-alex on Jan 21, 2019