tensorflow: InvalidArgumentError for save/restore of variables (same version, same OS, same directory)
I get an InvalidArgumentError with no further information when I try to save and then restore parts of my model later to continue training it (due to needing my laptop for class).
Initialization: saver = tf.train.Saver({"embeddings": embeddings, "weights": nce_weights, "biases": nce_biases})
Save: saver.save(sess, model_checkpoint_path)
Load: saver.restore(sess, model_checkpoint_path)
2018-04-21 22:45:00.143245: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: /Users/nroth/Documents/****/trained_model/****embeddings.ckpt.data-00000-of-00001; Invalid argument
Traceback (most recent call last):
File "/Users/nroth/tf_python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/Users/nroth/tf_python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/Users/nroth/tf_python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/Users/nroth/tf_python/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: /Users/nroth/Documents/****/trained_model/****embeddings.ckpt.data-00000-of-00001; Invalid argument
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
... <contains sensitive info> ...
InvalidArgumentError (see above for traceback): /Users/nroth/Documents/****/trained_model/****embeddings.ckpt.data-00000-of-00001; Invalid argument
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Clarification requested by tensorflowbutler:
- Have I written custom code: Yes. I modified this code (https://github.com/PacktPublishing/TensorFlow-Machine-Learning-Cookbook/blob/master/Chapter 07/doc2vec.py) to work with TensorFlow 1.7 and to use the same embeddings variable for documents as for words, with averaging instead of concatenation. I also updated the saved variables to include nce_weights and nce_biases so that training can be resumed.
- OS Platform and Distribution: macOS 10.13.4 (17E199)
- TensorFlow installed from: pip in a virtualenv, following the instructions (https://www.tensorflow.org/install/install_mac)
- TensorFlow version: 1.7
- Bazel version: NA
- CUDA/cuDNN version: NA
- GPU model and memory: NA
- Exact command to reproduce:
  saver = tf.train.Saver({"embeddings": embeddings, "weights": nce_weights, "biases": nce_biases})
  saver.restore(sess, "…/trained_model/saved_stuff")
About this issue
- State: closed
- Created 6 years ago
- Comments: 31 (12 by maintainers)
I think the root cause may be in https://github.com/tensorflow/tensorflow/blob/r1.7/tensorflow/core/platform/hadoop/hadoop_file_system.cc, line 213:
tSize r = hdfs_->hdfsPread(fs_, file_, static_cast<tOffset>(offset), dst, static_cast<tSize>(n));
The last parameter, static_cast<tSize>(n), is the number of bytes requested, i.e. the tensor's size. In https://github.com/tensorflow/tensorflow/blob/r1.7/third_party/hadoop/hdfs.h, line 75:
typedef int32_t tSize; /// size of data for read/write io ops
tSize is an int32, but the tensor's size is an int64, so the cast overflows when the tensor is large enough.
My solution was to use partitioned variables for large tensors, which solved the problem.
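To make the overflow concrete, here is a minimal Python sketch (plain Python, not TensorFlow code) that mimics a cast of a 64-bit byte count to a 32-bit signed value. The helper names and the example shape are hypothetical, and the wrap-around behaviour assumes a typical two's-complement platform.

```python
import ctypes

INT32_MAX = 2**31 - 1


def tensor_nbytes(shape, dtype_size=4):
    """Bytes needed for a dense tensor (dtype_size=4 assumes float32)."""
    n = dtype_size
    for dim in shape:
        n *= dim
    return n


def as_int32(n):
    """Mimic static_cast<int32_t>(n): keep the low 32 bits, read as signed."""
    return ctypes.c_int32(n).value


def min_partitions(nbytes, limit=INT32_MAX):
    """Smallest number of equal shards that keeps each shard at or under limit."""
    return -(-nbytes // limit)  # ceiling division


# Hypothetical embedding matrix: 6,000,000 rows x 100 float32 columns.
nbytes = tensor_nbytes((6_000_000, 100))
print(nbytes)                  # 2400000000
print(nbytes > INT32_MAX)      # True
print(as_int32(nbytes))        # -1894967296
print(min_partitions(nbytes))  # 2
```

A negative or truncated byte count is what the read call would then receive. Splitting the variable into partitions presumably helps because each shard is read separately and stays under the 32-bit limit.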
Sorry! It’s definitely fixed. I ran:
And now it works perfectly. Sorry about that!
Finally got a chance to debug this on a Mac. Apparently the pread system call, despite taking an eight-byte size_t for its nbytes argument, returns EINVAL if "The sum of the iov_len values in the iov array overflowed a 32-bit integer." And presumably pread is implemented in terms of readv, so they have the same limitation.
I have a change out for review which just limits reads to INT32_MAX on every platform. Seems to work if we do that. I checked that the checkpoints themselves were identical to what gets written on Linux, so existing checkpoints will start working once that change is in.
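The idea of capping each read can be sketched in Python with os.pread (Unix only). This is an illustration of the technique, not the actual TensorFlow change; the function name pread_capped and the max_chunk parameter are hypothetical, and the tiny cap in the demo is only there so the loop runs more than once.

```python
import os
import tempfile

INT32_MAX = 2**31 - 1


def pread_capped(fd, nbytes, offset, max_chunk=INT32_MAX):
    """Read nbytes starting at offset, never requesting more than max_chunk per call."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        data = os.pread(fd, min(remaining, max_chunk), offset)
        if not data:  # end of file reached before nbytes were read
            break
        chunks.append(data)
        offset += len(data)
        remaining -= len(data)
    return b"".join(chunks)


# Small demonstration: read 8 bytes from offset 1, at most 3 bytes per call.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    print(pread_capped(f.fileno(), 8, 1, max_chunk=3))  # b'12345678'
```

Because the loop only changes how the bytes are requested, the data on disk is untouched, which is consistent with the note above that existing checkpoints would start working without being rewritten.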
The issue is still there