tensorflow: FailedPreconditionError when restoring initializable_iterator with Scaffold in a MonitoredTrainingSession for the second time.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v1.3.0-rc2-20-g0787eee 1.3.0
- Python version: Python 3.5.1
- Bazel version (if compiling from source): -
- CUDA/cuDNN version: -
- GPU model and memory: -
- Exact command to reproduce: -
Context:
Using an `initializable_iterator` with `MonitoredTrainingSession` because the graph contains stateful `lookup_ops.index_table_from_tensor()` lookup tables, which do not work with a `one_shot_iterator`.
The `initializable_iterator` is initialized through a `tf.train.Scaffold()`:

```python
Scaffold = tf.train.Scaffold(
    init_op=control_flow_ops.group(
        variables.global_variables_initializer(),
        resources.initialize_resources(resources.shared_resources()),
        iter_init_op))

with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=hps.is_chief,
        scaffold=Scaffold,
        config=config,
        checkpoint_dir=hps.checkpoint_dir,
        hooks=hooks) as mon_sess:
    ...
```
Here, `iter_init_op` is equivalent to `iterator.initializer`.
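For concreteness, a minimal sketch of how `iter_init_op` might be constructed (the dataset contents are hypothetical; assumes the TF 1.3-era `tf.contrib.data` API):

```python
import tensorflow as tf

# Hypothetical input pipeline (tf.contrib.data was the Dataset API home in TF 1.3).
dataset = tf.contrib.data.Dataset.from_tensor_slices(["a", "b", "c"])
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# iter_init_op, as passed into the Scaffold's init_op above, is simply:
iter_init_op = iterator.initializer
```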
Problem
The initialization above works correctly when the model is created for the first time, and initial training runs without problems.
However, if the chief worker crashes or is shut down deliberately, then after a restart `MonitoredTrainingSession` raises the following error, as if the iterator had not been initialized:

```
FailedPreconditionError (see above for traceback): GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element...
```
Workaround
Right now the only solution that works for me is to run the initialization internally through the private `_coordinated_creator.tf_sess` attribute:

```python
mon_sess._coordinated_creator.tf_sess.run(iter_init_op)
```

This does not look like intended usage.
Question:
This does not seem like intended behaviour. What is the recommended way to use an `initializable_iterator` with `MonitoredTrainingSession`, or `lookup_ops.index_table_from_tensor` with a `one_shot_iterator`?
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 32 (13 by maintainers)
Related to @dongjk’s question/problem, I have an explicit example of loading tfrecords with the Dataset API on a modified version of the tensorflow/models/research/resnet example, with @ispirmustafa’s suggestion of using a hook (thank you for that suggestion btw, it solved a very long headache 😃). Hopefully it helps “connect the dots” in case anyone like me is having trouble piecing all of this together and making a working example with their own data. I really like the Dataset version of doing things, and am looking forward to seeing how it develops further. Thanks!

@dongjk you can use a hook to do all you need. Following is a code example.
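(The code example from this comment was not preserved in this archive; the sketch below is a hedged reconstruction of the hook-based approach, assuming TF 1.3-era APIs. The `IteratorInitializerHook` name and the toy dataset are illustrative, not from the original comment.)

```python
import tensorflow as tf

class IteratorInitializerHook(tf.train.SessionRunHook):
    """Hook that runs the iterator initializer once the session is created."""

    def __init__(self):
        self.iterator_initializer_func = None  # set after building the graph

    def after_create_session(self, session, coord):
        # Called after the session is created -- including after a restart that
        # restores from a checkpoint -- so the iterator is always initialized.
        self.iterator_initializer_func(session)

# Graph construction (illustrative data).
hook = IteratorInitializerHook()
dataset = tf.contrib.data.Dataset.from_tensor_slices([1, 2, 3])
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
hook.iterator_initializer_func = lambda sess: sess.run(iterator.initializer)

with tf.train.MonitoredTrainingSession(hooks=[hook]) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(next_element)
```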
This happens because the `init_op` is not run when the worker restarts from a checkpoint. The relevant implementation is in `SessionManager.prepare_session()`.

I think for the purposes of the `MonitoredSession`, an initializable iterator is more like a “local variable” (which are reinitialized on each worker when they start) than a “global variable” (which are initialized once by the chief, and then restored from checkpoints). Could you try moving the initializer to the `Scaffold.local_init_op` and see if that fixes things?

(This is clearly “not great”. We’re still figuring out a more elegant way to integrate Datasets with the `MonitoredSession` and `Estimator` APIs. Hopefully this suggestion works in the meantime.)

/cc @ispirmustafa for `MonitoredSession` wisdom.

see https://stackoverflow.com/questions/45945881/tf-train-monitoredtrainingsession-and-reinitializable-iterator-from-dataset
`Scaffold.local_init_op` works as intended, thanks for the suggestion. Here is a working example for future reference:

Datasets is a pleasant API, thanks for the effort.
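(The working example referenced above did not survive extraction; the following is a hedged sketch of the `Scaffold.local_init_op` approach with an illustrative dataset. Regrouping the default local initializers is an assumption based on `Scaffold` replacing, not extending, `local_init_op` when one is supplied.)

```python
import tensorflow as tf

dataset = tf.contrib.data.Dataset.from_tensor_slices([1, 2, 3])
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Put the iterator initializer in local_init_op so it is re-run on every
# worker (re)start, including restarts that restore from a checkpoint.
# Supplying local_init_op replaces the default, so group the default
# local-variable and table initializers back in alongside it.
scaffold = tf.train.Scaffold(
    local_init_op=tf.group(tf.local_variables_initializer(),
                           tf.tables_initializer(),
                           iterator.initializer))

with tf.train.MonitoredTrainingSession(scaffold=scaffold,
                                       checkpoint_dir="/tmp/ckpt") as sess:
    while not sess.should_stop():
        sess.run(next_element)
```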
@mrry Thanks, it works without any hooks 👍. So the solution is to invoke `iterator.string_handle()` before creating the `MonitoredSession`.

Thanks for confirming that that works! We’re definitely still looking for a way to make Datasets work more naturally with `MonitoredSession`, though 😃.

Right, the handles could potentially be collected in the `after_create_session()` method. Annoyingly, you’ll need to call `iterator.string_handle()` on each of the iterators before creating the `MonitoredSession`, and pass the resulting string-valued `tf.Tensor` objects to the hook.
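A sketch of that hook-based handle collection (the feedable-iterator setup and all names are illustrative; assumes the TF 1.3-era `tf.contrib.data` API):

```python
import tensorflow as tf

class HandleCollectorHook(tf.train.SessionRunHook):
    """Evaluates precomputed string_handle tensors once the session exists."""

    def __init__(self, handle_tensors):
        # String-valued tf.Tensor objects, created *before* the MonitoredSession.
        self._handle_tensors = handle_tensors
        self.handles = None

    def after_create_session(self, session, coord):
        # Materialize the handles; they can then be fed to a feedable iterator.
        self.handles = session.run(self._handle_tensors)

# Illustrative setup: two datasets sharing one feedable iterator.
train_iter = tf.contrib.data.Dataset.range(10).make_one_shot_iterator()
eval_iter = tf.contrib.data.Dataset.range(5).make_one_shot_iterator()

# string_handle() must be called before the MonitoredSession is created.
handle_tensors = [train_iter.string_handle(), eval_iter.string_handle()]
handle_ph = tf.placeholder(tf.string, shape=[])
iterator = tf.contrib.data.Iterator.from_string_handle(
    handle_ph, train_iter.output_types, train_iter.output_shapes)
next_element = iterator.get_next()

hook = HandleCollectorHook(handle_tensors)
with tf.train.MonitoredTrainingSession(hooks=[hook]) as sess:
    train_handle, eval_handle = hook.handles
    sess.run(next_element, feed_dict={handle_ph: train_handle})
```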