tensorflow: AttributeError: 'PerReplica' object has no attribute 'begin'
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): b'unknown' 1.13.1 (installed with conda)
- Python version: Python 3.6.8 :: Anaconda, Inc.
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: cuda/9.0.176, cudnn/7.3
- GPU model and memory: 2x Tesla P100-SXM2-16GB (15190 MB usable per device); from the device-creation log:
  Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15190 MB memory) -> physical GPU (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 6.0)
  2019-04-21 19:03:25.539522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15190 MB memory) -> physical GPU (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:87:00.0, compute capability: 6.0)
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)".
Describe the current behavior
When running a tf.estimator.Estimator model that registers a tf.train.SessionRunHook in the evaluation_hooks of its tf.estimator.EstimatorSpec in a distributed environment, the error AttributeError: 'PerReplica' object has no attribute 'begin' is raised at the beginning of evaluation. The error does not occur if I do not register the SessionRunHook in evaluation_hooks. Registering the SessionRunHook in training_hooks does not trigger the error, even in distributed mode.
I ran my Estimator with tf.estimator.train_and_evaluate.
The distribution configuration I used is tf.contrib.distribute.MirroredStrategy.
The full error log is attached at the end.
Describe the expected behavior
Hooks registered in evaluation_hooks should keep the SessionRunHook interface in distribution mode. Somewhere in the Estimator's evaluation code the SessionRunHook is apparently wrapped into a PerReplica object, so MonitoredSession fails when it later tries to call begin() on it.
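For context, anything passed in evaluation_hooks is expected to implement the tf.train.SessionRunHook interface, because MonitoredSession calls begin() on every hook while the evaluation session is being constructed; that is exactly the call that fails once the hook has been wrapped in a PerReplica container. A minimal sketch of such a hook (the class name and log message are made up for illustration):

import tensorflow as tf

class LogStartHook(tf.train.SessionRunHook):
    """Minimal hook showing the interface MonitoredSession relies on."""

    def begin(self):
        # Called once while the session is being constructed; this is the
        # method that is missing on the PerReplica wrapper.
        tf.logging.info("Evaluation session is about to be created.")

    def before_run(self, run_context):
        return None  # no extra fetches requested

    def after_run(self, run_context, run_values):
        pass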
Code to reproduce the issue. Provide a reproducible test case that is the bare minimum necessary to generate the problem.
This is not runnable code on its own, but applying the modification below to one of the Estimator examples should reproduce the problem.
distribution = tf.contrib.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=distribution,
                                    eval_distribute=distribution)
# Example hook; ProfilerHook needs save_steps or save_secs to be set.
hook = tf.train.ProfilerHook(save_steps=100, output_dir=model_dir)

def model_fn(features, labels, mode, params):
    # ... build the model and compute `loss` ...
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss,
                                          evaluation_hooks=[hook])
    # ... handle TRAIN / PREDICT modes ...

# model_dir, params, train_spec and eval_spec are defined elsewhere.
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=model_dir,
                                   config=run_config, params=params)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
Other info / logs. Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Efficient allreduce is not supported for IndexedSlices.
WARNING:tensorflow:Efficient allreduce is not supported for IndexedSlices.
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-04-21 18:41:17.901359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-04-21 18:41:17.901414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-21 18:41:17.901425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
2019-04-21 18:41:17.901432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
2019-04-21 18:41:17.901439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
2019-04-21 18:41:17.902038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15190 MB memory) -> physical GPU (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 6.0)
2019-04-21 18:41:17.902219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15190 MB memory) -> physical GPU (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:87:00.0, compute capability: 6.0)
WARNING:tensorflow:From /home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /tmp/model.ckpt-0
WARNING:tensorflow:From /home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/model.ckpt.
INFO:tensorflow:Initialize strategy
2019-04-21 18:42:04.023667: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-04-21 18:42:05.162460: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x555559b95370
2019-04-21 18:42:05.970327: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x555559ba9960
INFO:tensorflow:loss = 52170.477, step = 0
INFO:tensorflow:global_step/sec: 0.0334123
INFO:tensorflow:loss = 54870.64, step = 1 (29.929 sec)
INFO:tensorflow:Saving checkpoints for 3 into /tmp/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-04-21T09:43:02Z
Traceback (most recent call last):
File "train.py", line 146, in <module>
main()
File "train.py", line 142, in main
use_multi_gpu)
File "train.py", line 83, in train_and_evaluate
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
return self.run_local()
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
saving_listeners=saving_listeners)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
self._config._train_distribute, input_fn, hooks, saving_listeners)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1287, in _actual_train_model_distributed
saving_listeners)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
run_metadata=run_metadata)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
run_metadata=run_metadata)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
raise six.reraise(*original_exc_info)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
return self._sess.run(*args, **kwargs)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1335, in run
run_metadata=run_metadata))
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 582, in after_run
if self._save(run_context.session, global_step):
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 607, in _save
if l.after_save(session, step):
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 517, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 537, in _evaluate
self._evaluator.evaluate_and_export())
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 913, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 469, in evaluate
name=name)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 509, in _actual_eval
return _evaluate()
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 500, in _evaluate
output_dir=self.eval_dir(name))
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1537, in _evaluate_run
config=self._session_config)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/evaluation.py", line 271, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 934, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 636, in __init__
h.begin()
AttributeError: 'PerReplica' object has no attribute 'begin'
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 19 (8 by maintainers)
This has been fixed (https://github.com/tensorflow/estimator/commit/131f54a62ae9ded9057aeb0eb1243d9516373b14). Please test with TF nightly.
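For example, assuming a pip-based setup, the nightly build can be tried in a fresh virtualenv/conda environment (use tf-nightly-gpu for the CUDA build):

pip install tf-nightly
python -c "import tensorflow as tf; print(tf.__version__)"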
Any update on this? It would be great to be able to use evaluation hooks in a distributed setting.