tensorflow: AttributeError: 'PerReplica' object has no attribute 'begin'

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): b'unknown' 1.13.1 (installed with conda)
  • Python version: Python 3.6.8 :: Anaconda, Inc.
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: cuda/9.0.176, cudnn/7.3
  • GPU model and memory: Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15190 MB memory) -> physical GPU (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 6.0) 2019-04-21 19:03:25.539522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15190 MB memory) -> physical GPU (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:87:00.0, compute capability: 6.0)

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)".

Describe the current behavior

When running a tf.estimator.Estimator model whose model_fn registers a tf.train.SessionRunHook in the evaluation_hooks of tf.estimator.EstimatorSpec in a distributed environment, an AttributeError: 'PerReplica' object has no attribute 'begin' is raised at the beginning of evaluation. The error does not happen if I do not register the SessionRunHook in evaluation_hooks. Registering the SessionRunHook in training_hooks does not trigger the error, even in distributed mode.

I ran my Estimator with tf.estimator.train_and_evaluate.

The distribution configuration I used is tf.contrib.distribute.MirroredStrategy.

The whole error log is attached at the end.
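
To make the difference concrete, this is the pattern described above; only the evaluation_hooks variant fails when eval_distribute is set (a minimal sketch, assuming hook, loss and train_op are built as in the reproducer below):

# Attaching the hook to training_hooks works in distributed mode:
spec_train = tf.estimator.EstimatorSpec(
    tf.estimator.ModeKeys.TRAIN, loss=loss, train_op=train_op,
    training_hooks=[hook])

# Attaching the same hook to evaluation_hooks fails at the start of evaluation
# with AttributeError: 'PerReplica' object has no attribute 'begin':
spec_eval = tf.estimator.EstimatorSpec(
    tf.estimator.ModeKeys.EVAL, loss=loss,
    evaluation_hooks=[hook])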

Describe the expected behavior

At some point in the Estimator's evaluation code, the SessionRunHook is wrapped into a PerReplica object. The hook should keep the SessionRunHook interface in distribution mode, so that evaluation_hooks behave the same way as training_hooks.
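
For context, the failure point shown in the traceback below is the loop in MonitoredSession that calls begin() on every hook it receives (a simplified sketch of that call site, not the actual TensorFlow source):

# Simplified sketch of the hook setup in monitored_session.py (see traceback).
# Under eval_distribute, the evaluation hooks arrive wrapped in PerReplica
# containers rather than as plain SessionRunHook objects, so begin() is missing:
for h in hooks:
    h.begin()  # AttributeError: 'PerReplica' object has no attribute 'begin'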

Code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

This is not runnable code on its own, but applying the modification below to one of the Estimator examples should reproduce the problem.

import tensorflow as tf

distribution = tf.contrib.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=distribution,
                                    eval_distribute=distribution)

# Example hook; any SessionRunHook attached to evaluation_hooks triggers the error.
hook = tf.train.ProfilerHook(save_steps=100, output_dir=model_dir)

def model_fn(features, labels, mode, params):
    # ... build the model and compute loss ...
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss,
                                          evaluation_hooks=[hook])
    # ... handle TRAIN and PREDICT modes ...

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=model_dir,
                                   config=run_config, params=params)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
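
A possible workaround until this is fixed (untested sketch): set only train_distribute and leave evaluation undistributed, since the error only shows up when eval_distribute is combined with evaluation_hooks:

# Untested workaround sketch: distribute training only. Without eval_distribute,
# evaluation runs without the strategy and the evaluation_hooks are passed
# through as plain SessionRunHook objects, so no PerReplica wrapping occurs.
distribution = tf.contrib.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=distribution)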

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Efficient allreduce is not supported for IndexedSlices.
WARNING:tensorflow:Efficient allreduce is not supported for IndexedSlices.
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce invoked for batches size = 1 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-04-21 18:41:17.901359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-04-21 18:41:17.901414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-21 18:41:17.901425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 
2019-04-21 18:41:17.901432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y 
2019-04-21 18:41:17.901439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N 
2019-04-21 18:41:17.902038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15190 MB memory) -> physical GPU (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 6.0)
2019-04-21 18:41:17.902219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15190 MB memory) -> physical GPU (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:87:00.0, compute capability: 6.0)
WARNING:tensorflow:From /home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /tmp/model.ckpt-0
WARNING:tensorflow:From /home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/model.ckpt.
INFO:tensorflow:Initialize strategy
2019-04-21 18:42:04.023667: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-04-21 18:42:05.162460: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x555559b95370
2019-04-21 18:42:05.970327: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x555559ba9960
INFO:tensorflow:loss = 52170.477, step = 0
INFO:tensorflow:global_step/sec: 0.0334123
INFO:tensorflow:loss = 54870.64, step = 1 (29.929 sec)
INFO:tensorflow:Saving checkpoints for 3 into /tmp/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-04-21T09:43:02Z
Traceback (most recent call last):
  File "train.py", line 146, in <module>
    main()
  File "train.py", line 142, in main
    use_multi_gpu)
  File "train.py", line 83, in train_and_evaluate
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1287, in _actual_train_model_distributed
    saving_listeners)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1335, in run
    run_metadata=run_metadata))
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 582, in after_run
    if self._save(run_context.session, global_step):
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 607, in _save
    if l.after_save(session, step):
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 517, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 537, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 913, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 469, in evaluate
    name=name)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 509, in _actual_eval
    return _evaluate()
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 500, in _evaluate
    output_dir=self.eval_dir(name))
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1537, in _evaluate_run
    config=self._session_config)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/evaluation.py", line 271, in _evaluate_once
    session_creator=session_creator, hooks=hooks) as session:
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 934, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/8/18IA1142/miniconda3/envs/tacotron2-tf-1.13/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 636, in __init__
    h.begin()
AttributeError: 'PerReplica' object has no attribute 'begin'

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (8 by maintainers)

Most upvoted comments

Any update on this? It would be great to be able to use evaluation hooks in a distributed setting.