tensorflow: Distributed tensorflow worker hangs at TF_CloseSession() when using MonitoredTrainingSession
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL7 and also Mac OS X 10.13.6
- TensorFlow installed from (source or binary): binary (pip install)
- TensorFlow version (use command below): 1.9.0+
- Python version: 2.7 (RHEL7), 3.6 (Mac)
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
- Exact command to reproduce: See below
Describe the problem
When using MonitoredTrainingSession in TensorFlow version 1.9 or higher, I’m seeing the following deadlock/hang (as reported by the hanging-threads pip package) when the context manager exits. Note: I do not see this hang for versions 1.8 or earlier. Also, note that this does not occur if using the older tf.train.Supervisor API.
---------- Thread 140682110711616 hangs ----------
  File "trainer.py", line 102, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "trainer.py", line 67, in main
    print("step: {}".format(step))
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 689, in __exit__
    self._close_internal(exception_type)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 726, in _close_internal
    self._sess.close()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 974, in close
    self._sess.close()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1121, in close
    _WrappedSession.close(self)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 974, in close
    self._sess.close()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 974, in close
    self._sess.close()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 690, in close
    tf_session.TF_CloseSession(self._session)
The code that generates this is based on the Distributed Tensorflow documentation (with a trivial/dummy model). I start one PS node and two worker nodes on a single box as follows:
rm -rf /tmp/train_logs; \
python trainer.py \
    --ps_hosts=localhost:2222 \
    --worker_hosts=localhost:2223,localhost:2224 \
    --job_name=ps --task_index=0

python trainer.py \
    --ps_hosts=localhost:2222 \
    --worker_hosts=localhost:2223,localhost:2224 \
    --job_name=worker --task_index=0

python trainer.py \
    --ps_hosts=localhost:2222 \
    --worker_hosts=localhost:2223,localhost:2224 \
    --job_name=worker --task_index=1
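For readers without the script at hand, here is a minimal sketch of how a script modeled on the Distributed TensorFlow documentation typically turns these flags into a cluster definition. The actual trainer.py is not shown in this issue, so the function name `parse_cluster` and the flag defaults are assumptions made for illustration only.

```python
import argparse


def parse_cluster(argv):
    """Parse the command-line flags used above into a cluster dict.

    Hypothetical helper: the real trainer.py may be organized differently.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--ps_hosts", type=str, default="")
    parser.add_argument("--worker_hosts", type=str, default="")
    parser.add_argument("--job_name", type=str, default="")
    parser.add_argument("--task_index", type=int, default=0)
    flags, unparsed = parser.parse_known_args(argv)
    # Map each job name to its comma-separated list of host:port addresses.
    cluster = {
        "ps": flags.ps_hosts.split(","),
        "worker": flags.worker_hosts.split(","),
    }
    return flags, cluster, unparsed


flags, cluster, _ = parse_cluster([
    "--ps_hosts=localhost:2222",
    "--worker_hosts=localhost:2223,localhost:2224",
    "--job_name=worker",
    "--task_index=1",
])
print(cluster)
```

In a TF 1.x script, a dict like `cluster` would usually be passed to `tf.train.ClusterSpec`, with the job name and task index passed to `tf.train.Server`.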
I’ve been able to reproduce this quite consistently on:
- Mac 10.13.6, Python 3.6, TensorFlow 1.10
- RHEL7, Python 2.7, TensorFlow 1.9
And the symptom goes away when switching to 1.8 or earlier.
About this issue
- State: closed
- Created 6 years ago
- Comments: 19 (11 by maintainers)
Commits related to this issue
- #21745: set timeout for closing worker session — committed to Rachelmorrell/tensorflow by lxl910915 6 years ago
- Sort-of-graceful shutdown Shut down parameter servers when all workers are shut down by giving each parameter server a queue and having each worker push to each ps's queue when finished. Then to ensu... — committed to cwindolf/ffn by deleted user 5 years ago
[Update: On closer inspection, the issue I encountered seems slightly different from the one reported in this issue, but I’ll leave my post below in case it’s helpful to someone.]
I had a similar issue with asynchronous MNIST training: once the first worker finished, the remaining workers would hang. I solved this by disabling communication between workers (each worker only needs to talk to the ps): https://github.com/linkedin/TonY/pull/120/files
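One common way to restrict a worker to the ps and itself in TF 1.x is the `device_filters` field of `tf.ConfigProto`. The sketch below assumes that approach; the contents of the linked change are not reproduced here, and the helper name `worker_device_filters` is hypothetical.

```python
def worker_device_filters(task_index):
    """Return device filters so a worker sees only the ps job and itself.

    Hypothetical helper; assumes the job names "ps" and "worker" used in
    the commands earlier in this issue.
    """
    return ["/job:ps", "/job:worker/task:%d" % task_index]


filters = worker_device_filters(1)
print(filters)

# Assumed TF 1.x usage (not executed here):
#   config = tf.ConfigProto(device_filters=filters)
#   with tf.train.MonitoredTrainingSession(
#           master=server.target, is_chief=(task_index == 0),
#           config=config) as sess:
#       ...
```

With filters like these, a worker's session should ignore devices on other workers, so it would have no reason to contact a peer that has already shut down.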
@erwa thank you, that works fine
I have a similar problem when two workers don't close at the same time.
In the following code, I deliberately delayed one session from closing, which can prevent the program from terminating:
I have also set GRPC_VERBOSITY to DEBUG, and I see that worker 1 still tries to contact worker 0 even after it has closed.
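For anyone wanting to collect the same logs, a minimal sketch of enabling gRPC debug output from inside the script. It assumes the variable needs to be set before TensorFlow initializes gRPC, so it is placed ahead of the import; exporting it in the shell before launching the script should work as well.

```python
import os

# Set before importing TensorFlow so gRPC can pick it up when it initializes
# (assumption: gRPC reads this variable at initialization time).
os.environ["GRPC_VERBOSITY"] = "DEBUG"

# import tensorflow as tf  # the TensorFlow import would follow here

print(os.environ["GRPC_VERBOSITY"])
```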