tensorflow: OSError: [Errno 9] Bad file descriptor raised on program exit
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
- Python version: Python 3.8.5
- CUDA/cuDNN version: 11.2 / 8.1.0.77-1
- GPU model and memory: P100
Describe the current behavior
When MirroredStrategy's scope() is used as a context manager, Python raises an ignored exception on program exit:
Exception ignored in: <function Pool.__del__ at 0x7f21f942e4c0>
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
self._change_notifier.put(None)
File "/root/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
Describe the expected behavior
Python exits without the aforementioned exception. (In my testing, there is no such exception raised on TensorFlow 2.4.0, so this seems new in TensorFlow 2.5.0.)
- Do you want to contribute a PR? (yes/no): No
Standalone code to reproduce the issue
import tensorflow

def f():
    strategy = tensorflow.distribute.MirroredStrategy()
    with strategy.scope():
        tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
            tensorflow.keras.layers.Input(shape=(88, 88, 3))
        )

f()
Removing the strategy.scope() causes the program to exit without the ignored exception, as does removing the function definition (i.e., getting rid of def f() and f(), and invoking at the top level).
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 11
- Comments: 32 (11 by maintainers)
Commits related to this issue
- Fix Yolo v4 example non-fatal teardown error - fixes lack of multiprocess thread pool teardown in TF Mirrored strategy as stated in https://github.com/tensorflow/tensorflow/issues/50487 Signed-off... — committed to JanuszL/DALI by JanuszL 2 years ago
- Fix Yolo v4 example non-fatal teardown error (#3739) - fixes lack of multiprocess thread pool teardown in TF Mirrored strategy as stated in https://github.com/tensorflow/tensorflow/issues/50487 ... — committed to NVIDIA/DALI by JanuszL 2 years ago
- Update YOLO example for the latest to support the latest TensorFlow version - fixes the issue with the latest TensorFlow version and YOLO example that results in `AttributeError: 'CollectiveAllRe... — committed to JanuszL/DALI by JanuszL 2 years ago
- Update YOLO example for the latest to support the latest TensorFlow version (#4522) - fixes the issue with the latest TensorFlow version and YOLO example that results in `AttributeError: 'Colle... — committed to NVIDIA/DALI by JanuszL 2 years ago
- Fixing OSError at end of trainning MirroredStrategy creates a multiprocessing ThreadPool, but doesn't close it before the program ends, so its resources aren't properly cleaned up and it errors on sh... — committed to Bruno-Messias/yoloret by Bruno-Messias a year ago
- [ci/docker/ml] Upgrade tensorflow to 2.11.0 (#32511) We are currently at 2.9.1 which runs into some cosmetic errors (#25142, tensorflow/tensorflow#50487 (comment)). This PR upgrades to the latest Ten... — committed to ray-project/ray by krfricke a year ago
This happens in TF 2.7 too, with Python 3.9.
I think it’s because MirroredStrategy creates a multiprocessing ThreadPool, but doesn’t close it before the program ends, so its resources aren’t properly cleaned up and it errors on shutdown.
You can explicitly close the pool on exit using:
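A sketch of that workaround (it relies on the private _extended._collective_ops._pool attribute, so the exact name is an assumption and may change between TF versions):

import atexit

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# MirroredStrategy keeps a multiprocessing ThreadPool on its collective ops;
# registering close() with atexit tears the pool down before the interpreter
# starts invalidating file descriptors during shutdown.
atexit.register(strategy._extended._collective_ops._pool.close)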
This should prevent the error for now (until there is a fix).
I tried changing the MirroredStrategy to OneDeviceStrategy and the exception went away, so I'm not sure whether the issue is caused by Python, TensorFlow, or a combination of the two.
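For reference, the swap is a one-liner (the device string below is just an example):

import tensorflow

# Pin all variables and computation to a single device instead of mirroring;
# with this strategy the ignored OSError did not appear on exit.
strategy = tensorflow.distribute.OneDeviceStrategy(device="/gpu:0")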
Same issue in TF v2.6: an OSError is raised on program exit if strategy.scope() is called within a function, whereas the same code at the top level is fine (see the sketch below). Also tested both cases with TF v2.4, and they ran fine.
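A minimal sketch of the two variants (illustrative only, modeled on the snippet from the original report; run each variant as its own script to compare shutdown behaviour):

import tensorflow

# Variant 1: scope() entered inside a function -- the ignored OSError shows up on exit.
def f():
    strategy = tensorflow.distribute.MirroredStrategy()
    with strategy.scope():
        tensorflow.keras.layers.Dense(1)(tensorflow.keras.layers.Input(shape=(4,)))

f()

# Variant 2: the same calls at module level -- the program exits cleanly.
strategy = tensorflow.distribute.MirroredStrategy()
with strategy.scope():
    tensorflow.keras.layers.Dense(1)(tensorflow.keras.layers.Input(shape=(4,)))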
Hi, sorry for the inconvenience, but I have now tried with a fresh virtual environment and the error just disappeared, so I think the issue can be closed. The virtual environment that throws the exception has had many different TensorFlow versions installed, from 2.3 to 2.9, so maybe some outdated dependency is causing the error. In case you want to reproduce it, my versions are: TensorFlow 2.9.1, Python 3.8.13, OS: Ubuntu 18.04, and the output of pip freeze:
Not for me. Using TensorFlow 2.9.1, the exception still shows up when exiting the interpreter:
The other interesting thing is that this only happens (for me at least) on py38 and py39. It runs just fine on py37, so maybe this is a Python bug. Perhaps this one? https://bugs.python.org/issue39995
Hi, @npanpaliya It works! I tried this way before and it didn't work, but after you pointed it out I checked again and found that a backslash was missing before I passed --distribution_strategy. Stupid me > <. Thanks greatly for your help!
@suchunxie - You can specify strategy here https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py#L207. “one_device” is supported https://github.com/suchunxie/models/blob/65e571fdc903873362e59abe0aeec5c8018da750/official/common/distribute_utils.py#L158.
In my case, I had to specify --distribution_strategy=one_device here in my tests: https://github.com/open-ce/tensorflow-feedstock/blob/main/tests/open-ce-tests.yaml#L22
The same issue occurs with MultiWorkerMirroredStrategy (when using it on one machine, as recommended here), on Python 3.9.10 and TF 2.7. The fix is basically the same as the one above, but you have to close two pools:
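A sketch of that two-pool workaround (the _extended._cross_device_ops._pool and _extended._host_cross_device_ops._pool names are private internals and an assumption on my part; they may differ between TF versions):

import atexit

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()
# The multi-worker strategy creates two collective-ops objects (one for the
# local accelerator devices, one for the host), each owning its own
# multiprocessing ThreadPool, so both pools are closed at interpreter exit.
atexit.register(strategy._extended._cross_device_ops._pool.close)
atexit.register(strategy._extended._host_cross_device_ops._pool.close)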