tensorflow: OSError: [Errno 9] Bad file descriptor raised on program exit

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
  • Python version: Python 3.8.5
  • CUDA/cuDNN version: 11.2 / 8.1.0.77-1
  • GPU model and memory: P100

Describe the current behavior

When entering a MirroredStrategy scope (with strategy.scope():) inside a function, Python raises an ignored exception on program exit:

Exception ignored in: <function Pool.__del__ at 0x7f21f942e4c0>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/root/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Describe the expected behavior

Python exits without the aforementioned exception. (In my testing, there is no such exception raised on TensorFlow 2.4.0, so this seems new in TensorFlow 2.5.0.)

Contributing

  • Do you want to contribute a PR? (yes/no): No

Standalone code to reproduce the issue

import tensorflow


def f():
    strategy = tensorflow.distribute.MirroredStrategy()
    with strategy.scope():
        tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
            tensorflow.keras.layers.Input(shape=(88, 88, 3))
        )


f()

Removing the strategy.scope() call causes the program to exit without the ignored exception, as does removing the function definition (i.e., getting rid of def f() and f(), and running the same code at the top level).
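
For reference, the top-level variant that exits cleanly (per the observation above) looks like this:

import tensorflow

# Same code as the repro, but at module scope rather than inside f();
# with this layout the ignored exception does not appear on exit.
strategy = tensorflow.distribute.MirroredStrategy()
with strategy.scope():
    tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
        tensorflow.keras.layers.Input(shape=(88, 88, 3))
    )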

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 11
  • Comments: 32 (11 by maintainers)

Most upvoted comments

This happens in TF 2.7 too, with Python 3.9.

I think it’s because MirroredStrategy creates a multiprocessing ThreadPool, but doesn’t close it before the program ends, so its resources aren’t properly cleaned up and it errors on shutdown.

You can explicitly close the pool on exit using:

import atexit

import tensorflow as tf

# ... set up and use the strategy as usual ...
strategy = tf.distribute.MirroredStrategy()

# Close the strategy's internal ThreadPool before interpreter teardown.
atexit.register(strategy._extended._collective_ops._pool.close)  # type: ignore

This should prevent the error for now (until there is a proper fix).

I tried changing the MirroredStrategy to OneDeviceStrategy and the exception went away, so I'm not sure whether this is a Python issue, a TF issue, or a combination of the two.
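
For anyone who can live with a single device, the swap looks like this (a minimal sketch; the device string is an assumption and should match your setup):

import tensorflow as tf

# Per the observation above, the ignored exception does not occur
# when OneDeviceStrategy is used instead of MirroredStrategy.
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")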

Same issue in TF v2.6: OSError on program exit if strategy.scope() is called within a function.

The following code causes OSError on exit.

import tensorflow as tf

def main():
  strategy = tf.distribute.MirroredStrategy()
  print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
  with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
      loss=tf.keras.losses.MSE,
      optimizer=tf.keras.optimizers.Adam(),
      metrics=['accuracy']
    )

  print('\nDONE\n')

if __name__ == '__main__':
  main()

with the following output:

2021-08-27 12:00:25.516889: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-27 12:00:32.832857: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.832944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9659 MB memory:  -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5
2021-08-27 12:00:32.834864: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.834898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9659 MB memory:  -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5

Number of devices: 2

DONE

Exception ignored in: <function Pool.__del__ at 0x7fbecd304040>
Traceback (most recent call last):
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Whereas the one below is fine

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
  model.compile(
    loss=tf.keras.losses.MSE,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
  )

print('\nDONE\n')

Also tested the same code snippet with tf v2.4 and it ran fine in both cases.

Hi, sorry for the inconvenience, but I've now tried with a fresh virtual environment and the error just disappeared, so I think the issue can be closed. The virtual environment that throws the exception has had many different TensorFlow versions installed, from 2.3 to 2.9, so maybe some outdated dependency is causing the error.

In case you want to reproduce it, my versions are:

  • TensorFlow version: 2.9.1
  • Python version: 3.8.13
  • OS: Ubuntu 18.04

And the output of pip freeze:

absl-py==1.1.0
aiohttp==3.8.1
aiosignal==1.2.0
antlr4-python3-runtime==4.8
astunparse==1.6.3
async-timeout==4.0.1
atomicwrites==1.4.0
attrs==21.2.0
backcall==0.2.0
bitarray==2.3.7
blessed==1.19.0
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.7
clang==5.0
click==8.0.3
colorama==0.4.4
Cython==0.29.24
dataclasses==0.6
datasets==1.16.1
decorator==5.1.0
dill==0.3.4
enlighten==1.10.1
fairseq==0.10.2
fastspell==0.1.5
fasttext==0.9.2
filelock==3.3.2
flatbuffers==1.12
frozenlist==1.2.0
fsspec==2021.11.1
ftfy==6.1.1
fuzzywuzzy==0.18.0
gast==0.4.0
gensim==4.1.2
google-auth==1.35.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.41.1
h5py==3.1.0
hanzidentifier==1.0.2
huggingface-hub==0.1.0
hunspell==0.5.5
hydra-core==1.1.1
idna==3.3
importlib-resources==5.4.0
ipython==7.29.0
jedi==0.18.0
joblib==0.14.1
keras==2.9.0
Keras-Preprocessing==1.1.2
latexcodec==2.0.1
libclang==13.0.0
Markdown==3.3.4
matplotlib-inline==0.1.3
monocleaner==1.0
more-itertools==8.10.0
mtdata==0.3.1
multidict==5.2.0
multiprocess==0.70.12.2
nltk==3.6.5
numpy==1.23.0
oauthlib==3.1.1
omegaconf==2.1.1
opt-einsum==3.3.0
packaging==21.2
pandas==1.3.5
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.4.0
pluggy==0.13.1
portalocker==2.3.0
prefixed==0.3.2
prompt-toolkit==3.0.22
protobuf==3.19.1
psutil==5.8.0
ptyprocess==0.7.0
py==1.10.0
pyarrow==6.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.8.1
pybtex==0.24.0
pycld2==0.31
pycparser==2.21
Pygments==2.10.0
pyparsing==2.4.7
pypinyin==0.46.0
pytest==5.1.2
python-dateutil==2.8.2
python-Levenshtein==0.12.2
pytz==2021.3
PyYAML==5.4.1
regex==2022.3.2
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
ruamel.yaml==0.17.17
ruamel.yaml.clib==0.2.6
sacrebleu==2.1.0
sacremoses==0.0.43
scikit-learn==0.22.1
scipy==1.4.1
sentence-transformers==2.1.0
sentencepiece==0.1.94
six==1.15.0
smart-open==5.2.1
tabulate==0.8.9
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
threadpoolctl==3.0.0
tokenizers==0.12.1
toolwrapper==0.4.1
torch==1.10.1
torch-train==0.0.3
torchsummary==1.5.1
torchvision==0.11.2
tqdm==4.62.3
traitlets==5.1.1
transformers==4.20.1
typing-extensions==3.7.4.3
Unidecode==1.2.0
urllib3==1.26.7
wcwidth==0.2.5
Werkzeug==2.0.2
wrapt==1.12.1
xxhash==2.0.2
yarl==1.7.2
zhon==1.1.5
zipp==3.7.0

Not for me. Using TensorFlow 2.9.1, the exception still shows up when exiting the interpreter:

In [1]: import tensorflow
   ...: def f():
   ...:    strategy = tensorflow.distribute.MirroredStrategy()
   ...:    with strategy.scope():
   ...:       tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
   ...:             tensorflow.keras.layers.Input(shape=(88, 88, 3))
   ...:         )
   ...: f()
2022-07-29 12:54:45.169943: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 12:54:47.305006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 429 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5
2022-07-29 12:54:47.305948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9651 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:18:00.0, compute capability: 7.5
2022-07-29 12:54:47.306459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 427 MB memory:  -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5
2022-07-29 12:54:47.306939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 429 MB memory:  -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:b4:00.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')

In [2]:
Do you really want to exit ([y]/n)?
Exception ignored in: <function Pool.__del__ at 0x7ff160d75c10>
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

The other interesting thing is that this only happens (for me at least) on py38 and py39. It runs just fine on py37, so maybe this is a Python bug. Perhaps this one? https://bugs.python.org/issue39995
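
That hypothesis can be poked at without TensorFlow; a minimal sketch (assuming the same shutdown ordering on Python 3.8/3.9 — it may or may not trip the error on your interpreter):

import multiprocessing.pool

def f():
    # Create a ThreadPool and never close it, mirroring what
    # MirroredStrategy does internally; the pool is only reclaimed
    # during interpreter shutdown.
    multiprocessing.pool.ThreadPool(2)

f()
# At exit, Pool.__del__ calls self._change_notifier.put(None); if the
# notifier pipe's file descriptor has already been closed by then, this
# surfaces as the same ignored OSError: [Errno 9] Bad file descriptor.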

Hi @npanpaliya, it works! I had tried this before without success, but after you pointed it out I checked again and found a backslash was missing before I passed --distribution_strategy. Silly me > <. Thanks greatly for your help!

In my case, I had to specify --distribution_strategy=one_device in my tests here: https://github.com/open-ce/tensorflow-feedstock/blob/main/tests/open-ce-tests.yaml#L22

The same issue occurs with MultiWorkerMirroredStrategy (when using it on one machine, as recommended here), on Python 3.9.10 and TF 2.7.

The fix is basically the same as the atexit workaround above, but you have to close two pools:

import atexit

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# MultiWorkerMirroredStrategy keeps two internal pools; close both.
atexit.register(strategy._extended._cross_device_ops._pool.close)  # type: ignore
atexit.register(strategy._extended._host_cross_device_ops._pool.close)  # type: ignore
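
Since all of these workarounds reach into private TF internals that move between versions, a best-effort variant (a sketch; the attribute names are the ones used in the comments above and may be absent in other builds) can cover both strategies:

import atexit

def register_pool_cleanup(strategy):
    # Best-effort cleanup: register a close() for every internal
    # multiprocessing pool the strategy created, so none is left for
    # Pool.__del__ to tear down during interpreter shutdown. All of
    # these attributes are private TF internals, hence the guards.
    for name in ("_collective_ops", "_cross_device_ops", "_host_cross_device_ops"):
        ops = getattr(strategy._extended, name, None)
        pool = getattr(ops, "_pool", None)
        if pool is not None:
            atexit.register(pool.close)

Call register_pool_cleanup(strategy) right after constructing either MirroredStrategy or MultiWorkerMirroredStrategy.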