h5py: Potential deadlock from some combination of `h5py`, `gc`, and `fsspec`

Version info:

Summary of the h5py configuration
---------------------------------

h5py    3.6.0
HDF5    1.12.1
Python  3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) 
[GCC 9.4.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.20.0
cython (built with) 0.29.24
numpy (built against) 1.18.5
HDF5 (built against) 1.12.1

This script creates a deadlock, typically by the 12th or so iteration:

import fsspec
import h5netcdf
import h5py


def main():
    url = "https://pangeo.blob.core.windows.net/pangeo-public/hurs_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc"
    for i in range(100):
        with fsspec.open(url).open() as o:
            h5netcdf.core.File(o, mode="r", decode_vlen_strings=True)

        print("--------------------------------------", i)


if __name__ == "__main__":
    main()

The notable features are:

  1. It uses fsspec / aiohttp to open the URL and provide a file-like object. fsspec does some things with threads and the asyncio event loop that I don’t really understand (cc @martindurant)
  2. It uses h5netcdf to open the file. Apologies for not having a completely minimal example, but I haven’t been able to reproduce the hang with just h5py. https://github.com/dask/dask/issues/7547#issuecomment-872563337 indicates that this might be interacting with CPython’s garbage collector, so it’s possibele that h5netcdf is creating the circular references necessary to trigger the deadlock

Here’s the traceback when I interrupt it:

(notebook) jovyan@jupyter-taugspurger-40microsoft-2ecom:~/cmip6$ $ python hang.py 
-------------------------------------- 0
-------------------------------------- 1
-------------------------------------- 2
-------------------------------------- 3
-------------------------------------- 4
^CTraceback (most recent call last):
  File "hang.py", line 16, in <module>
    main()
  File "hang.py", line 10, in main
    h5netcdf.core.File(o, mode="r", decode_vlen_strings=True)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/h5netcdf/core.py", line 780, in __init__
    super(File, self).__init__(self, self._h5path)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/h5netcdf/core.py", line 309, in __init__
    self._current_dim_sizes[k] = self._determine_current_dimension_size(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/h5netcdf/core.py", line 382, in _determine_current_dimension_size
    name = _name_from_dimension(var_d)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/h5netcdf/core.py", line 51, in _name_from_dimension
    return dim[0].name.split("/")[-1]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/h5py/_hl/dims.py", line 74, in __getitem__
    h5ds.iterate(self._id, self._dimension, scales.append, 0)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5ds.pyx", line 167, in h5py.h5ds.iterate
  File "h5py/h5fd.pyx", line 162, in h5py.h5fd.H5FD_fileobj_read
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/spec.py", line 1575, in readinto
    data = self.read(out.nbytes)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/implementations/http.py", line 566, in read
    return super().read(length)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/spec.py", line 1565, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/caching.py", line 391, in _fetch
    self.cache = self.fetcher(start, bend)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py", line 91, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/fsspec/asyn.py", line 59, in sync
    if event.wait(1):
  File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 558, in wait
    signaled = self._cond.wait(timeout)
  File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

Happy to provide additional info / do more debugging, but I’m at a bit of a loss right now.

I also tried to dump the gc debug info with gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_COLLECTABLE | gc.DEBUG_UNCOLLECTABLE | gc.DEBUG_LEAK). It’s pretty verbose and I’m not sure what to make of it.

gc: collecting generation 0...
gc: objects in each generation: 747 3197 24898
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 731 3641 24889
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 637 4012 24889
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 746 4268 24889
gc: objects in permanent generation: 0
gc: collectable <_CountingAttr 0x7fbe6595b2c0>
gc: collectable <dict 0x7fbe65947dc0>
gc: collectable <type 0x55d947f89c80>
gc: collectable <getset_descriptor 0x7fbe65959340>
gc: collectable <getset_descriptor 0x7fbe65959480>
gc: collectable <tuple 0x7fbe659592c0>
gc: collectable <_CountingAttr 0x7fbe29654f40>
gc: collectable <dict 0x7fbe65959300>
gc: collectable <type 0x55d947f8d2a0>
gc: collectable <getset_descriptor 0x7fbe2972c340>
gc: collectable <getset_descriptor 0x7fbe6599f740>
gc: collectable <tuple 0x7fbe65965900>
gc: collectable <_CountingAttr 0x7fbe6595b4a0>
gc: collectable <_CountingAttr 0x7fbe6595b540>
gc: collectable <_CountingAttr 0x7fbe6595b5e0>
gc: collectable <dict 0x7fbe65959540>
gc: collectable <type 0x55d947f8e590>
gc: collectable <getset_descriptor 0x7fbe29af18c0>
gc: collectable <getset_descriptor 0x7fbe6596a300>
gc: collectable <tuple 0x7fbe65965b40>
gc: collectable <_CountingAttr 0x7fbe6595b860>
gc: collectable <dict 0x7fbe659659c0>
gc: collectable <type 0x55d947f8fe40>
gc: collectable <getset_descriptor 0x7fbe6596ac40>
gc: collectable <getset_descriptor 0x7fbe6596ac80>
gc: collectable <tuple 0x7fbe6596a540>
gc: collectable <_CountingAttr 0x7fbe6595b9a0>
gc: collectable <dict 0x7fbe6596a080>
gc: collectable <type 0x55d947f90de0>
gc: collectable <getset_descriptor 0x7fbe2984f300>
gc: collectable <getset_descriptor 0x7fbe6596f3c0>
gc: collectable <tuple 0x7fbe6596a100>
gc: collectable <_CountingAttr 0x7fbe6595bae0>
gc: collectable <dict 0x7fbe6596a380>
gc: collectable <type 0x55d947f91fd0>
gc: collectable <getset_descriptor 0x7fbe6596fb40>
gc: collectable <getset_descriptor 0x7fbe6596fb80>
gc: collectable <tuple 0x7fbe6596f180>
gc: done, 38 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 1...
gc: objects in each generation: 816 4758 24889
gc: objects in permanent generation: 0
gc: collectable <_CountingAttr 0x7fbe6595bc20>
gc: collectable <_CountingAttr 0x7fbe6595bcc0>
gc: collectable <dict 0x7fbe6596f8c0>
gc: collectable <type 0x55d947f93f90>
gc: collectable <getset_descriptor 0x7fbe29e25840>
gc: collectable <getset_descriptor 0x7fbe2999f500>
gc: collectable <tuple 0x7fbe6596fdc0>
gc: collectable <_CountingAttr 0x7fbe658fa0e0>
gc: collectable <_CountingAttr 0x7fbe658fa180>
gc: collectable <_CountingAttr 0x7fbe658fa220>
gc: collectable <dict 0x7fbe65973240>
gc: collectable <type 0x55d947f98b90>
gc: collectable <getset_descriptor 0x7fbe658fe1c0>
gc: collectable <getset_descriptor 0x7fbe658fe200>
gc: collectable <tuple 0x7fbe65973b40>
gc: collectable <_CountingAttr 0x7fbe658fa7c0>
gc: collectable <_CountingAttr 0x7fbe658fa860>
gc: collectable <_CountingAttr 0x7fbe658fa900>
gc: collectable <_CountingAttr 0x7fbe658fa9a0>
gc: collectable <dict 0x7fbe658fe780>
gc: collectable <type 0x55d947f9c080>
gc: collectable <getset_descriptor 0x7fbe299c2d40>
gc: collectable <getset_descriptor 0x7fbe299c2e80>
gc: collectable <tuple 0x7fbe65901740>
gc: done, 24 unreachable, 0 uncollectable, 0.0009s elapsed
gc: collecting generation 0...
gc: objects in each generation: 662 0 29857
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 750 490 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 694 972 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 724 1249 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 705 1895 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 705 2166 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 683 2421 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 798 2624 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 852 2937 29832
gc: objects in permanent generation: 0
gc: collectable <dict 0x7fbe6568bbc0>
gc: collectable <type 0x55d94802dbe0>
gc: collectable <getset_descriptor 0x7fbe6568bc00>
gc: collectable <getset_descriptor 0x7fbe6568bcc0>
gc: collectable <tuple 0x7fbe65669ec0>
gc: collectable <dict 0x7fbe6561c680>
gc: collectable <type 0x55d948011050>
gc: collectable <getset_descriptor 0x7fbe6561cf00>
gc: collectable <getset_descriptor 0x7fbe6561cf40>
gc: collectable <tuple 0x7fbe6561ca00>
gc: done, 10 unreachable, 0 uncollectable, 0.0002s elapsed
gc: collecting generation 0...
gc: objects in each generation: 777 3456 29832
gc: objects in permanent generation: 0
gc: collectable <dict 0x7fbe65626680>
gc: collectable <type 0x55d94803c5c0>
gc: collectable <getset_descriptor 0x7fbe6562b380>
gc: collectable <getset_descriptor 0x7fbe65621bc0>
gc: collectable <tuple 0x7fbe65621800>
gc: done, 5 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 834 3924 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 1...
gc: objects in each generation: 853 4500 29832
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0011s elapsed
gc: collecting generation 0...
gc: objects in each generation: 912 0 34707
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 652 530 34707
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 815 897 34707
gc: objects in permanent generation: 0
gc: collectable <_CountingAttr 0x7fbe654505e0>
gc: collectable <dict 0x7fbe653e6900>
gc: collectable <type 0x55d9480741f0>
gc: collectable <getset_descriptor 0x7fbe299c2e40>
gc: collectable <getset_descriptor 0x7fbe653e9180>
gc: collectable <tuple 0x7fbe6545d100>
gc: collectable <dict 0x7fbe653e9200>
gc: collectable <type 0x55d948075e60>
gc: collectable <getset_descriptor 0x7fbe653e9b80>
gc: collectable <getset_descriptor 0x7fbe653e9bc0>
gc: collectable <tuple 0x7fbe653e6680>
gc: done, 11 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 742 1406 34699
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 795 1789 34690
gc: objects in permanent generation: 0
gc: collectable <dict 0x7fbe65382d00>
gc: collectable <type 0x55d948082290>
gc: collectable <getset_descriptor 0x7fbe65382d80>
gc: collectable <getset_descriptor 0x7fbe65382dc0>
gc: collectable <tuple 0x7fbe656f18c0>
gc: collectable <dict 0x7fbe65382ec0>
gc: collectable <type 0x55d94808dbb0>
gc: collectable <getset_descriptor 0x7fbe65388980>
gc: collectable <getset_descriptor 0x7fbe653889c0>
gc: collectable <tuple 0x7fbe656f1680>
gc: collectable <dict 0x7fbe65388140>
gc: collectable <type 0x55d94808f000>
gc: collectable <getset_descriptor 0x7fbe6538e200>
gc: collectable <getset_descriptor 0x7fbe6538e240>
gc: collectable <tuple 0x7fbe653708c0>
gc: collectable <dict 0x7fbe653887c0>
gc: collectable <type 0x55d948090450>
gc: collectable <getset_descriptor 0x7fbe6538ea00>
gc: collectable <getset_descriptor 0x7fbe6538ea40>
gc: collectable <tuple 0x7fbe65370400>
gc: done, 20 unreachable, 0 uncollectable, 0.0002s elapsed
gc: collecting generation 0...
gc: objects in each generation: 753 2343 34689
gc: objects in permanent generation: 0
gc: collectable <dict 0x7fbe6538e300>
gc: collectable <type 0x55d948092d30>
gc: collectable <getset_descriptor 0x7fbe65396a40>
gc: collectable <getset_descriptor 0x7fbe65396a80>
gc: collectable <tuple 0x7fbe65372b00>
gc: collectable <type 0x55d9480941a0>
gc: collectable <getset_descriptor 0x7fbe65303480>
gc: collectable <dict 0x7fbe6538eac0>
gc: collectable <getset_descriptor 0x7fbe653034c0>
gc: collectable <tuple 0x7fbe65372b80>
gc: collectable <type 0x55d948095110>
gc: collectable <getset_descriptor 0x7fbe65303c00>
gc: collectable <dict 0x7fbe6538eb00>
gc: collectable <getset_descriptor 0x7fbe65303c40>
gc: collectable <tuple 0x7fbe65303900>
gc: collectable <type 0x55d9480962d0>
gc: collectable <getset_descriptor 0x7fbe6530b340>
gc: collectable <dict 0x7fbe65396280>
gc: collectable <getset_descriptor 0x7fbe6530b380>
gc: collectable <tuple 0x7fbe6530b040>
gc: collectable <type 0x55d948097490>
gc: collectable <getset_descriptor 0x7fbe6530ba40>
gc: collectable <dict 0x7fbe65396b40>
gc: collectable <getset_descriptor 0x7fbe6530ba80>
gc: collectable <tuple 0x7fbe6530b740>
gc: collectable <type 0x55d948098650>
gc: collectable <getset_descriptor 0x7fbe65312180>
gc: collectable <dict 0x7fbe653039c0>
gc: collectable <getset_descriptor 0x7fbe653121c0>
gc: collectable <tuple 0x7fbe6530be40>
gc: collectable <dict 0x7fbe6530b800>
gc: collectable <type 0x55d948099810>
gc: collectable <getset_descriptor 0x7fbe65312880>
gc: collectable <getset_descriptor 0x7fbe653128c0>
gc: collectable <tuple 0x7fbe65312580>
gc: collectable <dict 0x7fbe6530bb40>
gc: collectable <type 0x55d947bf8810>
gc: collectable <getset_descriptor 0x7fbe653151c0>
gc: collectable <getset_descriptor 0x7fbe65315200>
gc: collectable <tuple 0x7fbe65312ac0>
gc: done, 40 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 898 2900 34648
gc: objects in permanent generation: 0
gc: collectable <dict 0x7fbe65312940>
gc: collectable <type 0x55d94809e190>
gc: collectable <getset_descriptor 0x7fbe6531b500>
gc: collectable <getset_descriptor 0x7fbe6531b540>
gc: collectable <tuple 0x7fbe65315e40>
gc: collectable <dict 0x7fbe65315280>
gc: collectable <type 0x55d94809f380>
gc: collectable <getset_descriptor 0x7fbe6531be40>
gc: collectable <getset_descriptor 0x7fbe6531be80>
gc: collectable <tuple 0x7fbe6531b7c0>
gc: collectable <dict 0x7fbe65375080>
gc: collectable <type 0x55d9480a0fe0>
gc: collectable <getset_descriptor 0x7fbe6536ddc0>
gc: collectable <getset_descriptor 0x7fbe65372f80>
gc: collectable <tuple 0x7fbe65629440>
gc: done, 15 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 504 3559 34646
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 788 3925 34646
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 707 4336 34646
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 633 4317 34646
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 1...
gc: objects in each generation: 1002 4639 34646
gc: objects in permanent generation: 0
gc: collectable <type 0x55d948071900>
gc: collectable <tuple 0x7fbe653e6540>
gc: collectable <dict 0x7fbe653e6080>
gc: collectable <getset_descriptor 0x7fbe653e65c0>
gc: collectable <getset_descriptor 0x7fbe653e6780>
gc: collectable <type 0x55d9480918c0>
gc: collectable <tuple 0x7fbe65372a80>
gc: collectable <dict 0x7fbe65388ac0>
gc: collectable <getset_descriptor 0x7fbe65396180>
gc: collectable <getset_descriptor 0x7fbe653961c0>
gc: collectable <type 0x55d94809cfa0>
gc: collectable <tuple 0x7fbe653154c0>
gc: collectable <dict 0x7fbe6530bf80>
gc: collectable <getset_descriptor 0x7fbe65315b40>
gc: collectable <getset_descriptor 0x7fbe65315b80>
gc: done, 15 unreachable, 0 uncollectable, 0.0014s elapsed
gc: collecting generation 0...
gc: objects in each generation: 685 0 39783
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0003s elapsed
gc: collecting generation 0...
gc: objects in each generation: 909 407 39744
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0003s elapsed
gc: collecting generation 0...
gc: objects in each generation: 825 698 39744
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 852 1087 39744
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 826 1371 39730
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 453 1989 39710
gc: objects in permanent generation: 0
gc: done, 0 unreachable, 0 uncollectable, 0.0001s elapsed
gc: collecting generation 0...
gc: objects in each generation: 619 2246 39710
gc: objects in permanent generation: 0
gc: collectable <cell 0x7fbe64f53520>
gc: collectable <tuple 0x7fbe64f69490>
gc: collectable <function 0x7fbe64f52b80>
gc: collectable <File 0x7fbe64f533d0>
gc: collectable <File 0x7fbe64f535e0>
gc: collectable <h5py.h5f.FileID 0x7fbe64f41a40>
gc: collectable <tuple 0x7fbe64f7c380>
gc: collectable <dict 0x7fbe64f75b40>
gc: collectable <dict 0x7fbe64f7c6c0>
gc: collectable <ChainMap 0x7fbe64f53550>
gc: collectable <list 0x7fbe64f7c180>
gc: collectable <dict 0x7fbe64f54880>
gc: collectable <ChainMap 0x7fbe64f69550>
gc: collectable <list 0x7fbe64efb680>
gc: collectable <dict 0x7fbe29b14840>
gc: collectable <ChainMap 0x7fbe64f69580>
gc: collectable <list 0x7fbe64efb780>
gc: collectable <dict 0x7fbe64f75ec0>
gc: collectable <h5py.h5g.GroupID 0x7fbe64eed950>
gc: collectable <Group 0x7fbe64f69460>
gc: collectable <dict 0x7fbe64ef6c80>
gc: collectable <ChainMap 0x7fbe64f69250>
gc: collectable <list 0x7fbe64efb880>
gc: collectable <dict 0x7fbe64eef280>
gc: collectable <_LazyObjectLookup 0x7fbe64f693d0>
gc: collectable <dict 0x7fbe64ef6ac0>
gc: collectable <collections.OrderedDict 0x7fbe64fa4bc0>
gc: collectable <_LazyObjectLookup 0x7fbe64f693a0>
gc: collectable <dict 0x7fbe64f75ac0>
gc: collectable <collections.OrderedDict 0x7fbe64fa4f40>
gc: collectable <cell 0x7fbe64ef5730>
gc: collectable <tuple 0x7fbe64ef56a0>
gc: collectable <function 0x7fbe64f66310>
gc: collectable <File 0x7fbe64f69400>
gc: collectable <File 0x7fbe64f69340>
gc: collectable <h5py.h5f.FileID 0x7fbe64ef3f40>
gc: collectable <tuple 0x7fbe64ef6c40>
gc: collectable <dict 0x7fbe64ef6bc0>
gc: collectable <dict 0x7fbe64eefdc0>
gc: collectable <ChainMap 0x7fbe64f696a0>
gc: collectable <list 0x7fbe64efba00>
gc: collectable <dict 0x7fbe64f7c880>
gc: collectable <ChainMap 0x7fbe64f697c0>
gc: collectable <list 0x7fbe64efb0c0>
gc: collectable <dict 0x7fbe64ef6f00>
gc: collectable <ChainMap 0x7fbe64ef5b20>
gc: collectable <list 0x7fbe64efbd80>
gc: collectable <dict 0x7fbe64ef6b80>
gc: collectable <h5py.h5g.GroupID 0x7fbe64ef3e50>
gc: collectable <Group 0x7fbe64ef58b0>
gc: collectable <dict 0x7fbe64efbb00>
gc: collectable <ChainMap 0x7fbe64ef5880>
gc: collectable <list 0x7fbe64efb300>
gc: collectable <dict 0x7fbe64ef6c00>
gc: collectable <_LazyObjectLookup 0x7fbe64ef59a0>
gc: collectable <dict 0x7fbe64efb280>

xref https://github.com/dask/dask/issues/7547, https://github.com/pangeo-forge/pangeo-forge-recipes/issues/177

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 38 (29 by maintainers)

Commits related to this issue

Most upvoted comments

I would just like to chime in to say THANK YOU to the h5py maintainers for responding so quickly and helpfully to this tricky issue, which has become a major stumbling block for a downstream project. I can’t claim to understand all of the details being discussed here, but it seems we have several promising leads to investigate. 🙏

Yes, there are problems, but that’s why we are here! Probably no one should be monkey-patching instance attributes, and it shouldn’t be that hard to break cycles.

This case was a problem caused by cycles and gc in combination with threads, but I think the general problem here is something we can’t really fix. HDF5 doesn’t expect a file driver to run Python code, with all the runtime complexities that brings.

E.g. if you write a program where more than one thread uses h5py, it’s OK: they won’t get concurrent access to HDF5, but they will all get access in turn. But if you give h5py a Python file-like object, now the same code might be vulnerable to deadlock: one thread could be holding the HDF5 lock and waiting for the GIL (or some application lock) while another has the opposite situation. I can’t really see a way to avoid this in h5py or in fsspec, so we’re left saying to application developers ‘this feature exists, but if you want to use it, be really carefuly with threads and reference cycles’. 😞

Ultimately, the only way to get round it is to reimplement part of HDF5 in a way that fits better with Python code supplying the underlying ‘file’ data. It sounds like this is part of what kerchunk does - reimplement the parts to read and decompress dataset chunks from an HDF5 file. But HDF5 is a massively complex format, and still growing new features, so reimplementing it completely is largely off the table.

+1 to everything Ryan said.

@TomAugspurger If you still have a dev environment set up with this, it might also be worth trying with h5py’s locking disabled

Yeah, that seemed to avoid the hanging. I didn’t get a segfault in the 100 reads I did, but maybe I just got lucky.


I’ll make a small PR to h5netcdf that avoids the problematic circular references. It seems to not hang in basic testing.

I do worry about the fragility of this kind of fix, since it’ll be very easy for h5netcdf to reintroduce a circular reference, and it only fixes things for one library in the stack. But it’s a start.

edit: https://github.com/h5netcdf/h5netcdf/pull/117

ok, a theory:

  1. h5netcdf.core.File is not being explictly closed so it is not explicitly closing the underlying h5py objects
  2. reading from s3 is very slow so h5py is spending a very long time holding phil while waiting from bit
  3. it looks like there is a 1 s poll loop in fspec (if event.wait(1): in the traceback which I assume is some “keep trying for a timeout”
  4. because there is a non-trivial amount of Python being processed on both the forground and background thread eventually someone decides to do a gc run
  5. because the gc runs on the thread that triggers it, if it is the forground thread this is OK because it has phil and the nonlocal close can run, but if it is the background thread we get the deadlock
  6. we do not see this under most circumstances because phil tends to protect things that are not waiting on another thread so if we do get this conflict the background thread just has to wait 😢 but eventually whatever the main thread is doing finishes, gc finishes, and life moves on (with maybe a surprisingly big latency as now the gc may be waiting on disk I/O).

Another test is to disable / renable gc around the read.

Another test to be start putting prints showing the thread name just before grabbing the lock in https://github.com/h5py/h5py/blob/e3740aa808a2d99c827fd84e0ff3e7e038ea75e6/h5py/_objects.pyx#L194-L206