pangeo-forge-recipes: Hanging during `store_chunk` loop

While manually executing https://github.com/pangeo-forge/staged-recipes/pull/66, I encountered recurring hangs at multiple places within this loop:

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/c4867719c4866ec46436b5d53d95ceb470a6f72e/pangeo_forge_recipes/recipes/xarray_zarr.py#L590-L626

To pinpoint the hang, I added these additional logs to xarray_zarr.py. Anecdotally, running with them indicated that the hang occurred most often at the call to np.asarray, but not exclusively at that line. None of the source files' variable arrays for this recipe exceed much more than ~60 MB, and I was running a Large (12 GB - 16 GB) notebook server on https://us-central1-b.gcp.pangeo.io/hub, so it’s hard to imagine this was a memory issue. (Note also that the input-file-to-target-chunk mapping of this recipe is 1:1, and the input files are about ~80 MB in size.)
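
For illustration, here is a minimal, self-contained sketch of the kind of per-step logging I mean; the function and variable names are stand-ins, not the actual xarray_zarr.py code.

import logging

import numpy as np

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")
logger = logging.getLogger("store_chunk_debug")


def store_chunk(vname, lazy_data, target_array, region):
    # Log before and after each potentially blocking step, so a hang shows up
    # as a missing "after" line in the output.
    logger.debug("start %s", vname)
    logger.debug("np.asarray %s", vname)  # the hang appeared most often here
    arr = np.asarray(lazy_data)
    logger.debug("writing %s to %s", vname, region)
    target_array[region] = arr
    logger.debug("done %s", vname)


target = np.zeros(10)
store_chunk("x", np.ones(5), target, slice(0, 5))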

Hangs within the store_chunk loop arose at unpredictable intervals: usually after writing between 5 and 25 chunks, but sometimes only after writing as many as a few hundred. Again anecdotally, the hangs seemed to become more frequent once I’d reached around the 1500th iteration or so.

KeyboardInterrupting and restarting the store_chunk loop from the hang location (without restarting the kernel) always resolved the issue and allowed the write to continue for another 5-25 writes before hanging again.
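
For reference, the workaround loop looked roughly like this; a sketch assuming the manual-execution API of that era (recipe.iter_chunks() and recipe.store_chunk(chunk_key)), with recipe defined as in the notebook linked below.

all_chunks = list(recipe.iter_chunks())
start = 0  # after a hang, bump this to the index of the chunk that hung and re-run

for i in range(start, len(all_chunks)):
    recipe.store_chunk(all_chunks[i])
    print(f"wrote chunk {i + 1} of {len(all_chunks)}")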

Ultimately, I restarted the loop like this a few dozen times, and eventually got all 2117 inputs written. During this process, I tried the following, to no avail:

This feels like an environment or cloud I/O issue to me, but at this point I’m prepared to believe anything.

I’ve made a notebook which reproduces the execution scenario here: https://github.com/cisaacstern/pangeo-forge-debug-examples/tree/soda342-ice … with the caveat that the storage target is an fsspec.LocalFileSystem (since we can’t include cloud creds in a public repo). The Binder link in the README does allow the notebook to be run, but it looks like Binder’s local disk allotment of 500 MB (?) may fill up before the hang is reproduced.
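
Roughly how the notebook swaps cloud storage for local disk; a sketch assuming the FSSpecTarget/CacheFSSpecTarget classes from pangeo_forge_recipes.storage (the attribute names on the recipe are a guess and may differ by version):

import tempfile

from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget

fs = LocalFileSystem()
tmp = tempfile.mkdtemp()

# Write the Zarr target and the cached inputs to local disk instead of a cloud bucket.
recipe.target = FSSpecTarget(fs=fs, root_path=f"{tmp}/target")
recipe.input_cache = CacheFSSpecTarget(fs=fs, root_path=f"{tmp}/cache")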

Most upvoted comments

Just a quick note: I’m running a job to generate STAC items for a bunch of NetCDF files and it seems like the hanging might still be present with h5netcdf 0.12.0, at least for my workload 😦

Now I’m worried that xarray is creating similar references.

I think we already know roughly what the issue is, but here’s a potentially simpler reproducer, with just fsspec and h5netcdf. I suspect we can simplify this a bit by using h5py directly.

import fsspec
import h5netcdf
import h5py


def main():
    url = "https://pangeo.blob.core.windows.net/pangeo-public/hurs_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc"
    for i in range(100):
        # Each iteration opens the remote file over HTTP with fsspec and parses
        # it with h5netcdf; the hang shows up after some number of iterations.
        with fsspec.open(url).open() as o:
            h5netcdf.core.File(o, mode="r", decode_vlen_strings=True)

        print("--------------------------------------", i)


if __name__ == "__main__":
    main()

It’s possible that xarray is being careful with circular references. At least my method for detecting them isn’t showing anything:

import fsspec
import planetary_computer
import xarray as xr
import pympler.garbagegraph
import pympler.refgraph
import gc

# read once to get all the imports cached
credential = planetary_computer.sas.get_token("nasagddp", "nex-gddp-cmip6").token
file = 'nex-gddp-cmip6/NEX/GDDP-CMIP6/ACCESS-CM2/ssp245/r1i1p1f1/hurs/hurs_day_ACCESS-CM2_ssp245_r1i1p1f1_gn_2044.nc'
url = f"https://nasagddp.blob.core.windows.net/{file}?{credential}"
f = fsspec.open(url).open()
ds = xr.open_dataset(f)

# read a second file to avoid the cache
file = 'nex-gddp-cmip6/NEX/GDDP-CMIP6/ACCESS-CM2/ssp245/r1i1p1f1/hurs/hurs_day_ACCESS-CM2_ssp245_r1i1p1f1_gn_2045.nc'
url = f"https://nasagddp.blob.core.windows.net/{file}?{credential}"
f = fsspec.open(url).open()

# track anything that ends up as cyclic garbage from here on
pympler.garbagegraph.start_debug_garbage()
ds = xr.open_dataset(f)

# graph the collectable (cyclic) objects created by the second open, and render it
gb = pympler.garbagegraph.GarbageGraph()
gb.render("garbage2.pdf", format="pdf")

Which gives:

[image: rendered garbage reference graph (garbage2.pdf), showing two reference cycles]

The cycle on the right is caused by a closure in h5netcdf that I missed.

The larger cycle on the left is from fsspec, with circular references between the filesystem, the file, the cache, and the OpenFile object. With some surgery it’s possible to remove that cycle.
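
For anyone who wants to check their own layer for the same pattern, here’s a toy sketch (not the actual h5netcdf code) of how a closure that captures self creates a cycle, and how the standard gc module exposes it:

import gc


class FileLike:
    def __init__(self):
        # The closure captures `self`, creating a cycle:
        # self -> self.close_hook -> closure cell -> self
        self.close_hook = lambda: self.close()

    def close(self):
        pass


gc.disable()
gc.set_debug(gc.DEBUG_SAVEALL)

obj = FileLike()
del obj  # unreachable, but not freed by reference counting because of the cycle

gc.collect()
print(len(gc.garbage), "objects only reclaimable by the cyclic collector")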

But I think this all underscores how fragile this whole I/O layering is. This is something that has to be rock solid. It feels like to do this (reading NetCDF files over the network) properly, we need to be able to pass a URL to the C NetCDF library and have it do the proper range requests, etc. That also seems like a ton of work, so I’m not really sure what to do. I guess we keep patching around things at the Python layer.
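
For what it’s worth, netCDF-C already has a byte-range mode that bypasses the Python file-object layer entirely. A sketch, assuming the installed libnetcdf was built with byte-range support (not every conda/pip build is):

import xarray as xr

url = "https://pangeo.blob.core.windows.net/pangeo-public/hurs_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc"

# The "#mode=bytes" suffix tells the netCDF-C library to issue HTTP range
# requests itself, so no fsspec file object (and no Python-level cycle) is involved.
ds = xr.open_dataset(url + "#mode=bytes", engine="netcdf4")
print(ds)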

h5netcdf 0.12.0 is out and hopefully fixes the problem. If anyone has a workflow to run that semi-reliably triggers the errors, it’d be great to test out before closing this issue (maybe @cisaacstern has one in mind?)

I spent another 30 minutes trying to get a pure h5py + fsspec reproducer, without success. For reference, I’m trying to reproduce the parts of h5netcdf.File.__init__ where things seem to hang:

import fsspec
import h5netcdf
import h5py
import gc
import logging


def main():
    url = "https://pangeo.blob.core.windows.net/pangeo-public/hurs_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc"
    for i in range(100):
        with fsspec.open(url).open() as o:
            # h5netcdf.core.File(o, mode="r", decode_vlen_strings=True)
            h5file = h5py.File(o)
            root = h5file["/"]
            # for key in ["hurs", "lat", "lon", "time"]:
            dim_variable = h5file["/time"]

            # REFERENCE_LIST on a dimension-scale dataset points back at the
            # variables attached to that dimension.
            for ref, _ in dim_variable.attrs["REFERENCE_LIST"]:
                var = root[ref]
                for j, var_d in enumerate(var.dims):
                    print(j, "var_d", var_d)
                    var_d[0].name  # touch the dimension scale's name, as h5netcdf does
                var[:]  # read the full variable
        print("--------------------------------------", i)


if __name__ == "__main__":
    main()

AFAICT, that should be roughly equivalent. I might be making an error (most likely), but it’s possible this is meaningful. If this is really an interaction between the garbage collector, fsspec, and h5py objects, then maybe the reproducer needs to create circular references that get cleaned up during gc (i.e., the circular references that h5netcdf creates). I’ll open an issue on h5py now to see if we can get some additional help debugging.

edit: https://github.com/h5py/h5py/issues/2019
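
One way to probe the “it only hangs when the cyclic collector has work to do” hypothesis would be to bolt an artificial cycle onto the pure-h5py reproducer above. A sketch, not something I’ve confirmed reproduces the hang:

import fsspec
import h5py

url = "https://pangeo.blob.core.windows.net/pangeo-public/hurs_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc"

for i in range(100):
    o = fsspec.open(url).open()
    h5file = h5py.File(o)

    # Artificial cycle: the dict references the open objects and itself, so
    # neither the h5py.File nor the fsspec file is freed by reference counting;
    # cleanup only happens when the cyclic gc runs, as with h5netcdf's closure.
    holder = {"file": h5file, "raw": o, "self": None}
    holder["self"] = holder

    h5file["/time"][:]
    del h5file, o, holder

    print("--------------------------------------", i)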

Ok I have it running. Mine hung on i=8.

I would propose copying this file to a fully public location to eliminate the token stuff. Then let’s open an issue in h5py. They will likely try to kick it back to fsspec, but at least we can try to move forward towards resolving.

Just noting in manual execution of https://github.com/pangeo-forge/staged-recipes/pull/97#issuecomment-990162893 today, this issue remains very much present. KeyboardInterrupting and restarting the loop from the hang location is the workaround.

Not really, no. It’s more of a hope that #218 will close this.

For the present 50-occurrence test, it was always None.

I tried to follow that path a little - glad it’s not the case, as I was really puzzled!

Thanks for continuing to push on this, Charles. I think we need to get to the bottom of the issue with a minimal reproducer.

In the meantime, I plan to work this week on implementing the “reference” reading approach.
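
For context, a rough sketch of what I mean by the “reference” approach, using kerchunk (then fsspec-reference-maker): scan the HDF5 file once to build a byte-range index, then read it as if it were Zarr, so h5py never holds a live network file object during reads.

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "https://pangeo.blob.core.windows.net/pangeo-public/hurs_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc"

# One-time scan: record the byte ranges of every chunk in the HDF5 file.
with fsspec.open(url) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Read via the reference filesystem; all I/O is plain HTTP range requests.
mapper = fsspec.get_mapper("reference://", fo=refs, remote_protocol="https")
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
print(ds)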

Is there a way to completely disable the async layer of fsspec?

Absolutely not - those filesystem implementations call libraries that are themselves pure async.

Two other things that may be worth trying for the sake of elimination (both sketched after the list):

  • running with dask’s “sync” scheduler
  • running without dask but using ordinary threads
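
Hedged sketches of both, assuming the recipe exposes a to_dask() delayed and the manual store_chunk API (as in pangeo-forge-recipes of that era):

import dask

# 1. Run the whole graph on dask's single-threaded "sync" scheduler,
#    to rule out interaction between fsspec/h5py and dask's worker threads.
with dask.config.set(scheduler="synchronous"):
    recipe.to_dask().compute()

# 2. Skip dask entirely and drive store_chunk from ordinary threads.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(recipe.store_chunk, recipe.iter_chunks()))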

OK, so explicitly it’s the mixture of xarray/h5py/fsspec/threads. If gc doesn’t make a difference, then I am once again at a loss.
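
One more elimination step that might be worth a try: take the cyclic collector out of the picture entirely by disabling it around the reads and collecting only at a known-safe point between files. A sketch based on the reproducer above:

import gc

import fsspec
import h5netcdf

url = "https://pangeo.blob.core.windows.net/pangeo-public/hurs_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc"

gc.disable()
try:
    for i in range(100):
        with fsspec.open(url).open() as o:
            h5netcdf.core.File(o, mode="r", decode_vlen_strings=True)
        gc.collect()  # collect explicitly, outside any h5py/fsspec call
        print("--------------------------------------", i)
finally:
    gc.enable()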