pangeo-forge-recipes: Hanging during `store_chunk` loop
While manually executing https://github.com/pangeo-forge/staged-recipes/pull/66, I encountered recurring hangs at multiple places within this loop:
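The loop itself isn't reproduced in this excerpt; for context, here is a sketch of the manual execution pattern pangeo-forge-recipes documented around that time, assuming `recipe` is the `XarrayZarrRecipe` instance from the linked PR (method names may differ slightly between versions):

```python
# Sketch only: manual execution of a pangeo-forge recipe, assuming `recipe` is the
# XarrayZarrRecipe instance from the linked PR. Method names follow the API of that
# era and may differ slightly between versions.
for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)

recipe.prepare_target()

for chunk_key in recipe.iter_chunks():
    recipe.store_chunk(chunk_key)  # <- the intermittent hangs happen somewhere in here

recipe.finalize_target()
```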
In order to pinpoint the hang, I added some additional logs to `xarray_zarr.py`. Anecdotally, running with them indicated that the hang occurred most often at the call to `np.asarray`, but not exclusively at that line. None of the source file variable arrays for this recipe exceed much more than ~60 MB, and I was running a Large (12 GB - 16 GB) notebook server on https://us-central1-b.gcp.pangeo.io/hub, so it’s hard to imagine it was a memory issue. (Note also that the input-file-to-target-chunk mapping of this recipe is 1:1, and the input files are ~80 MB each.)
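The patch itself isn't shown here; as a related, self-contained step, the package's existing debug logging can be surfaced while the loop runs, which helps narrow down where it stalls (the logger name is assumed to match the package name):

```python
# Enable pangeo-forge-recipes' existing debug logging in the notebook session.
import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

logger = logging.getLogger("pangeo_forge_recipes")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
```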
Hanging within the store_chunk loop arose at unpredictable intervals: usually after writing between 5 and 25 chunks, but sometimes after writing as many as a few hundred. Again anecdotally, it seemed that the hangs became more frequent once I’d reached around the 1500th loop or so.
KeyboardInterrupting and restarting the store_chunk loop from the hang location (without restarting the kernel) always resolved the issue and allowed the write to continue for another 5-25 writes before hanging again.
Ultimately, I restarted the loop like this a few dozen times, and eventually got all 2117 inputs written. During this process, I tried the following, to no avail:
- switching to https://staging.us-central1-b.gcp.pangeo.io/hub
- switching back to https://us-central1-b.gcp.pangeo.io/hub, and selectively updating all packages which could be involved in this loop to their latest versions: `xarray`, `numpy`, `dask`, `distributed`, `fsspec`, `s3fs`, and `gcsfs`
- removing an explicit `target_chunks` kwarg from the recipe in https://github.com/pangeo-forge/staged-recipes/pull/66/commits/3b9d3fa151b6e6e12b837af391ed38f159c7fd8d, in case that was somehow redundant with `nitems_per_file=1`
- changing the call to `np.asarray` to `var.to_numpy()` (which was admittedly unlikely to help, given that xarray implements `to_numpy` with `np.asarray` internally, and also because the hang was not exclusively on that line)
This feels like an environment or cloud I/O issue to me, but at this point I’m prepared to believe anything.
I’ve made a notebook which reproduces the execution scenario here: https://github.com/cisaacstern/pangeo-forge-debug-examples/tree/soda342-ice … with the caveat that the storage target is an `fsspec.LocalFileSystem` (since we can’t include cloud creds in a public repo). The Binder link in the README does allow the notebook to be run, but it looks like Binder’s local disk allotment of 500 MB (?) may fill up before the hang is reproduced.
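For reference, a sketch of how a recipe can be pointed at local storage instead of cloud storage; the class names come from `pangeo_forge_recipes.storage`, but the recipe attributes shown are assumptions and vary by package version, so treat this as illustrative rather than a drop-in snippet:

```python
# Illustrative: swap cloud storage for local disk using fsspec's LocalFileSystem.
# Class names are from pangeo_forge_recipes.storage; the recipe attributes below
# are assumptions and differ across package versions.
import tempfile

from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget, MetadataTarget

fs = LocalFileSystem()
tmpdir = tempfile.mkdtemp()

recipe.target = FSSpecTarget(fs=fs, root_path=f"{tmpdir}/target.zarr")
recipe.input_cache = CacheFSSpecTarget(fs=fs, root_path=f"{tmpdir}/cache")
recipe.metadata_cache = MetadataTarget(fs=fs, root_path=f"{tmpdir}/metadata")
```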
About this issue
- State: closed
- Created 3 years ago
- Comments: 73 (70 by maintainers)
Just a quick note: I’m running a job to generate STAC items for a bunch of NetCDF files and it seems like the hanging might still be present with h5netcdf 0.12.0, at least for my workload 😦
Now I’m worried that xarray is creating similar references.
I think we already know roughly what the issue is, but here’s a potentially simpler reproducer, with just fsspec and h5netcdf. I suspect we can simplify this a bit by using h5py directly.
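The original script isn't included in this excerpt; a minimal sketch of what such an fsspec + h5netcdf reproducer could look like (URL hypothetical):

```python
# Minimal sketch (not the original reproducer): repeatedly open a remote NetCDF4
# file with h5netcdf through an fsspec file object until a hang occurs.
import fsspec
import h5netcdf

url = "gs://some-bucket/some-file.nc"  # hypothetical remote NetCDF4 file

for i in range(100):
    with fsspec.open(url, mode="rb") as f:
        with h5netcdf.File(f, mode="r") as ds:
            for name, var in ds.variables.items():
                _ = var.shape  # touch metadata to force byte-range reads
    print(f"iteration {i} ok")
```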
It’s possible that xarray is being careful with circular references. At least, my method for detecting them isn’t showing anything.
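As a sketch, one simple way to check for such cycles (not necessarily the method referenced above) is to ask the cyclic collector to keep everything it frees and then inspect it:

```python
# Sketch: look for xarray objects that were only reachable via reference cycles.
import gc

import xarray as xr

gc.set_debug(gc.DEBUG_SAVEALL)  # unreachable objects go to gc.garbage instead of being freed

ds = xr.open_dataset("some-file.nc", engine="h5netcdf")  # hypothetical local file
ds.close()
del ds
gc.collect()

xarray_cycles = [o for o in gc.garbage if type(o).__module__.startswith("xarray")]
print(f"{len(xarray_cycles)} xarray objects were only reachable through reference cycles")
```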
Which gives:
The cycle on the right is caused by a closure in h5netcdf that I missed.
The larger cycle on the left is from fsspec, with circular references between the filesystem, the file, the cache, and the OpenFile object. With some surgery it is possible to remove that cycle.
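As a generic illustration (not fsspec's actual classes), the kind of surgery that removes such a cycle is to replace a strong back-reference with a weak one:

```python
# Generic illustration of breaking a parent<->child reference cycle with weakref;
# these classes are stand-ins, not fsspec's real FileSystem/OpenFile objects.
import weakref


class FileSystem:
    def __init__(self):
        self.open_files = []


class File:
    def __init__(self, fs):
        self._fs_ref = weakref.ref(fs)  # weak back-reference, so no strong-reference cycle
        fs.open_files.append(self)

    @property
    def fs(self):
        return self._fs_ref()  # may be None if the filesystem has been collected
```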
But I think this all underscores how fragile this whole I/O layering is. This is something that has to be rock solid. It feels like, to do this properly (reading NetCDF files over the network), we need to be able to pass a URL to the C NetCDF library and have it do the proper range requests, etc. That also seems like a ton of work, so I’m not really sure what to do. I guess we keep patching around things at the Python layer.
h5netcdf 0.12.0 is out and hopefully fixes the problem. If anyone has a workflow to run that semi-reliably triggers the errors, it’d be great to test out before closing this issue (maybe @cisaacstern has one in mind?)
I spent another 30 minutes trying to get a pure h5py + fsspec reproducer, without success. For reference, I’m trying to reproduce the parts of `h5netcdf.File.__init__` where things seem to hang. AFAICT, that should be roughly equivalent. I might be making an error (most likely), but it’s possible this is meaningful. If this is actually an interaction between the garbage collector, fsspec, and h5py objects, then maybe we need to be making circular references that get cleaned up during the gc (i.e. h5netcdf is creating those circular references). I’ll open an issue on h5py now to see if we can get some additional help debugging.
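For illustration, a rough sketch of that kind of pure h5py + fsspec loop (URL hypothetical, and only an approximation of what `h5netcdf.File.__init__` does on open):

```python
# Rough sketch, not the author's exact attempt: open the file over fsspec, walk the
# HDF5 tree roughly the way h5netcdf does on open, and force a collection in
# between iterations to mimic the suspected gc interaction.
import gc

import fsspec
import h5py

url = "gs://some-bucket/some-file.nc"  # hypothetical remote NetCDF4/HDF5 file


def _read_attrs(name, obj):
    dict(obj.attrs)  # touch group/dataset attributes; return None so the walk continues


for i in range(100):
    with fsspec.open(url, mode="rb") as f:
        with h5py.File(f, mode="r") as h5:
            h5.visititems(_read_attrs)
    gc.collect()
    print(f"iteration {i} ok")
```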
edit: https://github.com/h5py/h5py/issues/2019
Ok I have it running. Mine hung on `i=8`. I would propose copying this file to a fully public location to eliminate the token stuff. Then let’s open an issue in h5py. They will likely try to kick it back to fsspec, but at least we can try to move forward towards resolving it.
Just noting in manual execution of https://github.com/pangeo-forge/staged-recipes/pull/97#issuecomment-990162893 today, this issue remains very much present.
KeyboardInterrupting and restarting the loop from the hang location is the workaround.

Not really, no. It’s more of a hope that #218 will close this.
I tried to follow that path a little - glad it’s not the case, as I was really puzzled!
Thanks for continuing to push on this, Charles. I think we need to get to the bottom of the issue with a minimal reproducer.
In the meantime, I plan to work this week on implementing the “reference” reading approach.
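For context, a minimal sketch of the "reference" reading pattern with fsspec's ReferenceFileSystem, assuming a pre-generated references JSON (paths hypothetical); byte ranges are then served by fsspec/zarr rather than by the HDF5 library at read time:

```python
# Sketch: open a dataset via a kerchunk-style references file (paths hypothetical).
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "reference",
    fo="gs://some-bucket/references.json",  # hypothetical references file
    remote_protocol="gs",
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)
```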
Absolutely not - they call libraries that are implemented in pure-async.
Two other things that may be worth trying, for the sake of elimination:
OK, so explicitly it’s the mixture of xarray/h5py/fsspec/threads. If gc doesn’t make a difference, then I am once again at a loss.
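As a sketch, one concrete way to run that elimination test is to turn off automatic cyclic garbage collection around the write loop (reusing the `recipe` object from the manual-execution sketch above) and see whether the hangs disappear:

```python
# Elimination test: disable automatic cyclic garbage collection for the duration
# of the write loop; `recipe` is the same object as in the earlier sketch.
import gc

gc.disable()
try:
    for chunk_key in recipe.iter_chunks():
        recipe.store_chunk(chunk_key)
finally:
    gc.enable()
```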