pangeo-forge-recipes: `finalize_target` is slow on large recipes with remote filesystem #987
finalize_target, specifically https://github.com/pangeo-forge/pangeo-forge-recipes/blob/4f8c0522c1395aaf15a820cfd637a1f1e76729b2/pangeo_forge_recipes/recipes/xarray_zarr.py#L558, can be slow.
We call `set(group)` to figure out which variables are available. I didn’t realize it, but apparently that can trigger a full listing of the bucket / storage container, which is slow with many objects. (Zarr has a fast path if the store implements `listdir`: https://github.com/zarr-developers/zarr-python/blob/f0677c2e051cad672e781c6bf63b5edfe57aaca4/zarr/storage.py#L163-L165, but fsspec mappers don’t. The Azure portal does it somehow, so it might be feasible.)
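To illustrate why the listing is expensive, here is a minimal, stdlib-only sketch. `CountingStore`, `ListdirStore`, `list_children`, and `build_keys` are hypothetical names, not zarr or fsspec APIs. The sketch only mimics the behavior described above: a plain mapping has to enumerate every key (chunks included) to discover the top-level names, while a store that exposes `listdir` can answer directly.

```python
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Toy flat key-value store that counts every key it enumerates."""

    def __init__(self, data=None):
        self._data = {}
        self.keys_scanned = 0
        self.update(data or {})

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        for key in self._data:
            self.keys_scanned += 1
            yield key


class ListdirStore(CountingStore):
    """Same store, but it tracks top-level names so it can offer listdir()."""

    def __init__(self, data=None):
        self._top_level = set()
        super().__init__(data)

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._top_level.add(key.split("/", 1)[0])

    def listdir(self, path=""):
        # This toy only supports listing the root of the store.
        if path:
            raise NotImplementedError("sketch only lists the root")
        return sorted(self._top_level)


def list_children(store):
    """Return top-level names, using listdir() when the store provides it."""
    if hasattr(store, "listdir"):
        return sorted(store.listdir(""))
    # Fallback: enumerate every key and keep only the first path segment.
    return sorted({key.split("/", 1)[0] for key in store})


def build_keys(variables, chunks_per_variable):
    """Build a fake Zarr-like layout: group metadata, array metadata, chunks."""
    keys = {".zgroup": b"{}"}
    for name in variables:
        keys[f"{name}/.zarray"] = b"{}"
        for i in range(chunks_per_variable):
            keys[f"{name}/{i}"] = b""
    return keys


keys = build_keys(["lat", "lon", "precip"], chunks_per_variable=1000)
slow, fast = CountingStore(keys), ListdirStore(keys)
print(list_children(slow), slow.keys_scanned)  # scans every key in the store
print(list_children(fast), fast.keys_scanned)  # scans none
```

On a real object store the fallback path corresponds to paging through every object under the prefix, so the cost grows with the number of chunks rather than the number of variables.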
I wonder if we can determine the set of variables based on some other mechanism. For now, I’m going to manually run finalize_target with hard-coded variables for the dataset I’m working on (gpm-imerg).
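One possible alternative mechanism, sketched under assumptions: if the candidate variable names are already known (for example hard-coded, as above), each one can be checked with a single key lookup instead of listing the whole store. The helper name `existing_arrays` and the variable names are hypothetical. The only format detail relied on is that Zarr v2 keeps each array's metadata under `<name>/.zarray`.

```python
def existing_arrays(store, candidate_names):
    """Return the candidates that exist as Zarr v2 arrays in `store`.

    `store` is any mapping of keys to bytes. Each check is one membership
    test on a known key, so the store never has to be listed.
    """
    return [name for name in candidate_names if f"{name}/.zarray" in store]


# A plain dict stands in for the remote store in this sketch.
store = {
    ".zgroup": b"{}",
    "time/.zarray": b"{}",
    "time/0": b"",
    "precip/.zarray": b"{}",
    "precip/0.0.0": b"",
}
found = existing_arrays(store, ["time", "lat", "precip"])
print(found)
```

The trade-off is that the caller has to supply the candidate names, so variables that were not anticipated are silently missed.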
cc @sharkinsspatial, I think that this is one of the reasons finalize_target was taking so long for you.
edit: Oh I just discovered the `walk_blobs` method in `azure.storage.blob`. Perhaps fsspec could use walk to implement `listdir`. I’ll investigate and open an fsspec issue if it seems promising.
About this issue
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 17 (17 by maintainers)
Commits related to this issue
- Avoid costly `set(group)` operation in finalize_target. As pointed out in #254, taking a set difference happens to be an expensive operation for large datasets. Specifically, taking `set(group)` list... — committed to alxmrs/pangeo-forge-recipes by alxmrs 2 years ago
Trying to understand next steps here:
I think having both `fsspec.mapping.FSMap` and `zarr.storage.FSStore` is confusing. They both implement the mutable mapping interface and so both can be passed to `zarr.open`, but one might be better than the other. We should remove that confusion by having just one implementation. IMO, that implementation should probably be in `fsspec`, but that doesn’t really matter to me.
In the meantime, I think pangeo-forge-recipes should update `FSSpecTarget.get_mapper`. I think, however, `FSSpecTarget` will need to be updated to include the `storage_options` used to create the fsspec filesystem instance (so that they can be passed through to `zarr.storage.FSStore` as `**storage_options`).
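A rough sketch of that last point, under stated assumptions: `TargetSketch`, `get_store`, and `fake_store_factory` are hypothetical names and not the pangeo-forge-recipes API, and the URL and option values are placeholders. The idea is only that the target keeps the options used to build its filesystem, so they can later be forwarded as keyword arguments to whichever store constructor is chosen.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class TargetSketch:
    """Hypothetical stand-in for a target that remembers its storage options."""

    root_path: str
    storage_options: Dict[str, Any] = field(default_factory=dict)

    def get_store(self, store_factory: Callable[..., Any]) -> Any:
        # Forward the saved options as keyword arguments. With zarr-python 2.x
        # the factory could be zarr.storage.FSStore (assumption, not tested here).
        return store_factory(self.root_path, **self.storage_options)


def fake_store_factory(url, **storage_options):
    """Stand-in factory that just records what it was called with."""
    return {"url": url, "storage_options": storage_options}


# Placeholder URL and credentials, for illustration only.
target = TargetSketch("abfs://container/example.zarr", {"account_name": "example"})
store = target.get_store(fake_store_factory)
print(store)
```

Passing the factory in keeps the sketch independent of which implementation ends up being the single supported one.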