pangeo-forge-recipes: Intermittent hanging when opening cached files without failure.

In recent attempts at running the SMAP SSS Recipe we have noted intermittent hanging when opening files from cache via _maybe_open_or_copy_to_local. These are dask worker log entries from a recently attempted run where workers were hung:

INFO:pangeo_forge_recipes.recipes.xarray_zarr:Opening 'https://podaac-opendap.jpl.nasa.gov/opendap/allData/smap/L3/RSS/V4/monthly/SCI/2015//RSS_smap_SSS_L3_monthly_2015_08_FNL_v04.0.nc' from cache
INFO:pangeo_forge_recipes.recipes.xarray_zarr:Opening 'https://podaac-opendap.jpl.nasa.gov/opendap/allData/smap/L3/RSS/V4/monthly/SCI/2015//RSS_smap_SSS_L3_monthly_2015_11_FNL_v04.0.nc' from cache

Interestingly, the files which hang during cache retrieval differ between tests, but consistently 3 workers seem to hang. Not sure if this indicates some type of concurrency issue with cache retrieval.

Besides blocking the recipe, this is problematic because these operations are asynchronous: a hung worker can be left in an idle but apparently healthy state, which prevents the scheduler from cleaning it up.

@rabernat I had a look over the storage refactoring PR but I’m unsure whether it addresses this issue. Do you have recommendations on any additional logging or telemetry you would like me to collect that might help in diagnosing the root cause?

If you feel the storage refactoring PR might address this issue, can we craft a new release which includes it so it can be deployed in our staged-recipes CI workflows?
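
On the question of additional telemetry for a hang like this: one generic option (not something pangeo-forge-recipes provides, as far as this thread shows) is Python's standard-library faulthandler, which can dump the stack of every thread if a call is still running after a timeout. Below is a minimal sketch; the helper name run_with_hang_dump and its parameters are hypothetical, and how it would be wired around the cache-open call is an assumption.

```python
import faulthandler
import sys


def run_with_hang_dump(func, timeout_s, log_file=sys.stderr):
    """Call func() and return its result.

    If func() is still running after timeout_s seconds, the tracebacks of
    all threads are written to log_file. The call itself is NOT interrupted;
    this only records where the process is stuck.

    Hypothetical helper for illustration. log_file must be a real file
    (it needs a file descriptor), e.g. sys.stderr or an opened log file.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return func()
    finally:
        # Disarm the watchdog so a later, unrelated call is not reported.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the suspect call this way on a worker would show, in the worker log, which frame each thread is blocked in when the hang occurs, which might help distinguish a network stall from a lock or event-loop problem.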

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 27 (27 by maintainers)

Most upvoted comments

FYI, we believe that @martindurant just fixed the root issue in #150.

🎊 @rabernat I ran a reduced dimension set of the noaa-oisst recipe successfully with copy_input_to_local_file=True configured. One question here is whether we have any heuristics for the maximum temporary storage that may be required. In the AWS bakery, the workers are configured with a default of 20GB. This is expandable to 200GB (which could be configured via a recipe’s meta.yaml).
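
As a starting point for such a heuristic, here is a rough sketch of one possible upper bound. It assumes (and this is an assumption, not documented behaviour) that each concurrently running task on a worker holds at most one local copy of an input file at a time and removes it afterwards, so the worst case is the N largest inputs landing on the same worker. The function name, parameters, and safety factor are all hypothetical.

```python
def estimate_peak_temp_bytes(input_sizes_bytes, concurrent_tasks_per_worker,
                             safety_factor=1.5):
    """Rough upper bound on per-worker temporary storage, in bytes.

    Assumption: every concurrent task keeps one input file on local disk,
    so the worst case is the N largest inputs being processed together.
    safety_factor adds headroom for partial downloads and other temp files.
    """
    largest = sorted(input_sizes_bytes, reverse=True)[:concurrent_tasks_per_worker]
    return int(sum(largest) * safety_factor)


def fits_on_worker(input_sizes_bytes, concurrent_tasks_per_worker,
                   worker_disk_bytes, safety_factor=1.5):
    """True if the estimated peak usage fits within the worker's disk."""
    peak = estimate_peak_temp_bytes(
        input_sizes_bytes, concurrent_tasks_per_worker, safety_factor
    )
    return peak <= worker_disk_bytes
```

For example, with the 20GB default mentioned above, fits_on_worker(sizes, n_tasks, 20 * 10**9) would flag recipes whose inputs might need the larger 200GB setting, given a list of input file sizes in bytes.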