pangeo-forge-recipes: Potentially (M)RE for 'hanging' during rechunk based on CMIP6 recipes.

As discussed in the last community meeting and on the call with @moradology and @ranchodeluxe today, I will try to boil down one of the CMIP6 feedstock examples that I observed to ‘hang’ (not completing within several hours, even though the caching was already complete!).

The dataset unique identifier is 'CMIP6.CMIP.CMCC.CMCC-ESM2.historical.r1i1p1f1.3hr.pr.gn.v20210114' and the file urls are:

['https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_185001010130-185412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_185501010130-185912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_186001010130-186412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_186501010130-186912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_187001010130-187412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_187501010130-187912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_188001010130-188412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_188501010130-188912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_189001010130-189412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_189501010130-189912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_190001010130-190412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_190501010130-190912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_191001010130-191412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_191501010130-191912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_192001010130-192412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_192501010130-192912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_193001010130-193412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_193501010130-193912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_194001010130-194412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_194501010130-194912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_195001010130-195412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_195501010130-195912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_196001010130-196412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_196501010130-196912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_197001010130-197412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_197501010130-197912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_198001010130-198412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_198501010130-198912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_199001010130-199412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_199501010130-199912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_200001010130-200412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_200501010130-200912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_201001010130-201412312230.nc']

A stripped down recipe to reproduce this could look like this:

iid = 'CMIP6.CMIP.CMCC.CMCC-ESM2.historical.r1i1p1f1.3hr.pr.gn.v20210114'

urls = [
'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_185001010130-185412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_185501010130-185912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_186001010130-186412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_186501010130-186912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_187001010130-187412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_187501010130-187912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_188001010130-188412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_188501010130-188912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_189001010130-189412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_189501010130-189912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_190001010130-190412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_190501010130-190912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_191001010130-191412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_191501010130-191912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_192001010130-192412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_192501010130-192912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_193001010130-193412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_193501010130-193912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_194001010130-194412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_194501010130-194912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_195001010130-195412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_195501010130-195912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_196001010130-196412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_196501010130-196912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_197001010130-197412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_197501010130-197912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_198001010130-198412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_198501010130-198912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_199001010130-199412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_199501010130-199912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_200001010130-200412312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_200501010130-200912312230.nc', 'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/CMCC/CMCC-ESM2/historical/r1i1p1f1/3hr/pr/gn/v20210114/pr_3hr_CMCC-ESM2_historical_r1i1p1f1_gn_201001010130-201412312230.nc']

pattern = pattern_from_file_sequence(
        urls,
        concat_dim='time'
        )
recipes= (
        f"Creating {iid}" >> beam.Create(pattern.items())
        | OpenURLWithFSSpec()
         # do not specify file type to accomodate both ncdf3 and ncdf4
        | OpenWithXarray(xarray_open_kwargs={"use_cftime":True})
        | Preprocessor()
        | StoreToZarr(
            store_name=f"{iid}.zarr",
            combine_dims=pattern.combine_dim_keys,
            dynamic_chunking_fn=dynamic_chunking_func,
            )
        | ConsolidateDimensionCoordinates()
        | ConsolidateMetadata()
        )

(to be revised, just leaving this here as a starting point)

@ranchodeluxe do you mind sending me the steps you use to run this on a jupyterhub?

Will fill this in as I go along.

About this issue

  • Original URL
  • State: open
  • Created 3 months ago
  • Reactions: 2
  • Comments: 42 (42 by maintainers)

Most upvoted comments

Some strategies to further simplify this:

  • Hardcode the target_chunks (and maybe try a bunch of different combos to see if some work, e.g. pure time chunking)
  • Reduce the number of files (is there some number that works)

I am sorry there must be something clearly wrong with the way I am testing this locally (on the LEAP hub). I just reran the example and we get the expected (2 fragement) logging call:

"Combining group = ((\'x\', 0),), containing fragments = [({Dimension(name=\'time\', operation=<CombineOp.CONCAT: 2>): IndexedPosition(value=2, indexed=True, dimsize=4), Dimension(name=\'x\', operation=<CombineOp.CONCAT: 2>): IndexedPosition(value=0, indexed=True, dimsize=2)}, <xarray.Dataset> Size: 40B\nDimensions:  (x: 1, time: 2)\nCoordinates:\n  * x        (x) int64 8B 0\n  * time     (time) int64 16B 2 3\nData variables:\n    var      (x, time) int64 16B ...), ({Dimension(name=\'time\', operation=<CombineOp.CONCAT: 2>): IndexedPosition(value=0, indexed=True, dimsize=4), Dimension(name=\'x\', operation=<CombineOp.CONCAT: 2>): IndexedPosition(value=0, indexed=True, dimsize=2)}, <xarray.Dataset> Size: 40B\nDimensions:  (x: 1, time: 2)\nCoordinates:\n  * x        (x) int64 8B 0\n  * time     (time) int64 16B 0 1\nData variables:\n    var      (x, time) int64 16B ...)]"

Somehow things are persisting across running the recipe multiple times from a local repo, which might be related to git or how things are saved/not saved on the hub 😡 . Strong point for what you said above about testing without beam!

I am very sorry for all the noise and loss of focus. I think I have convinced myself that the rechunking logic works as intended here and that the issue lies elsewhere.

can you confirm that the basic assumption here is correct:

Not without a simpler reproducer that just involves the functions split_fragment and combine_fragments. 😉 What you’re saying sounds generally correct, but there is so much complexity around the full end-to-end recipe that I hesitate to make any claims with certainty.

It is from the updated version with explicit chunks for all existing dimensions!

But then how is the dataset being written correctly? Could you show the full logging output?

I’d still encourage you to remove the PTransforms and Pipelines from the mix to focus on the core rechunking logic, which is quite simple:

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/55fff0b30ce47de7fde51fe9fe30195c0b7a637b/pangeo_forge_recipes/transforms.py#L395-L402

FWIW I think it would be helpful to keep working on a schematic like the one above and label the following terms in there:

  • Files
  • Fragments
  • Sub-Fragments
  • Groups (of fragments)
  • Target chunks

I might have missed/double counted some terms, but having those enumerated might help with future debugging/refactoring.

File “/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/asyn.py”, line 91, in sync if event.wait(1):

File “/usr/local/lib/python3.10/threading.py”, line 607, in wait signaled = self._cond.wait(timeout)

File “/usr/local/lib/python3.10/threading.py”, line 324, in wait gotit = waiter.acquire(True, timeout)

Given this beautiful traceback I say the next steps to reproduce are:

  1. make sure the gcfs cache is enable
  2. write a simpler recipe that just loops through all gcsf files and does either only or some partial steps from combine_fragments to mimic that data read pattern
  3. if the first two steps can reproduce the hang then maybe we play with disabling fsspec caching

From the worker logs of the OG test_short_dynamic_chunks:

Operation ongoing in bundle process_bundle-2974329341843856250-42 for PTransform{name=Creating CMIP6.CMIP.CMCC.CMCC-ESM2.historical.r1i1p1f1.3hr.pr.gn.v20210114|OpenURLWithFSSpec|OpenWithXarray|Preprocessor|StoreToZarr|ConsolidateDimensionCoordinates|ConsolidateMetadata/StoreToZarr/Rechunk/MapTuple(combine_fragments)-ptransform-46, state=process-msecs} for at least 5827.01 seconds without outputting or completing.
Current Traceback:
  File "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()

  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()

  File "/usr/local/lib/python3.10/site-packages/apache_beam/utils/thread_pool_executor.py", line 53, in run
    self._work_item.run()

  File "/usr/local/lib/python3.10/site-packages/apache_beam/utils/thread_pool_executor.py", line 37, in run
    self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/sdk_worker.py", line 385, in task
    self._execute(

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/sdk_worker.py", line 311, in _execute
    response = task()

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/sdk_worker.py", line 386, in <lambda>
    lambda: self.create_worker().do_instruction(request), request)

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/sdk_worker.py", line 650, in do_instruction
    return getattr(self, request_type)(

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/sdk_worker.py", line 688, in process_bundle
    bundle_processor.process_bundle(instruction_id))

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1113, in process_bundle
    input_op_by_transform_id[element.transform_id].process_encoded(

  File "/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/bundle_processor.py", line 237, in process_encoded
    self.output(decoded_value)

  File "/tmp/cf09f9f279ec355c109713fd5ed97e51c62f0daeb39deea8ba13555949327808qx0kye0e/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2040, in <lambda>

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/rechunking.py", line 240, in combine_fragments
    ds_combined = xr.combine_nested(dsets_to_concat, concat_dim=concat_dims_sorted)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/combine.py", line 577, in combine_nested
    return _nested_combine(

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/combine.py", line 356, in _nested_combine
    combined = _combine_nd(

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/combine.py", line 232, in _combine_nd
    combined_ids = _combine_all_along_first_dim(

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/combine.py", line 267, in _combine_all_along_first_dim
    new_combined_ids[new_id] = _combine_1d(

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/combine.py", line 290, in _combine_1d
    combined = concat(

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/concat.py", line 250, in concat
    return _dataset_concat(

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/concat.py", line 631, in _dataset_concat
    combined_var = concat_vars(

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/variable.py", line 2925, in concat
    return Variable.concat(variables, dim, positions, shortcut, combine_attrs)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/variable.py", line 1693, in concat
    data = duck_array_ops.concatenate(arrays, axis=axis)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 356, in concatenate
    return _concatenate(as_shared_dtype(arrays), axis=axis)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 232, in as_shared_dtype
    arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 232, in <listcomp>
    arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 219, in asarray
    return data if is_duck_array(data) else xp.asarray(data)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/indexing.py", line 693, in __array__
    return np.asarray(self.get_duck_array(), dtype=dtype)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/indexing.py", line 696, in get_duck_array
    self._ensure_cached()

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/indexing.py", line 690, in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/indexing.py", line 664, in get_duck_array
    return self.array.get_duck_array()

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/indexing.py", line 551, in get_duck_array
    array = self.array[self.key]

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 51, in __getitem__
    return indexing.explicit_indexing_adapter(

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/core/indexing.py", line 858, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/xarray/backends/h5netcdf_.py", line 58, in _getitem
    return array[key]

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/h5netcdf/core.py", line 347, in __getitem__
    return self._h5ds[key]

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/h5py/_hl/dataset.py", line 758, in __getitem__
    return self._fast_reader.read(args)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/spec.py", line 1856, in readinto
    data = self.read(out.nbytes)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/spec.py", line 1846, in read
    out = self.cache._fetch(self.loc, self.loc + length)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/caching.py", line 189, in _fetch
    self.cache = self.fetcher(start, end)  # new block replaces old

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/gcsfs/core.py", line 1921, in _fetch_range
    return self.gcsfs.cat_file(self.path, start=start, end=end)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)

  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/asyn.py", line 91, in sync
    if event.wait(1):

  File "/usr/local/lib/python3.10/threading.py", line 607, in wait
    signaled = self._cond.wait(timeout)

  File "/usr/local/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)

I am also calling that the short example is in fact stuck! image That is nice and reduces the size of the problem considerably!

Got 4 examples in the oven image

I ll have to wait ~1 hour to be sure things are stuck.

Ok this is interesting. It seems that even only using 3 files (of 33) is enough to trigger this! If that holds true this would make this example a lot more minimal.