kerchunk: [BUG] MultiZarrToZarr fails to concatenate dimension coordinate variables

When combining along a new dimension using coo_map, MultiZarrToZarr fails to concatenate dimension coordinate variables, even though it concatenates non-dimension coordinate variables just fine.

Minimal example:

import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr


# Set up some fake netCDF files containing the dimension coordinate 'time'
time1 = xr.Dataset(coords={'time': ('time', [1, 2, 3])})
time2 = xr.Dataset(coords={'time': ('time', [4, 5, 6])})
time1.to_netcdf('test1.nc')
time2.to_netcdf('test2.nc')


# open both files using kerchunk
single_jsons = [SingleHdf5ToZarr(filepath, inline_threshold=300).translate() for filepath in ['./test1.nc', './test2.nc']]

# combine along new dimension 'id'
mzz = MultiZarrToZarr(
    single_jsons,
    concat_dims=["id"],
    coo_map={'id': [10, 20]},
)
combined_test_json = mzz.translate()

# open with xarray to see what the result was
combined_test = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": combined_test_json,
        },
        "consolidated": False,
    }
)
combined_test

[Screenshot: xarray repr of combined_test]

This is not what I expected - the time variable should have dimensions (time, id) - we’ve lost half the time values. The variable time should have been concatenated along id because I did not specify it in identical_dims.

What’s weird is that this works as expected for coordinate variables, just not for dimension coordinates. In other words, if I rename the time variable to time_renamed, but have it still be a function of a dimension named time, then the concatenation happens as expected:

time1 = xr.Dataset(coords={'time_renamed': ('time', [1, 2, 3])})
time2 = xr.Dataset(coords={'time_renamed': ('time', [4, 5, 6])})

[Screenshot: xarray repr of the combined dataset with time_renamed]
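
A quick way to compare the two cases without going through xarray is to look at the combined references directly. The sketch below assumes the dict returned by mzz.translate() follows the kerchunk version 1 layout, i.e. a "refs" mapping in which the .zarray and .zattrs entries are JSON strings. summarize_refs is a hypothetical helper written for this illustration, not part of kerchunk.

```python
import json


def summarize_refs(combined):
    """Summarize shape, dimension names and chunk keys per variable.

    Assumes the kerchunk version 1 layout: combined["refs"] maps zarr keys
    to values, with .zarray/.zattrs metadata stored as JSON strings.
    summarize_refs is a hypothetical helper, not a kerchunk function.
    """
    refs = combined.get("refs", combined)
    summary = {}
    for key, value in refs.items():
        if "/" not in key:
            continue  # skip root-level entries such as .zgroup
        var, _, leaf = key.rpartition("/")
        entry = summary.setdefault(var, {"shape": None, "dims": None, "chunks": []})
        if leaf == ".zarray":
            meta = json.loads(value) if isinstance(value, str) else value
            entry["shape"] = meta["shape"]
        elif leaf == ".zattrs":
            meta = json.loads(value) if isinstance(value, str) else value
            entry["dims"] = meta.get("_ARRAY_DIMENSIONS")
        elif not leaf.startswith("."):
            entry["chunks"].append(leaf)  # chunk keys such as "0" or "1.0"
    for entry in summary.values():
        entry["chunks"].sort()
    return summary
```

Calling summarize_refs(combined_test_json) for both variants should make the difference visible: if the behaviour described above holds, time keeps a one-dimensional shape, while time_renamed gains the extra id dimension.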

About this issue

  • State: open
  • Created 8 months ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

(None of this is to say that having concat/stack/merge functions or array-standard methods, which can be applied tree-wise is a bad idea; but the workflow would be quite different and put a lot of up-front work on the caller)

I assume your original situation was thus, but I would explicitly try also having a variable that depends on [id, time] to see if that matters.

It must be more than just the _ARRAY_DIMENSIONS. I will look into that.

This might be the critical thing. Essentially, we want to treat time like a variable for concat purposes, but maintain its attributes such that it is still a coordinate later. It seems like the code treats these two facets the same.