kerchunk: [BUG] MultiZarrToZarr fails to concatenate dimension coordinate variables
When combining along a new dimension using coo_map, MultiZarrToZarr fails to concatenate dimension coordinate variables, even though it concatenates non-dimension coordinate variables correctly.
Minimal example:
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr
# Set up some fake netCDF files containing the dimension coordinate 'time'
time1 = xr.Dataset(coords={'time': ('time', [1, 2, 3])})
time2 = xr.Dataset(coords={'time': ('time', [4, 5, 6])})
time1.to_netcdf('test1.nc')
time2.to_netcdf('test2.nc')
# open both files using kerchunk
single_jsons = [
    SingleHdf5ToZarr(filepath, inline_threshold=300).translate()
    for filepath in ['./test1.nc', './test2.nc']
]
# combine along new dimension 'id'
mzz = MultiZarrToZarr(
    single_jsons,
    concat_dims=["id"],
    coo_map={'id': [10, 20]},
)
combined_test_json = mzz.translate()
# open with xarray to see what the result was
combined_test = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": combined_test_json,
        },
        "consolidated": False,
    }
)
combined_test
This is not what I expected. The time variable should have dimensions (time, id), but half of the time values have been lost. The variable time should have been concatenated along id, because I did not list it in identical_dims.
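One way to check what the combined references hold for time, independent of xarray, is to read the Zarr metadata entries directly. The following is a minimal sketch, assuming the reference set follows the kerchunk version 1 layout (a "refs" mapping with "<var>/.zarray" and "<var>/.zattrs" entries, the latter carrying xarray's _ARRAY_DIMENSIONS attribute). The helper name summarize_refs and the toy_refs input are hypothetical: they are not part of kerchunk, and the toy dict is hand-written rather than real output.

```python
import json


def summarize_refs(reference_set):
    """Return {variable: (shape, dims)} for a kerchunk-style reference dict.

    Hypothetical helper, not part of kerchunk. Metadata values are assumed
    to be either JSON strings or already-decoded dicts.
    """
    # Accept either the full reference set or just its "refs" mapping.
    refs = reference_set.get("refs", reference_set)
    summary = {}
    for key, value in refs.items():
        if not key.endswith("/.zarray"):
            continue
        name = key[: -len("/.zarray")]
        zarray = json.loads(value) if isinstance(value, str) else value
        zattrs = refs.get(f"{name}/.zattrs", "{}")
        zattrs = json.loads(zattrs) if isinstance(zattrs, str) else zattrs
        dims = tuple(zattrs.get("_ARRAY_DIMENSIONS", ()))
        summary[name] = (tuple(zarray["shape"]), dims)
    return summary


# Hand-written toy input for illustration only (NOT real kerchunk output).
toy_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "id/.zarray": json.dumps({"shape": [2]}),
        "id/.zattrs": json.dumps({"_ARRAY_DIMENSIONS": ["id"]}),
        "time/.zarray": json.dumps({"shape": [3]}),
        "time/.zattrs": json.dumps({"_ARRAY_DIMENSIONS": ["time"]}),
    },
}

for name, (shape, dims) in summarize_refs(toy_refs).items():
    print(name, shape, dims)
```

Calling summarize_refs(combined_test_json) on the real result would show whether time was written with a single dimension or with both id and time.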
What’s weird is that this works as expected for coordinate variables, just not for dimension coordinates. In other words, if I rename the time variable to time_renamed, but have it still be a function of a dimension named time, then the concatenation happens as expected:
time1 = xr.Dataset(coords={'time_renamed': ('time', [1, 2, 3])})
time2 = xr.Dataset(coords={'time_renamed': ('time', [4, 5, 6])})
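The distinction between the two cases follows the xarray convention that a one-dimensional variable sharing its name with its only dimension is a dimension coordinate. A small sketch of that rule is below; the helper name is_dimension_coordinate is hypothetical and is not part of kerchunk or xarray.

```python
def is_dimension_coordinate(name, dims):
    """True when a variable is 1-D and named after its own dimension.

    Hypothetical helper illustrating the naming convention only.
    """
    return tuple(dims) == (name,)


# 'time' defined on dimension 'time': a dimension coordinate.
print(is_dimension_coordinate("time", ["time"]))          # True
# 'time_renamed' defined on dimension 'time': an ordinary coordinate variable.
print(is_dimension_coordinate("time_renamed", ["time"]))  # False
```

Only the first case triggers the behaviour reported here; the second is concatenated along id as expected.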
About this issue
- State: open
- Created 8 months ago
- Comments: 15 (8 by maintainers)
(None of this is to say that having concat/stack/merge functions or array-standard methods that can be applied tree-wise is a bad idea, but the workflow would be quite different and would put a lot of up-front work on the caller.)
I assume your original situation was thus, but I would also explicitly try having a variable that depends on [id, time] to see if that matters.
This might be the critical thing. Essentially, we want to treat time like a variable for concat purposes, but maintain its attributes such that it is still a coordinate later. It seems like the code treats these two facets the same.