xarray: `to_zarr` with append or region mode and `_FillValue` doesn't work
What happened?
import numpy as np
import xarray as xr
ds = xr.Dataset({"a": ("x", [3.], {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim="x")
raises
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
What did you expect to happen?
I’d expect this to just work (effectively concatenating the dataset to itself).
Anything else we need to know?
appears also for region writes
The same issue appears for region writes as in:
import numpy as np
import dask.array as da
import xarray as xr
ds = xr.Dataset({"a": ("x", da.array([3.,4.]), {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m, compute=False, encoding={"a": {"chunks": (1,)}})
ds.isel(x=slice(0,1)).to_zarr(m, region={"x": slice(0,1)})
raises
ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
there’s a workaround
The workaround (deleting the _FillValue in subsequent writes):
m = {}
ds.to_zarr(m)
del ds.a.attrs["_FillValue"]
ds.to_zarr(m, append_dim="x")
seems to do the trick.
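For datasets with several variables, the same workaround can be wrapped in a small helper. This is only a sketch: `drop_fill_value_attrs` is a hypothetical name, not part of xarray, and it assumes that a shallow `Dataset.copy()` gives each variable its own attrs dict, so the original dataset keeps its attribute.

```python
import numpy as np
import xarray as xr


def drop_fill_value_attrs(ds: xr.Dataset) -> xr.Dataset:
    """Return a shallow copy of `ds` without `_FillValue` in any variable's attrs.

    Hypothetical helper for illustration; it is not an xarray API.
    """
    out = ds.copy()  # shallow copy: array data is shared with `ds`
    for var in out.variables.values():
        # Assumption: the copy owns its attrs dicts, so `ds` stays untouched.
        var.attrs.pop("_FillValue", None)
    return out


ds = xr.Dataset({"a": ("x", [3.0], {"_FillValue": np.nan})})
clean = drop_fill_value_attrs(ds)
```

The first write would still use `ds.to_zarr(m)` as in the snippet above, and only the later write would use the cleaned copy, e.g. `clean.to_zarr(m, append_dim="x")`. As noted below, it is not yet clear whether the resulting store is fully correct.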
There are indications that the result might still be broken, but it’s not yet clear how to reproduce them (see comments below).
This issue has been split off from #6069
Environment
INSTALLED VERSIONS
commit: None
python: 3.9.10 (main, Jan 15 2022, 11:48:00) [Clang 13.0.0 (clang-1300.0.29.3)]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.20.1
pandas: 1.2.0
numpy: 1.21.2
scipy: 1.6.2
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.11.0
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.1
distributed: 2021.11.1
matplotlib: 3.4.1
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 60.5.0
pip: 21.3.1
conda: None
pytest: 6.2.2
IPython: 8.0.0.dev
sphinx: 3.5.0
About this issue
- State: open
- Created 2 years ago
- Comments: 17 (7 by maintainers)
Thanks for pointing out region again. I’ve updated the header and the initial comment.
Yes, this is kind of the behaviour I’d expect. And great that it helped clarify things. Still, building up the metadata nicely upfront (which is required for region writes) is quite convoluted… That’s what I meant with
in the previous comment. I think establishing and documenting good practices for this would help, but we probably also want better tools. In any case, this would probably be yet another issue.
Note that if you care about this particular example (e.g. appending in a single thread in increasing order of timesteps), then it should also be possible to do this much more simply using append:
If you find out more about the cloud case, please post a note; otherwise, can we assume that the original bug report is fine?