xarray: problem appending to zarr on GCS when using a JSON token

What happened: Appending a toy dataset to an existing zarr store in GCS along the time dimension leaves the store unchanged.

What you expected to happen: The store to double in length, because I was appending a dataset with a length of 3 along the time dimension to another dataset of the same size.

Minimal Complete Verifiable Example: Reproducing this exactly requires the token, but others may be able to reproduce it with their own token and bucket.

import fsspec
import xarray as xr
import json
import gcsfs  ## provides the gs:// backend for fsspec


## define a mapper to the ldeo-glaciology bucket
### needs a token
with open('../secrets/ldeo-glaciology-bc97b12df06b.json') as token_file:
    token = json.load(token_file)

mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test5', mode='w', token=token)

## define two simple datasets
ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': [1, 2, 3]})
ds1 = xr.Dataset({'temperature': (['time'],  [53, 54, 55])}, coords={'time':  [4, 5, 6]})

## write the first dataset to the bucket
ds0.to_zarr(mapper)
## append the second to the same zarr store
ds1.to_zarr(mapper, mode='a', append_dim='time')

## load the zarr store
ds_both = xr.open_zarr(mapper)

## this is 3 rather than the expected 6, indicating that the append did not work
len(ds_both.time)

Anything else we need to know?: It works as expected if you instead write and append to the pangeo scratch bucket, i.e. if you replace

with open('../secrets/ldeo-glaciology-bc97b12df06b.json') as token_file:
    token = json.load(token_file)

mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test5', mode='w', token=token)

with

mapper = fsspec.get_mapper('gs://pangeo-scratch/jkingslake/append_test/test3', mode='w', token=None)

It also works as expected if I write and append to a local zarr store.
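For reference, here is a minimal sketch of the local version that behaves correctly, using the same toy datasets (the store path append_test_local.zarr is just a placeholder):

import xarray as xr

## the same two toy datasets as above
ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': [1, 2, 3]})
ds1 = xr.Dataset({'temperature': (['time'],  [53, 54, 55])}, coords={'time': [4, 5, 6]})

## write the first dataset, then append the second to the same local store
ds0.to_zarr('append_test_local.zarr')
ds1.to_zarr('append_test_local.zarr', mode='a', append_dim='time')

## this is 6, as expected
len(xr.open_zarr('append_test_local.zarr').time)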

Thanks for your help!

Environment: https://us-central1-b.gcp.pangeo.io/

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:21:18) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.129+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.2
pandas: 1.2.1
numpy: 1.20.0
scipy: 1.6.0
netCDF4: 1.5.5.1
pydap: installed
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: 2.6.1
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.0
cfgrib: 0.9.8.5
iris: None
bottleneck: 1.3.2
dask: 2021.01.1
distributed: 2021.01.1
matplotlib: 3.3.4
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20210108
pip: 20.3.4
conda: None
pytest: None
IPython: 7.20.0
sphinx: 3.4.3


Most upvoted comments

I think this is not an issue with xarray, zarr, or anything else in the Python world, but rather an issue with how caching works on GCS public buckets: https://cloud.google.com/storage/docs/metadata

To test this, forget about xarray and zarr for a minute and just use gcsfs to list the bucket contents before and after your writes, as in the sketch below. I think you will find that the default cache lifetime of 3600 seconds means that you cannot “see” the changes to the bucket or the objects as quickly as needed in order for the append to work.
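A minimal sketch of that check, assuming the same token dict and bucket path as in the example above:

import gcsfs

## assumes `token` is the credentials dict loaded earlier
fs = gcsfs.GCSFileSystem(token=token)

## list the store contents before the append
print(fs.ls('ldeo-glaciology/append_test/test5'))

## ... perform the append with ds1.to_zarr(mapper, mode='a', append_dim='time') here ...

## drop gcsfs's own client-side listing cache, then list again;
## if the appended chunks still don't appear, the stale results are
## coming from GCS-side caching rather than from gcsfs
fs.invalidate_cache()
print(fs.ls('ldeo-glaciology/append_test/test5'))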