zarr-python: Problems faced while storing onto Zarr store using ABSStore

# Your code here

import zarr
from azure.storage.blob import BlockBlobService

store = zarr.ABSStore(container='zarrstoreall', prefix='zarrstoreall', account_name='xxxx', account_key='xxxx', blob_service_kwargs={'is_emulated': False})

# ds is the large xarray Dataset being written (its creation is not shown in this report)
compressor = zarr.Blosc(cname='zstd', clevel=3)
encoding = {vname: {'compressor': compressor} for vname in ds.data_vars}
ds.to_zarr(store=store, encoding=encoding, consolidated=True)

Problem description

I’m trying to use ABSStore to write a large xarray dataset to a Zarr store backed by Azure Blob Storage (see the code in the previous section). I am currently facing two issues:

  1. First, I get some sort of network error when loading “certain” variables into the store: [image: https://user-images.githubusercontent.com/164987/70712978-5c4f2280-1ce5-11ea-8fbe-cadffe2d20aa.png]

  2. After some time, I get this error: [image: https://user-images.githubusercontent.com/164987/70712591-6290cf00-1ce4-11ea-974c-df2615ea0a0a.png]

With relatively small xarray datasets I did not face these issues.

I appreciate your kind attention.

Version and installation information

Please provide the following:

  • Value of zarr.__version__ = ‘2.3.2’
  • Value of numcodecs.__version__ = ‘0.6.4’
  • Version of Python interpreter = Python 3.7.3
  • Operating system (Linux/Windows/Mac) = Databricks Runtime Version 6.1 (includes Apache Spark 2.4.4, Scala 2.11)
  • How Zarr was installed (e.g., “using pip into virtual environment”, or “using conda”) = !pip install zarr

Also, if you think it might be relevant, please provide the output from pip freeze or conda env export depending on which was used to install Zarr. pip freeze output: adal==1.2.2 asciitree==0.3.3 asn1crypto==0.24.0 azure==4.0.0 azure-applicationinsights==0.1.0 azure-batch==4.1.3 azure-common==1.1.23 azure-cosmosdb-nspkg==2.0.2 azure-cosmosdb-table==1.0.6 azure-datalake-store==0.0.48 azure-eventgrid==1.3.0 azure-graphrbac==0.40.0 azure-keyvault==1.1.0 azure-loganalytics==0.1.0 azure-mgmt==4.0.0 azure-mgmt-advisor==1.0.1 azure-mgmt-applicationinsights==0.1.1 azure-mgmt-authorization==0.50.0 azure-mgmt-batch==5.0.1 azure-mgmt-batchai==2.0.0 azure-mgmt-billing==0.2.0 azure-mgmt-cdn==3.1.0 azure-mgmt-cognitiveservices==3.0.0 azure-mgmt-commerce==1.0.1 azure-mgmt-compute==4.6.2 azure-mgmt-consumption==2.0.0 azure-mgmt-containerinstance==1.5.0 azure-mgmt-containerregistry==2.8.0 azure-mgmt-containerservice==4.4.0 azure-mgmt-cosmosdb==0.4.1 azure-mgmt-datafactory==0.6.0 azure-mgmt-datalake-analytics==0.6.0 azure-mgmt-datalake-nspkg==3.0.1 azure-mgmt-datalake-store==0.5.0 azure-mgmt-datamigration==1.0.0 azure-mgmt-devspaces==0.1.0 azure-mgmt-devtestlabs==2.2.0 azure-mgmt-dns==2.1.0 azure-mgmt-eventgrid==1.0.0 azure-mgmt-eventhub==2.6.0 azure-mgmt-hanaonazure==0.1.1 azure-mgmt-iotcentral==0.1.0 azure-mgmt-iothub==0.5.0 azure-mgmt-iothubprovisioningservices==0.2.0 azure-mgmt-keyvault==1.1.0 azure-mgmt-loganalytics==0.2.0 azure-mgmt-logic==3.0.0 azure-mgmt-machinelearningcompute==0.4.1 azure-mgmt-managementgroups==0.1.0 azure-mgmt-managementpartner==0.1.1 azure-mgmt-maps==0.1.0 azure-mgmt-marketplaceordering==0.1.0 azure-mgmt-media==1.0.0 azure-mgmt-monitor==0.5.2 azure-mgmt-msi==0.2.0 azure-mgmt-network==2.7.0 azure-mgmt-notificationhubs==2.1.0 azure-mgmt-nspkg==3.0.2 azure-mgmt-policyinsights==0.1.0 azure-mgmt-powerbiembedded==2.0.0 azure-mgmt-rdbms==1.9.0 azure-mgmt-recoveryservices==0.3.0 azure-mgmt-recoveryservicesbackup==0.3.0 azure-mgmt-redis==5.0.0 
azure-mgmt-relay==0.1.0 azure-mgmt-reservations==0.2.1 azure-mgmt-resource==2.2.0 azure-mgmt-scheduler==2.0.0 azure-mgmt-search==2.1.0 azure-mgmt-servicebus==0.5.3 azure-mgmt-servicefabric==0.2.0 azure-mgmt-signalr==0.1.1 azure-mgmt-sql==0.9.1 azure-mgmt-storage==2.0.0 azure-mgmt-subscription==0.2.0 azure-mgmt-trafficmanager==0.50.0 azure-mgmt-web==0.35.0 azure-nspkg==3.0.2 azure-servicebus==0.21.1 azure-servicefabric==6.3.0.0 azure-servicemanagement-legacy==0.20.6 azure-storage-blob==1.5.0 azure-storage-common==1.4.2 azure-storage-file==1.4.0 azure-storage-queue==1.4.0 backcall==0.1.0 boto==2.49.0 boto3==1.9.162 botocore==1.12.163 certifi==2019.3.9 cffi==1.12.2 cftime==1.0.4.2 chardet==3.0.4 cryptography==2.6.1 cycler==0.10.0 Cython==0.29.6 dask==2.9.0 decorator==4.4.0 docutils==0.14 fasteners==0.15 fsspec==0.6.1 idna==2.8 ipython==7.4.0 ipython-genutils==0.2.0 isodate==0.6.0 jedi==0.13.3 jmespath==0.9.4 kiwisolver==1.1.0 koalas==0.23.0 locket==0.2.0 matplotlib==3.0.3 monotonic==1.5 msrest==0.6.10 msrestazure==0.6.2 netCDF4==1.5.3 numcodecs==0.6.4 numpy==1.16.2 oauthlib==3.1.0 pandas==0.24.2 parso==0.3.4 partd==1.1.0 patsy==0.5.1 pexpect==4.6.0 pickleshare==0.7.5 prompt-toolkit==2.0.9 psycopg2==2.7.6.1 ptyprocess==0.6.0 pyarrow==0.13.0 pycparser==2.19 pycurl==7.43.0 Pygments==2.3.1 pygobject==3.20.0 PyJWT==1.7.1 pyOpenSSL==19.0.0 pyparsing==2.4.2 PySocks==1.6.8 python-apt==1.1.0b1+ubuntu0.16.4.5 python-dateutil==2.8.0 pytz==2018.9 requests==2.21.0 requests-oauthlib==1.3.0 s3transfer==0.2.1 scikit-learn==0.20.3 scipy==1.2.1 seaborn==0.9.0 six==1.12.0 ssh-import-id==5.5 statsmodels==0.9.0 toolz==0.10.0 traitlets==4.3.2 unattended-upgrades==0.1 urllib3==1.24.1 virtualenv==16.4.1 wcwidth==0.1.7 xarray==0.14.1 zarr==2.3.2

About this issue

  • Original URL: https://github.com/zarr-developers/zarr-python/issues/528
  • State: open
  • Created 5 years ago
  • Comments: 25 (18 by maintainers)

Most upvoted comments

I believe the first error is actually a warning; it occurs when Zarr looks for metadata files that do not exist. This has been fixed in newer versions of the Azure SDK. I would try upgrading azure-storage-blob to v2.1.

It’s worth noting that, while investigating this, I learned that there is a major new release of the Azure SDK that looks like it will break ABSStore entirely. We will probably need to figure out how to deal with this soon. It’s not obvious how to support two versions of the SDK that are essentially incompatible. I will probably open a new issue to work on this eventually.
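To see whether the upgrade suggested above applies to a given environment, a rough check could look like the following. This is only a sketch: the helper names (`parse_version`, `needs_upgrade`, `installed_version`) are made up for illustration, the parsing is deliberately simple and not a full version-spec parser, and the (2, 1) threshold is taken from the recommendation in the comment above.

```python
import re

# Minimum release suggested in the comment above (azure-storage-blob v2.1).
RECOMMENDED = (2, 1)


def parse_version(text):
    """Turn a string such as '1.5.0' into a tuple of ints, e.g. (1, 5, 0).

    Parsing stops at the first dot-separated piece that does not start
    with a digit, so this is a rough comparison aid only.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def needs_upgrade(installed, recommended=RECOMMENDED):
    """Return True if the installed version is older than the recommended one."""
    return parse_version(installed) < recommended


def installed_version(package="azure-storage-blob"):
    """Look up the installed version of a package, or None if it is not installed."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Older interpreters (such as the 3.7.3 in this report) can use setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# The pip freeze output in this report lists azure-storage-blob==1.5.0.
print(needs_upgrade("1.5.0"))
```

Note that this only checks the lower bound. Given the warning above about the incompatible major release of the SDK, pinning to the 2.1 series is probably safer than upgrading to whatever is newest.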

Re: getting NaN values

I think I might have found out why this happens, as I ran into this myself.

Zarr has a fill_value attribute, which it uses to fill in missing chunks.

Xarray treats this same attribute as the _FillValue attribute when decoding according to the CF conventions, which is quite different from filling in missing chunks.

@zarr-developers/core-devs Is this a correct interpretation? If so, where should this be fixed: in xarray or in zarr?

@dokooh I worked around this temporarily by passing mask_and_scale=False to xr.open_zarr.
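To make the interpretation above concrete, here is a small sketch. The first part is a pure-Python stand-in for CF-style masking (the `cf_decode` helper is invented for illustration and is not xarray's actual implementation); the second part wraps the `mask_and_scale=False` workaround in a hypothetical `open_raw` helper.

```python
import math


def cf_decode(values, fill_value):
    """Mimic CF-style masking: every entry equal to fill_value becomes NaN."""
    return [math.nan if v == fill_value else v for v in values]


# Suppose the real data legitimately contains 0.0 and the array's fill_value is 0.0.
stored = [0.0, 1.5, 0.0, 3.0]
decoded = cf_decode(stored, fill_value=0.0)
print(decoded)  # the genuine zeros are now indistinguishable from missing data


def open_raw(store):
    """Open a Zarr store with xarray while skipping the masking step.

    xarray is imported lazily so the illustration above runs without it.
    """
    import xarray as xr
    return xr.open_zarr(store, mask_and_scale=False)
```

Keep in mind that mask_and_scale=False also turns off scale_factor/add_offset decoding, so it is a workaround rather than a fix if the data relies on those attributes.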

Agree with @tjcrone here: the first warning goes away after updating to the newer version.

As for the second error, I have run into various errors while transferring large amounts of netCDF data to zarr: mostly out-of-memory errors (so it’s worth monitoring the memory of your device/VM while doing this), but also the one above. My solution was to transfer the data to zarr “in parts”. This is now easy with xarray’s new “append” feature for zarr: call ds.to_zarr with mode='a' and provide the dimension along which the data will be appended. See http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_zarr.html or https://github.com/pydata/xarray/pull/2706
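The "in parts" approach described above could be sketched roughly as follows. The helper names (`chunk_bounds`, `write_in_parts`), the block size and the dimension name are assumptions for illustration; the `to_zarr` arguments used (`mode`, `append_dim`, `encoding`) are the ones described in the xarray documentation linked above.

```python
def chunk_bounds(length, step):
    """Yield (start, stop) index pairs covering range(length) in blocks of `step`."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    for start in range(0, length, step):
        yield start, min(start + step, length)


def write_in_parts(ds, store, dim, step, encoding=None):
    """Write `ds` to `store` block by block along `dim`.

    Returns the number of blocks written. Note that the first write uses
    mode='w', which replaces anything already in the store.
    """
    written = 0
    for start, stop in chunk_bounds(ds.sizes[dim], step):
        part = ds.isel({dim: slice(start, stop)})
        if written == 0:
            # The first block creates the store; encoding (e.g. the compressor) is set here.
            part.to_zarr(store=store, mode="w", encoding=encoding)
        else:
            # Later blocks are appended along `dim`, reusing the encoding already in the store.
            part.to_zarr(store=store, mode="a", append_dim=dim)
        written += 1
    return written
```

With the store and encoding from the original report, a call might look like write_in_parts(ds, store, dim='time', step=100, encoding=encoding), where 'time' and 100 are placeholders. Since the blocks are written without consolidated=True, calling zarr.consolidate_metadata(store) once at the end should restore the consolidated metadata. Variables that do not have the append dimension may need separate handling.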

This is a bit of a guess, but are you sure all of the input netCDF files are there? The errors suggest that, while reading the netCDF input, something is requested that does not exist.
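A quick way to test this guess is to check the expected input paths before starting the conversion. The sketch below assumes the netCDF files are reachable as ordinary (local or mounted) file paths; `find_missing` is a made-up helper name and the glob pattern in the usage note is a placeholder.

```python
import os


def find_missing(paths):
    """Return the paths that are not existing, non-empty regular files."""
    missing = []
    for path in paths:
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(path)
    return missing
```

For example, find_missing(expected_paths) should return an empty list before the dataset is opened, where expected_paths is the list of files you believe should exist (building it from glob.glob('/path/to/input/*.nc') would only show files that are present, so the expected list needs to come from elsewhere).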

On Thu, 12 Dec 2019, 12:47 Nima Dokoohaki wrote: [quoted copy of the original issue]