xarray: Segfault writing large netcdf files to s3fs

What happened?

It seems that writing netCDF4 files through s3fs (the FUSE filesystem layer over S3-compatible storage) does not currently work: it fails with both the default netcdf4 engine and the h5netcdf engine.

Here is an example

import numpy as np
import xarray as xr
from datetime import datetime, timedelta

NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
# Timestamps stored as byte strings along the 'Time' dimension
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals], dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
# Writing directly to the s3fs FUSE mount is what triggers the segfault
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')

On my system this code crashes with NTIMES=48, but completes without an error with NTIMES=24.
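As a back-of-the-envelope sanity check on the sizes involved (this does not explain the crash, only locates the threshold between the working and crashing runs), the uncompressed `v1` payload is float64, i.e. 8 bytes per value. `v1_bytes` is a hypothetical helper, not part of the report:

```python
# Uncompressed size of the v1 array (float64 = 8 bytes per value).
def v1_bytes(ntimes, nx=201, ny=201, itemsize=8):
    """Bytes occupied by a (ntimes, nx, ny) float64 array."""
    return ntimes * nx * ny * itemsize

print(v1_bytes(24))  # 7756992 bytes, ~7.4 MiB
print(v1_bytes(48))  # 15513984 bytes, ~14.8 MiB
```

The NTIMES=24 figure is consistent with the ~7.8 MB file reported below, so the crash appears somewhere between roughly 8 and 15 MB of data written through the mount.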

The output with NTIMES=48 is

There are 1 HDF5 objects open!

Report: open objects on 72057594037927936
Segmentation fault (core dumped)

I have also tried the other engine that handles NETCDF4 in xarray (engine='h5netcdf') and got a segfault as well.

A quick workaround is to write the NetCDF file to the local filesystem first and then move the completed file to S3.

import shutil

ds.to_netcdf(path='/tmp/test_netcdf.nc', format='NETCDF4', mode='w')
shutil.move('/tmp/test_netcdf.nc', '/my_s3_fs/test_netcdf.nc')
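The workaround can be wrapped in a small helper so that only the final move ever touches the FUSE mount. This is a sketch using only the standard library; `write_then_move` is a hypothetical name, not an xarray API:

```python
import os
import shutil
import tempfile

def write_then_move(write_fn, dest_path):
    """Run write_fn against a local temporary path, then move the
    finished file to dest_path (e.g. on the s3fs mount)."""
    fd, tmp_path = tempfile.mkstemp(suffix=".nc")
    os.close(fd)
    try:
        write_fn(tmp_path)                # write entirely on local disk
        shutil.move(tmp_path, dest_path)  # single move onto the mount
    finally:
        if os.path.exists(tmp_path):      # clean up if write_fn raised
            os.remove(tmp_path)

# Usage with the dataset from the example above:
# write_then_move(lambda p: ds.to_netcdf(path=p, format='NETCDF4', mode='w'),
#                 '/my_s3_fs/test_netcdf.nc')
```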

There are several pieces of software involved here: xarray (0.16.1), netCDF4 (1.5.4), HDF5 (1.10.6), and s3fs (1.79). If the bug is not in my code, it is most likely in one of the underlying libraries rather than in xarray itself, but since it fails with both NETCDF4 engines, I decided to report it here.

What did you expect to happen?

With NTIMES=24 I get a file /my_s3_fs/test_netcdf.nc of about 7.8 MBytes. With NTIMES=36 I get an empty file. I would expect this code to run without a segfault and produce a non-empty file.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr
from datetime import datetime, timedelta

NTIMES = 48
start = datetime(2022, 10, 6, 0, 0)
time_vals = [start + timedelta(minutes=20 * t) for t in range(NTIMES)]
times = xr.DataArray(data=[t.strftime('%Y%m%d%H%M%S').encode() for t in time_vals], dims=['Time'])
v1 = xr.DataArray(data=np.zeros((len(times), 201, 201)), dims=['Time', 'x', 'y'])
ds = xr.Dataset(data_vars=dict(times=times, v1=v1))
ds.to_netcdf(path='/my_s3_fs/test_netcdf.nc', format='NETCDF4', mode='w')

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

There are 1 HDF5 objects open!

Report: open objects on 72057594037927936
Segmentation fault (core dumped)

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.3 | packaged by conda-forge | (default, Jun 1 2020, 17:43:00) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.3
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: 1.0.2
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: None
matplotlib: 3.3.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 50.3.0.post20201006
pip: 20.2.3
conda: 22.9.0
pytest: 6.1.1
IPython: 7.18.1
sphinx: None

About this issue

Most upvoted comments

Since we have eliminated xarray with this, you should be able to submit an issue to the h5py issue tracker, mentioning that this is probably a bug in libhdf5, since netcdf4 also fails with the same error (you can also link this issue for more information).

I had to change ints and floats to doubles to reproduce the issue.

import h5py

N_TIMES = 48
with h5py.File("/my_s3_fs/test.nc", mode="w") as f:
    time = f.create_dataset("time", (N_TIMES,), dtype="d")
    time[:] = 0

    d1 = f.create_dataset("d1", (N_TIMES, 201, 201), dtype="d")
    d1[:] = 0

Can confirm the issue with xarray 2022.6.0 and dask 2022.9.2, the latest versions available on conda-forge. The issue might be in the netcdf4 and hdf5 libraries; I will try updating those as well and report back.